github.com/go-corelibs/rxp

Version: 0.10.1

Repository: https://github.com/go-corelibs/rxp.git

Documentation: pkg.go.dev

# README

rxp

rxp is an experiment in doing regexp-like things, without actually using regexp to do any of the work.

For most use cases, the regexp package is likely the correct choice: it is fairly optimized and uses the familiar regular expression byte/string patterns to compile, match and replace text.

rxp, by contrast, doesn't really have a compilation phase. It is simply the declaration of a Pattern, which is really just a slice of Matcher functions; to do the neat things one needs to do with regular expressions, simply use the methods on the Pattern list.

Notice

This is the v0.10.x series. It works, but likely not exactly as one would expect. For example, the greediness of things is incorrect; however, there are always ways to write the patterns differently such that the greediness issue is irrelevant.

Please do not blindly use this project without at least writing specific unit tests for all Patterns and methods required.

There are no safeguards against footguns and other such pitfalls.

Installation

> go get github.com/go-corelibs/rxp@latest

Examples

Find all words at the start of any line of input

// regexp version:
m := regexp.
    MustCompile(`(?m)^\s*(\w+)\b`).
    FindAllStringSubmatch(input, -1)

// equivalent rxp version
m := rxp.Pattern{}.
    Caret("m").S("*").W("+", "c").B().
    FindAllStringSubmatch(input, -1)
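For reference, the regexp version can be run as a complete program, as sketched below. It assumes the intended flag group is (?m) (multiline), and the firstWords helper and the sample input are hypothetical; only the standard library is used.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstWordRx uses the multiline flag (?m) so that ^ matches at the start of
// every line, not only at the start of the text.
var firstWordRx = regexp.MustCompile(`(?m)^\s*(\w+)\b`)

// firstWords returns the captured first word of each line of input.
func firstWords(input string) []string {
	var words []string
	for _, m := range firstWordRx.FindAllStringSubmatch(input, -1) {
		words = append(words, m[1]) // m[0] is the whole match, m[1] the capture
	}
	return words
}

func main() {
	fmt.Println(firstWords("alpha beta\n  gamma delta\nepsilon"))
	// prints: [alpha gamma epsilon]
}
```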

Perform a series of text transformations

For whatever reason, some text needs to be transformed and these transformations must satisfy four requirements: lowercase everything, collapse consecutive spaces into one space, turn single quotes into underscores, and remove everything that is not alphanumeric, an underscore or a space.

These requirements can be explored with the traditional Perl substitution syntax, as in the following table:

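| # | Perl Expression        | Description               |
|---|------------------------|---------------------------|
| 1 | s/([A-Z]+)/\L${1}\E/mg | lowercase all letters     |
| 2 | s/\s+/ /mg             | collapse all spaces       |
| 3 | s/[']/_/mg             | single quotes to underscores |
| 4 | s/[^\w\s]+//mg         | delete non-word-or-spaces |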

The result of the above should take "Isn't  this  neat?" (note the doubled spaces) and transform it into "isn_t this neat".

// using regexp:
output := strings.ToLower(`Isn't  this  neat?`)
output = regexp.MustCompile(`\s+`).ReplaceAllString(output, " ")
output = regexp.MustCompile(`[']`).ReplaceAllString(output, "_")
output = regexp.MustCompile(`[^\w ]`).ReplaceAllString(output, "")
// output == "isn_t this neat"

// using rxp:
output := rxp.Pipeline{}.
	Transform(strings.ToLower).
	Literal(rxp.S("+"), " ").
	Literal(rxp.Text("'"), "_").
	Literal(rxp.Not(rxp.W(), rxp.S(), "c"), "").
	Process(`Isn't  this  neat?`)
// output == "isn_t this neat"
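The pipeline idea itself, a list of stages where each stage receives the previous stage's output, can be sketched with the standard library alone. The stage and pipeline types and the replaceAll and readmePipeline helpers below are hypothetical names for this illustration and make no claim about how rxp.Pipeline is implemented.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stage is a hypothetical stand-in for one pipeline step.
type stage func(string) string

// pipeline applies its stages in order, feeding each output to the next stage.
type pipeline []stage

func (p pipeline) process(input string) string {
	for _, s := range p {
		input = s(input)
	}
	return input
}

// replaceAll wraps a regexp replacement as a stage; the expression is
// compiled once, when the stage is created.
func replaceAll(expr, repl string) stage {
	rx := regexp.MustCompile(expr)
	return func(s string) string { return rx.ReplaceAllString(s, repl) }
}

// readmePipeline builds the four transformations described above.
func readmePipeline() pipeline {
	return pipeline{
		strings.ToLower,
		replaceAll(`\s+`, " "),
		replaceAll(`'`, "_"),
		replaceAll(`[^\w ]`, ""),
	}
}

func main() {
	fmt.Println(readmePipeline().process("Isn't  this  neat?"))
	// prints: isn_t this neat
}
```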

Benchmarks

These benchmarks can be regenerated using make benchmark.

Historical (make benchstats-historical)

Given that performance is basically the entire point of the rxp package, here are some benchmark statistics showing the evolution of the rxp package itself from v0.1.0 to the current v0.10.0. Each of these releases is present in a separate pre-release branch so that curious developers can easily study the progression of this initial development cycle.

goos: linux
goarch: arm64
pkg: github.com/go-corelibs/rxp
                     │     v0.1.0      │                 v0.2.0                  │                 v0.4.0                  │                 v0.8.0                  │                 v0.10.0                 │
                     │     sec/op      │     sec/op       vs base                │     sec/op       vs base                │     sec/op       vs base                │     sec/op       vs base                │
_FindAllString_Rxp      0.004292n ± 0%    0.003496n ± 1%  -18.56% (n=50)            0.002005n ± 0%  -53.30% (n=50)            0.001868n ± 1%  -56.48% (n=50)            0.002162n ± 1%  -49.62% (n=50)
_Pipeline_Combo_Rxp    0.0002862n ± 1%   0.0002866n ± 1%        ~ (p=0.920 n=50)   0.0002945n ± 3%   +2.86% (p=0.037 n=50)   0.0002910n ± 2%   +1.66% (p=0.010 n=50)   0.0002858n ± 1%        ~ (p=0.348 n=50)
_Pipeline_Readme_Rxp    0.047985n ± 0%    0.053185n ± 1%  +10.84% (p=0.000 n=50)    0.010055n ± 0%  -79.05% (n=50)            0.006152n ± 0%  -87.18% (n=50)            0.007117n ± 1%  -85.17% (n=50)
_Replace_ToUpper_Rxp     0.07639n ± 1%     0.08496n ± 1%  +11.22% (p=0.000 n=50)     0.03293n ± 0%  -56.90% (n=50)             0.02835n ± 0%  -62.89% (n=50)             0.02339n ± 1%  -69.38% (n=50)
geomean                 0.008192n         0.008203n        +0.13%                   0.003739n       -54.36%                   0.003120n       -61.91%                   0.003185n       -61.12%

Versus Regexp (make benchstats-regexp)

These benchmarks are loosely comparing regexp with rxp in "as similar as can be done" cases. While rxp seems to outperform regexp, note that a poorly crafted rxp Pattern can easily tank performance, as in the pipeline readme case below.

goos: linux
goarch: arm64
pkg: github.com/go-corelibs/rxp
                 │     regexp      │                   rxp                   │
                 │     sec/op      │     sec/op       vs base                │
_FindAllString      0.005376n ± 0%    0.002162n ± 1%  -59.78% (n=50)
_Pipeline_Combo    0.0028000n ± 1%   0.0002858n ± 1%  -89.79% (n=50)
_Pipeline_Readme    0.005760n ± 1%    0.007117n ± 1%  +23.56% (p=0.000 n=50)
_Replace_ToUpper     0.02497n ± 0%     0.02339n ± 1%   -6.29% (p=0.000 n=50)
geomean             0.006821n         0.003185n       -53.31%

Go-CoreLibs

Go-CoreLibs is a repository of shared code between the Go-Curses and Go-Enjin projects.

License

Copyright 2024 The Go-CoreLibs Authors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Functions

A

A creates a Matcher equivalent to the regexp [\A].

Alnum

Alnum creates a Matcher equivalent to [:alnum:].

Alpha

Alpha creates a Matcher equivalent to [:alpha:].

Ascii

Ascii creates a Matcher equivalent to [:ascii:].

B

B creates a Matcher equivalent to the regexp [\b].

BackRef

BackRef is a Matcher equivalent to Perl backreferences, where the gid argument is the match group to use. BackRef will panic if the gid argument is less than one.
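Go's regexp package (RE2 syntax) does not support backreferences, so with the standard library alone the effect has to be emulated. The following sketch approximates the Perl pattern \b(\w+) \1\b, which finds a doubled word, by capturing each word and comparing the next one against it. The findDoubledWord helper is hypothetical and is unrelated to how BackRef is implemented.

```go
package main

import (
	"fmt"
	"regexp"
)

var wordRx = regexp.MustCompile(`\w+`)

// findDoubledWord returns the first word that is immediately repeated after
// a single space, emulating a backreference to the first capture group.
func findDoubledWord(s string) (string, bool) {
	locs := wordRx.FindAllStringIndex(s, -1)
	for i := 0; i+1 < len(locs); i++ {
		cur, next := locs[i], locs[i+1]
		word := s[cur[0]:cur[1]]
		// the "backreference": the next word must repeat the captured text
		// and be separated from it by exactly one space
		if s[cur[1]:next[0]] == " " && s[next[0]:next[1]] == word {
			return word, true
		}
	}
	return "", false
}

func main() {
	fmt.Println(findDoubledWord("this is is a test")) // is true
}
```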

Blank

Blank creates a Matcher equivalent to [:blank:].

Caret

Caret creates a Matcher equivalent to the regexp caret [^].

Cntrl

Cntrl creates a Matcher equivalent to [:cntrl:].

D

D creates a Matcher equivalent to the regexp \d.

Digit

Digit creates a Matcher equivalent to [:digit:].

Dollar

Dollar creates a Matcher equivalent to the regexp [$].

Dot

Dot creates a Matcher equivalent to the regexp dot (.).

Graph

Graph creates a Matcher equivalent to [:graph:].

Group

Group processes the list of Matcher instances, in the order they were given, and stops at the first one that does not match, discarding any consumed runes.

IsAtLeastSixDigits

IsAtLeastSixDigits creates a Matcher equivalent to: (?:\A[0-9]{6,}\z).

IsFieldKey

IsFieldKey creates a Matcher equivalent to: (?:\b[a-zA-Z][-_a-zA-Z0-9]+?[a-zA-Z0-9]\b). IsFieldKey is intended to validate CSS and HTML attribute key names such as "data-thing" or "some_value".

IsFieldWord

IsFieldWord creates a Matcher equivalent to: (?:\b[a-zA-Z0-9]+?[-_a-zA-Z0-9']*[a-zA-Z0-9]+\b|\b[a-zA-Z0-9]+\b).

IsHash10

IsHash10 creates a Matcher equivalent to: (?:[a-fA-F0-9]{10}).

IsKeyword

IsKeyword is intended for Go-Enjin parsing of simple search keywords from user input and creates a Matcher equivalent to: (?:\b[-+]?[a-zA-Z][-_a-zA-Z0-9']+?[a-zA-Z0-9]\b).

IsUnicodeRange

IsUnicodeRange creates a Matcher equivalent to the regexp \pN, where N is a unicode character class, passed to IsUnicodeRange as a unicode.RangeTable instance. For example, creating a Matcher for a single braille character: IsUnicodeRange(unicode.Braille).

IsUUID

IsUUID creates a Matcher equivalent to: (?:[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}).
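As an illustration of checking this shape without the regexp package, a hand-written check might look like the sketch below. The isUUID and isHex helpers are hypothetical names, not part of rxp, and unlike the unanchored expression above they assume the whole string must be a UUID.

```go
package main

import "fmt"

// isHex reports whether b is an ASCII hexadecimal digit.
func isHex(b byte) bool {
	return (b >= '0' && b <= '9') || (b >= 'a' && b <= 'f') || (b >= 'A' && b <= 'F')
}

// isUUID reports whether s has the 8-4-4-4-12 hexadecimal layout.
func isUUID(s string) bool {
	if len(s) != 36 {
		return false
	}
	for i := 0; i < len(s); i++ {
		switch i {
		case 8, 13, 18, 23: // positions of the four dashes
			if s[i] != '-' {
				return false
			}
		default:
			if !isHex(s[i]) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(isUUID("123e4567-e89b-12d3-a456-426614174000")) // true
	fmt.Println(isUUID("123e4567e89b12d3a456426614174000"))     // false
}
```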

Lower

Lower creates a Matcher equivalent to [:lower:].

MakeMatcher

MakeMatcher creates a rxp standard Matcher implementation wrapped around a given RuneMatcher.

NamedClass

NamedClass creates a Matcher equivalent to the regexp [:AsciiNames:]; see the AsciiNames constants for the list of supported ASCII class names. NamedClass will panic if given an invalid class name.

NewInputReader

NewInputReader creates a new InputReader instance for the given input string.

Not

Not processes all the matchers given, in the order they were given, stopping at the first one that succeeds and inverts the proceed return value.

Not is equivalent to a negated character class in traditional regular expressions. For example, [^xyza-f] could be implemented as any of the following:

    // slower due to four different matchers being present
    Not(Text("x"),Text("y"),Text("z"),R("a-f"))
    // better but still has two matchers
    Not(R("xyza-f"))
    // no significant difference from the previous
    Or(R("xyza-f"), "^") //< negation (^) flag
    // simplified to just one matcher present
    R("xyza-f", "^") //< negation (^) flag

Here's the interesting bit about rxp though: if speed is really the goal, then the following would capture single characters matching [^xyza-f] with a significant performance gain over MakeMatcher based matchers (use Pattern.Add to include the custom Matcher):

    func(scope Flags, reps Reps, input *InputReader, index int, sm [][2]int) (scoped Flags, consumed int, proceed bool) {
        scoped = scope
        if r, size, ok := input.Get(index); ok {
            // test for [xyza-f]
            proceed = (r >= 'x' && r <= 'z') || (r >= 'a' && r <= 'f')
            // and invert the result
            proceed = !proceed
            if proceed { // true means the negation is a match
                // MatchedFlag is required, CaptureFlag optionally if needed
                scoped |= MatchedFlag | CaptureFlag
                // consume this rune's size if a capture group is needed
                // using size instead of just 1 will allow support for
                // accurate []rune, []byte and string processing
                consumed += size
            }
        }
        return
    }

Or

Or processes the list of Matcher instances, in the order they were given, and stops at the first one that returns a true next. Or accepts Pattern, Matcher and string types and will panic on all others.

ParseFlags

ParseFlags parses a regexp-like option string into a Flags instance and two integers, the low and high range of repetitions.

| Flags   | Description                                                                             |
|---------|-----------------------------------------------------------------------------------------|
| ^       | Invert the meaning of this match group                                                  |
| m       | Multiline mode: Caret and Dollar match begin/end of line in addition to begin/end text  |
| s       | DotNL allows Dot to match newlines (\n)                                                 |
| i       | AnyCase is case-insensitive matching of unicode text                                    |
| c       | Capture allows this Matcher to be included in Pattern substring results                 |
| *       | zero or more repetitions, prefer more                                                   |
| +       | one or more repetitions, prefer more                                                    |
| ?       | zero or one repetition, prefer one                                                      |
| {l,h}   | range of repetitions, l minimum and up to h maximum, prefer more                        |
| {l,}    | range of repetitions, l minimum, prefer more                                            |
| {l}     | range of repetitions, l minimum, prefer more                                            |
| *?      | zero or more repetitions, prefer less                                                   |
| +?      | one or more repetitions, prefer less                                                    |
| ??      | zero or one repetition, prefer zero                                                     |
| {l,h}?  | range of repetitions, l minimum and up to h maximum, prefer less                        |
| {l,}?   | range of repetitions, l minimum, prefer less                                            |
| {l}?    | range of repetitions, l minimum, prefer less                                            |

The flags presented above can be combined into a single string argument, or can be individually given to ParseFlags. Any parsing errors will result in a runtime panic.
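To give a feel for what parsing the repetition part of such an option string involves, here is a rough standard-library sketch. The parseReps function is hypothetical and is not ParseFlags: it handles a single quantifier only, returns an error instead of panicking, uses -1 for "no upper bound", and follows the usual regexp convention that {l} means exactly l, which may differ from what rxp does. Validation is deliberately minimal.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseReps parses one quantifier such as "*", "+?", "{2,5}" or "{3,}?" into
// a minimum, a maximum (-1 meaning unbounded) and a prefer-less flag.
func parseReps(q string) (minReps, maxReps int, lazy bool, err error) {
	// a trailing "?" after another quantifier means "prefer less"
	if len(q) > 1 && strings.HasSuffix(q, "?") {
		lazy, q = true, q[:len(q)-1]
	}
	switch q {
	case "*":
		return 0, -1, lazy, nil
	case "+":
		return 1, -1, lazy, nil
	case "?":
		return 0, 1, lazy, nil
	}
	if len(q) < 2 || !strings.HasPrefix(q, "{") || !strings.HasSuffix(q, "}") {
		return 0, 0, false, fmt.Errorf("invalid quantifier %q", q)
	}
	lo, hi, ranged := strings.Cut(q[1:len(q)-1], ",")
	if minReps, err = strconv.Atoi(lo); err != nil {
		return 0, 0, false, err
	}
	switch {
	case !ranged: // {l}: assumed here to mean exactly l
		maxReps = minReps
	case hi == "": // {l,}: no upper bound
		maxReps = -1
	default: // {l,h}
		if maxReps, err = strconv.Atoi(hi); err != nil {
			return 0, 0, false, err
		}
	}
	return minReps, maxReps, lazy, nil
}

func main() {
	fmt.Println(parseReps("{2,5}?")) // 2 5 true <nil>
	fmt.Println(parseReps("+"))      // 1 -1 false <nil>
}
```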

ParseOptions

ParseOptions accepts Pattern, Matcher and string options and recasts them into their specific types. ParseOptions will panic with any type other than Pattern, Matcher or string.

Print

Print creates a Matcher equivalent to [:print:].

Punct

Punct creates a Matcher equivalent to [:punct:].

R

R creates a Matcher equivalent to regexp character class ranges such as [xyza-f], where x, y and z are individual runes to accept and a-f is the inclusive range of letters from lowercase a to lowercase f to accept. Note: do not include the [] brackets unless the intent is to actually accept those characters.
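One way to picture how a range specification like "xyza-f" turns into a rune test, with an optional negation as with the ^ flag, is the sketch below. classFromSpec is a hypothetical helper written for this illustration only; it ignores escapes and is not how R is implemented.

```go
package main

import "fmt"

// classFromSpec parses a character-class body such as "xyza-f" (without the
// [] brackets) and returns a rune test; negate inverts the result.
func classFromSpec(spec string, negate bool) func(rune) bool {
	type span struct{ lo, hi rune }
	var spans []span
	runes := []rune(spec)
	for i := 0; i < len(runes); i++ {
		lo, hi := runes[i], runes[i]
		// "a-f" form: a dash with a rune on both sides denotes a range
		if i+2 < len(runes) && runes[i+1] == '-' {
			hi = runes[i+2]
			i += 2
		}
		spans = append(spans, span{lo, hi})
	}
	return func(r rune) bool {
		for _, s := range spans {
			if r >= s.lo && r <= s.hi {
				return !negate
			}
		}
		return negate
	}
}

func main() {
	in := classFromSpec("xyza-f", false)
	out := classFromSpec("xyza-f", true)
	fmt.Println(in('c'), in('g'))   // true false
	fmt.Println(out('c'), out('g')) // false true
}
```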

RuneIsALNUM

RuneIsALNUM returns true for alphanumeric characters [a-zA-Z0-9].

RuneIsALPHA

RuneIsALPHA returns true for alphabetic characters [a-zA-Z].

RuneIsASCII

RuneIsASCII returns true for valid ASCII characters [\x00-\x7F].

RuneIsBLANK

RuneIsBLANK returns true for tab and space characters [\t ].

RuneIsCNTRL

RuneIsCNTRL returns true for control characters [\x00-\x1F\x7F].

RuneIsDIGIT

RuneIsDIGIT returns true for number digits [0-9].

RuneIsGRAPH

RuneIsGRAPH returns true for graphical characters [a-zA-Z0-9!"$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]. Note: upon the first use of RuneIsGRAPH, a lookup map is cached in a global variable and used for detecting the specific runes supported by the regexp [:graph:] class.
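The lazily cached lookup described in the note could be sketched with sync.Once as below. This is an assumption about the general technique rather than the package's actual code; isGraph, graphOnce and graphLookup are hypothetical names, and the set is taken to be ASCII '!' through '~', which is how the Go regexp syntax documentation defines [[:graph:]].

```go
package main

import (
	"fmt"
	"sync"
)

var (
	graphOnce   sync.Once
	graphLookup map[rune]struct{}
)

// isGraph reports whether r is a visible ASCII character ('!' through '~').
// The lookup map is built on first use only and reused afterwards.
func isGraph(r rune) bool {
	graphOnce.Do(func() {
		graphLookup = make(map[rune]struct{}, 94)
		for c := '!'; c <= '~'; c++ {
			graphLookup[c] = struct{}{}
		}
	})
	_, ok := graphLookup[r]
	return ok
}

func main() {
	fmt.Println(isGraph('A'), isGraph('~'), isGraph(' ')) // true true false
}
```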

RuneIsLOWER

RuneIsLOWER returns true for lowercase alphabetic characters [a-z].

RuneIsPRINT

RuneIsPRINT returns true for space and RuneIsGRAPH characters [ [:graph:]]. Note: uses RuneIsGRAPH.

RuneIsPUNCT

RuneIsPUNCT returns true for punctuation characters [!-/:-@[-`{-~].
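A class like this reduces to a handful of range comparisons. The sketch below is one way to write it; isPunct and stripPunct are hypothetical names for this illustration and are not the package's functions.

```go
package main

import (
	"fmt"
	"strings"
)

// isPunct tests the four ASCII ranges of [!-/:-@[-`{-~].
func isPunct(r rune) bool {
	return (r >= '!' && r <= '/') ||
		(r >= ':' && r <= '@') ||
		(r >= '[' && r <= '`') ||
		(r >= '{' && r <= '~')
}

// stripPunct removes every punctuation rune from s.
func stripPunct(s string) string {
	return strings.Map(func(r rune) rune {
		if isPunct(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	fmt.Println(stripPunct("Hello, world!")) // Hello world
}
```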

RuneIsSpace

RuneIsSpace returns true for space characters [\t\n\f\r ].

RuneIsSPACE

RuneIsSPACE returns true for empty space characters [\t\n\v\f\r ].

RuneIsUPPER

RuneIsUPPER returns true for uppercase alphabetic characters [A-Z].

RuneIsWord

RuneIsWord returns true for word characters [_a-zA-Z0-9].

RuneIsXDIGIT

RuneIsXDIGIT returns true for hexadecimal digits [a-fA-F0-9].

S

S creates a Matcher equivalent to the regexp \s.

Space

Space creates a Matcher equivalent to [:space:].

Text

Text creates a Matcher for the plain text given.

Upper

Upper creates a Matcher equivalent to [:upper:].

W

W creates a Matcher equivalent to the regexp \w.

Word

Word creates a Matcher equivalent to [:word:].

WrapMatcher

WrapMatcher creates a Matcher using MakeMatcher and wrapping a RuneMatcher.

Xdigit

Xdigit creates a Matcher equivalent to [:xdigit:].

Z

Z is a Matcher equivalent to the regexp [\z].

# Constants

ALNUM

No description provided by the author

ALPHA

No description provided by the author

AnyCaseFlag

No description provided by the author

ASCII

No description provided by the author

BLANK

No description provided by the author

CaptureFlag

No description provided by the author

CNTRL

No description provided by the author

DefaultFlags

No description provided by the author

DefaultMaxReps

No description provided by the author

DefaultMinReps

No description provided by the author

DIGIT

No description provided by the author

DotNewlineFlag

No description provided by the author

GRAPH

No description provided by the author

LessFlag

No description provided by the author

LOWER

No description provided by the author

MatchedFlag

No description provided by the author

MultilineFlag

No description provided by the author

NegatedFlag

No description provided by the author

OneOrMoreFlag

No description provided by the author

PUNCT

No description provided by the author

SPACE

No description provided by the author

UPPER

No description provided by the author

WORD

No description provided by the author

XDIGIT

No description provided by the author

ZeroOrMoreFlag

No description provided by the author

ZeroOrOneFlag

No description provided by the author

# Variables

LookupAsciiClass

No description provided by the author

# Structs

InputReader

InputReader is an efficient rune based buffer.
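A rune-indexed buffer with a Get-style accessor can be sketched as follows. The runeBuffer type and its get method are hypothetical simplifications, not the InputReader implementation, and the sketch assumes the returned size is the rune's UTF-8 encoded length in bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeBuffer is a hypothetical, simplified rune-indexed view of a string.
type runeBuffer struct {
	runes []rune
}

// newRuneBuffer decodes s into runes once, so later lookups are by index.
func newRuneBuffer(s string) *runeBuffer {
	return &runeBuffer{runes: []rune(s)}
}

// get returns the rune at index, its UTF-8 encoded size in bytes, and
// whether index was within range.
func (b *runeBuffer) get(index int) (r rune, size int, ok bool) {
	if index < 0 || index >= len(b.runes) {
		return 0, 0, false
	}
	r = b.runes[index]
	return r, utf8.RuneLen(r), true
}

func main() {
	b := newRuneBuffer("h\u00e9llo")
	r, size, ok := b.get(1)
	fmt.Printf("%c %d %t\n", r, size, ok) // é 2 true
}
```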

Stage

Stage is one phase of a text replacement Pipeline and receives an input string from the previous stage (or the initial input text) and returns the output provided to the next stage (or is finally returned to the caller).

# Type aliases

AsciiNames

No description provided by the author

Flags

No description provided by the author

Matcher

Matcher is a single string matching function.

| Argument | Description                        |
|----------|------------------------------------|
| scope    | current Flags for this iteration   |
| reps     | min and max repetition settings    |
| input    | input rune slice (do not modify!)  |
| index    | current input rune index to match  |

| Return   | Description                        |
|----------|------------------------------------|
| scoped   | possibly modified sub-match scope  |
| consumed | number of runes matched from index |
| proceed  | success, keep matching for more    |

Pattern

Pattern is a list of Matcher functions, all of which must match, in the order present, for the Pattern to be considered a match.

Pipeline

Pipeline is a list of stages for transforming strings in a single procedure.

Replace

Replace is a Replacer pipeline.

Replacer

Replacer is a Replace processor function. The captured argument is the result of the Pattern match process and is composed of the entire matched text as the first item in the captured list, and any Pattern capture groups following. The modified string argument is the output of the previous Replacer in the Replace process, or the original matched input text if this is the first Replacer in the process.

Reps

No description provided by the author

RuneMatcher

RuneMatcher is the signature for the basic character matching functions such as RuneIsWord. Implementations are expected to operate using the fewest CPU instructions possible.

Transform

Transform is the function signature for non-rxp string transformation pipeline stages.