github.com/go-corelibs/rxp (module, package)
Version: 0.10.1
Repository: https://github.com/go-corelibs/rxp.git
Documentation: pkg.go.dev

# README


rxp

rxp is an experiment in doing regexp-like things, without actually using regexp to do any of the work.

For most use cases, the regexp package is likely the correct choice: it is fairly optimized and uses the familiar regular expression byte/string patterns, which are compiled and then used to match and replace text.

rxp, by contrast, doesn't really have a compilation phase. Instead, one simply declares a Pattern, which is really just a slice of Matcher functions, and the neat things one needs to do with regular expressions are done with the methods on that Pattern.
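The idea of "a pattern is just a slice of matcher functions" can be sketched with the standard library alone. The types and names below (matcher, pattern, oneOrMore, lettersThenDigits) are hypothetical simplifications for illustration and are not the rxp API, whose Matcher signature carries flags, repetitions and sub-match data as well.

```go
package main

import (
	"fmt"
	"unicode"
)

// matcher is a hypothetical, simplified matcher: given the input runes and
// an index, it reports how many runes it consumed and whether it matched.
type matcher func(input []rune, index int) (consumed int, ok bool)

// pattern is an ordered list of matchers that must all succeed in sequence.
type pattern []matcher

// oneOrMore builds a matcher that consumes a run of at least one rune
// accepted by the given predicate.
func oneOrMore(accept func(rune) bool) matcher {
	return func(input []rune, index int) (int, bool) {
		n := 0
		for index+n < len(input) && accept(input[index+n]) {
			n++
		}
		return n, n > 0
	}
}

// matchAt runs every matcher in order starting at index and returns the end
// index of the match, or false as soon as any matcher fails.
func (p pattern) matchAt(input []rune, index int) (end int, ok bool) {
	end = index
	for _, m := range p {
		n, matched := m(input, end)
		if !matched {
			return index, false
		}
		end += n
	}
	return end, true
}

// findString returns the leftmost match of the pattern in s.
func (p pattern) findString(s string) (string, bool) {
	input := []rune(s)
	for start := range input {
		if end, ok := p.matchAt(input, start); ok {
			return string(input[start:end]), true
		}
	}
	return "", false
}

// lettersThenDigits is an illustrative pattern: letters followed by digits.
func lettersThenDigits() pattern {
	return pattern{oneOrMore(unicode.IsLetter), oneOrMore(unicode.IsDigit)}
}

func main() {
	match, ok := lettersThenDigits().findString("see item abc123 here")
	fmt.Println(match, ok) // prints: abc123 true
}
```

There is nothing to compile in this arrangement: building the slice is the whole setup step, and matching is a plain loop over function calls.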

Notice

This is the v0.10.x series. It works, but likely not exactly as one would expect. For example, the greediness of things is incorrect; however, there are always ways to write the patterns differently such that the greediness issue is irrelevant.

Please do not blindly use this project without at least writing specific unit tests for all Patterns and methods required.

There are no safeguards against footguns and other such pitfalls.

Installation

> go get github.com/go-corelibs/rxp@latest

Examples

Find all words at the start of any line of input

// regexp version:
m := regexp.
    MustCompile(`(?m)^\s*(\w+)\b`).
    FindAllStringSubmatch(input, -1)

// equivalent rxp version
m := rxp.Pattern{}.
    Caret("m").S("*").W("+", "c").B().
    FindAllStringSubmatch(input, -1)
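For reference, the regexp variant can be run as a small self-contained program. This is a sketch using only the standard library; the helper name lineStartWords and the sample input are illustrative. Note that the multiline flag is written (?m) in Go's regexp syntax, whereas (?:m) would be a non-capturing group matching a literal "m".

```go
package main

import (
	"fmt"
	"regexp"
)

// lineStartWord matches optional leading whitespace followed by one word at
// the start of a line; (?m) makes ^ match at every line start.
var lineStartWord = regexp.MustCompile(`(?m)^\s*(\w+)\b`)

// lineStartWords returns the first word of each line of input.
func lineStartWords(input string) []string {
	var words []string
	for _, m := range lineStartWord.FindAllStringSubmatch(input, -1) {
		words = append(words, m[1]) // m[0] is the whole match, m[1] the word
	}
	return words
}

func main() {
	fmt.Println(lineStartWords("hello world\n  foo bar\nbaz"))
	// prints: [hello foo baz]
}
```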

Perform a series of text transformations

For whatever reason, some text needs to be transformed and these transformations must satisfy four requirements: lowercase everything, collapse consecutive spaces into one space, turn single quotes into underscores, and remove every character that is not alphanumeric, an underscore or a space.

These requirements can be explored with the traditional Perl substitution syntax, as in the following table:

| # | Perl Expression | Description |
|---|-----------------|-------------|
| 1 | s/([A-Z]+)/\L$1\E/mg | lowercase all letters |
| 2 | s/\s+/ /mg | collapse all spaces |
| 3 | s/[']/_/mg | single quotes to underscores |
| 4 | s/[^\w\s]+//mg | delete non-word-or-spaces |

The result of the above should take the input "Isn't  this  neat?" (note the doubled spaces) and transform it into "isn_t this neat".

// using regexp:
output := strings.ToLower(`Isn't  this  neat?`)
output = regexp.MustCompile(`\s+`).ReplaceAllString(output, " ")
output = regexp.MustCompile(`[']`).ReplaceAllString(output, "_")
output = regexp.MustCompile(`[^\w ]`).ReplaceAllString(output, "")
// output == "isn_t this neat"

// using rxp:
output := rxp.Pipeline{}.
	Transform(strings.ToLower).
	Literal(rxp.S("+"), " ").
	Literal(rxp.Text("'"), "_").
	Literal(rxp.Not(rxp.W(), rxp.S(), "c"), "").
	Process(`Isn't  this  neat?`)
// output == "isn_t this neat"
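The pipeline idea itself, a list of string-to-string stages applied in order, can be sketched without rxp. The names stage, replaceAll, process and slugify below are hypothetical and only illustrate the shape of the approach using the standard library; they say nothing about how rxp.Pipeline is implemented.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stage is a hypothetical name for one string-to-string step.
type stage func(string) string

// replaceAll adapts a regular expression and a replacement into a stage.
func replaceAll(expr, repl string) stage {
	re := regexp.MustCompile(expr)
	return func(s string) string { return re.ReplaceAllString(s, repl) }
}

// process feeds the input through each stage, in order.
func process(input string, stages ...stage) string {
	for _, st := range stages {
		input = st(input)
	}
	return input
}

// slugify applies the four transformations described above.
// The expressions are compiled on every call to keep the sketch short.
func slugify(input string) string {
	return process(input,
		strings.ToLower,          // lowercase everything
		replaceAll(`\s+`, " "),   // collapse runs of whitespace
		replaceAll(`'`, "_"),     // single quotes to underscores
		replaceAll(`[^\w ]`, ""), // drop anything else
	)
}

func main() {
	fmt.Println(slugify(`Isn't  this  neat?`)) // prints: isn_t this neat
}
```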

Benchmarks

These benchmarks can be regenerated using make benchmark.

Historical (make benchstats-historical)

Given that performance is basically the entire point of the rxp package, here are some benchmark statistics showing the evolution of the rxp package itself from v0.1.0 to the current v0.10.0. Each of these releases is present in a separate pre-release branch so that curious developers can easily study the progression of this initial development cycle.

goos: linux
goarch: arm64
pkg: github.com/go-corelibs/rxp
                     │     v0.1.0      │                 v0.2.0                  │                 v0.4.0                  │                 v0.8.0                  │                 v0.10.0                 │
                     │     sec/op      │     sec/op       vs base                │     sec/op       vs base                │     sec/op       vs base                │     sec/op       vs base                │
_FindAllString_Rxp      0.004292n ± 0%    0.003496n ± 1%  -18.56% (n=50)            0.002005n ± 0%  -53.30% (n=50)            0.001868n ± 1%  -56.48% (n=50)            0.002162n ± 1%  -49.62% (n=50)
_Pipeline_Combo_Rxp    0.0002862n ± 1%   0.0002866n ± 1%        ~ (p=0.920 n=50)   0.0002945n ± 3%   +2.86% (p=0.037 n=50)   0.0002910n ± 2%   +1.66% (p=0.010 n=50)   0.0002858n ± 1%        ~ (p=0.348 n=50)
_Pipeline_Readme_Rxp    0.047985n ± 0%    0.053185n ± 1%  +10.84% (p=0.000 n=50)    0.010055n ± 0%  -79.05% (n=50)            0.006152n ± 0%  -87.18% (n=50)            0.007117n ± 1%  -85.17% (n=50)
_Replace_ToUpper_Rxp     0.07639n ± 1%     0.08496n ± 1%  +11.22% (p=0.000 n=50)     0.03293n ± 0%  -56.90% (n=50)             0.02835n ± 0%  -62.89% (n=50)             0.02339n ± 1%  -69.38% (n=50)
geomean                 0.008192n         0.008203n        +0.13%                   0.003739n       -54.36%                   0.003120n       -61.91%                   0.003185n       -61.12%

Versus Regexp (make benchstats-regexp)

These benchmarks are loosely comparing regexp with rxp in "as similar as can be done" cases. While rxp seems to outperform regexp, note that a poorly crafted rxp Pattern can easily tank performance, as in the pipeline readme case below.

goos: linux
goarch: arm64
pkg: github.com/go-corelibs/rxp
                 │     regexp      │                   rxp                   │
                 │     sec/op      │     sec/op       vs base                │
_FindAllString      0.005376n ± 0%    0.002162n ± 1%  -59.78% (n=50)
_Pipeline_Combo    0.0028000n ± 1%   0.0002858n ± 1%  -89.79% (n=50)
_Pipeline_Readme    0.005760n ± 1%    0.007117n ± 1%  +23.56% (p=0.000 n=50)
_Replace_ToUpper     0.02497n ± 0%     0.02339n ± 1%   -6.29% (p=0.000 n=50)
geomean             0.006821n         0.003185n       -53.31%
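The "vs base" column can be reproduced by hand as the relative change of the sec/op figure against the base column. The sketch below assumes that reading of the table and recomputes two rows from the printed values; pctChange is an illustrative helper, not part of any tool.

```go
package main

import "fmt"

// pctChange returns the relative change from base to next, in percent.
// Negative values mean next is faster (fewer seconds per operation).
func pctChange(base, next float64) float64 {
	return (next - base) / base * 100
}

func main() {
	// _FindAllString: regexp 0.005376n vs rxp 0.002162n
	fmt.Printf("%+.2f%%\n", pctChange(0.005376, 0.002162)) // prints: -59.78%
	// _Pipeline_Readme: regexp 0.005760n vs rxp 0.007117n
	fmt.Printf("%+.2f%%\n", pctChange(0.005760, 0.007117)) // prints: +23.56%
}
```

The positive figure for the pipeline readme case is the one row where rxp is slower than regexp.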

Go-CoreLibs

Go-CoreLibs is a repository of shared code between the Go-Curses and Go-Enjin projects.

License

Copyright 2024 The Go-CoreLibs Authors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Functions

A creates a Matcher equivalent to the regexp [\A].
Alnum creates a Matcher equivalent to [:alnum:].
Alpha creates a Matcher equivalent to [:alpha:].
Ascii creates a Matcher equivalent to [:ascii:].
B creates a Matcher equivalent to the regexp [\b].
BackRef is a Matcher equivalent to Perl backreferences where the gid argument is the match group to use. BackRef will panic if the gid argument is less than one.
Blank creates a Matcher equivalent to [:blank:].
Caret creates a Matcher equivalent to the regexp caret [^].
Cntrl creates a Matcher equivalent to [:cntrl:].
D creates a Matcher equivalent to the regexp \d.
Digit creates a Matcher equivalent to [:digit:].
Dollar creates a Matcher equivalent to the regexp [$].
Dot creates a Matcher equivalent to the regexp dot (.).
Graph creates a Matcher equivalent to [:graph:].
Group processes the list of Matcher instances, in the order they were given, and stops at the first one that does not match, discarding any consumed runes.
IsAtLeastSixDigits creates a Matcher equivalent to: (?:\A[0-9]{6,}\z).
IsFieldKey creates a Matcher equivalent to: (?:\b[a-zA-Z][-_a-zA-Z0-9]+?[a-zA-Z0-9]\b). IsFieldKey is intended to validate CSS and HTML attribute key names such as "data-thing" or "some_value".
IsFieldWord creates a Matcher equivalent to: (?:\b[a-zA-Z0-9]+?[-_a-zA-Z0-9']*[a-zA-Z0-9]+\b|\b[a-zA-Z0-9]+\b).
IsHash10 creates a Matcher equivalent to: (?:[a-fA-F0-9]{10}).
IsKeyword is intended for Go-Enjin parsing of simple search keywords from user input and creates a Matcher equivalent to: (?:\b[-+]?[a-zA-Z][-_a-zA-Z0-9']+?[a-zA-Z0-9]\b).
IsUnicodeRange creates a Matcher equivalent to the regexp \pN where N is a unicode character class, passed to IsUnicodeRange as a unicode.RangeTable instance. For example, creating a Matcher for a single braille character: IsUnicodeRange(unicode.Braille).
IsUUID creates a Matcher equivalent to: (?:[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}).
Lower creates a Matcher equivalent to [:lower:].
MakeMatcher creates a rxp standard Matcher implementation wrapped around a given RuneMatcher.
NamedClass creates a Matcher equivalent to the regexp [:AsciiNames:]; see the AsciiNames constants for the list of supported ASCII class names. NamedClass will panic if given an invalid class name.
NewInputReader creates a new InputReader instance for the given input string.
Not processes all the matchers given, in the order they were given, stopping at the first one that succeeds, and inverts the proceed return value.

Not is equivalent to a negated character class in traditional regular expressions. For example, [^xyza-f] could be implemented as any of the following:

    // slower due to four different matchers being present
    Not(Text("x"), Text("y"), Text("z"), R("a-f"))
    // better but still has two matchers
    Not(R("xyza-f"))
    // no significant difference from the previous
    Or(R("xyza-f"), "^") //< negation (^) flag
    // simplified to just one matcher present
    R("xyza-f", "^") //< negation (^) flag

Here's the interesting bit about rxp though: if speed is really the goal, then the following would capture single characters matching [^xyza-f] with significantly better performance than MakeMatcher based matchers (use Pattern.Add to include the custom Matcher):

    func(scope Flags, reps Reps, input *InputReader, index int, sm [][2]int) (scoped Flags, consumed int, proceed bool) {
        scoped = scope
        if r, size, ok := input.Get(index); ok {
            // test for [xyza-f]
            proceed = (r >= 'x' && r <= 'z') || (r >= 'a' && r <= 'f')
            // and invert the result
            proceed = !proceed
            if proceed { // true means the negation is a match
                // MatchedFlag is required, CaptureFlag optionally if needed
                scoped |= MatchedFlag | CaptureFlag
                // consume this rune's size if a capture group is needed
                // using size instead of just 1 will allow support for
                // accurate []rune, []byte and string processing
                consumed += size
            }
        }
        return
    }
Or processes the list of Matcher instances, in the order they were given, and stops at the first one that matches. Or accepts Pattern, Matcher and string types and will panic on all others.
ParseFlags parses a regexp-like option string into a Flags instance and two integers, the low and high range of repetitions.

| Flags | Description |
|---------|-----------------------------------------------------------------------------------------|
| ^ | Invert the meaning of this match group |
| m | Multiline mode Caret and Dollar match begin/end of line in addition to begin/end text |
| s | DotNL allows Dot to match newlines (\n) |
| i | AnyCase is case-insensitive matching of unicode text |
| c | Capture allows this Matcher to be included in Pattern substring results |
| * | zero or more repetitions, prefer more |
| + | one or more repetitions, prefer more |
| ? | zero or one repetition, prefer one |
| {l,h} | range of repetitions, l minimum and up to h maximum, prefer more |
| {l,} | range of repetitions, l minimum, prefer more |
| {l} | range of repetitions, l minimum, prefer more |
| *? | zero or more repetitions, prefer less |
| +? | one or more repetitions, prefer less |
| ?? | zero or one repetition, prefer zero |
| {l,h}? | range of repetitions, l minimum and up to h maximum, prefer less |
| {l,}? | range of repetitions, l minimum, prefer less |
| {l}? | range of repetitions, l minimum, prefer less |

The flags presented above can be combined into a single string argument, or can be individually given to ParseFlags. Any parsing errors will result in a runtime panic.
ParseOptions accepts Pattern, Matcher and string options and recasts them into their specific types. ParseOptions will panic with any type other than Pattern, Matcher or string.
Print creates a Matcher equivalent to [:print:].
Punct creates a Matcher equivalent to [:punct:].
R creates a Matcher equivalent to regexp character class ranges such as: [xyza-f] where x, y and z are individual runes to accept and a-f is the inclusive range of letters from lowercase a to lowercase f to accept. Note: do not include the [] brackets unless the intent is to actually accept those characters.
RuneIsALNUM returns true for alphanumeric characters [a-zA-Z0-9].
RuneIsALPHA returns true for alphabetic characters [a-zA-Z].
RuneIsASCII returns true for valid ASCII characters [\x00-\x7F].
RuneIsBLANK returns true for tab and space characters [\t ].
RuneIsCNTRL returns true for control characters [\x00-\x1F\x7F].
RuneIsDIGIT returns true for number digits [0-9].
RuneIsGRAPH returns true for graphical characters [a-zA-Z0-9!"$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]. Note: upon the first use of RuneIsGRAPH, a lookup map is cached in a global variable and used for detecting the specific runes supported by the regexp [:graph:] class.
RuneIsLOWER returns true for lowercase alphabetic characters [a-z].
RuneIsPRINT returns true for space and RuneIsGRAPH characters [ [:graph:]]. Note: uses RuneIsGRAPH.
RuneIsPUNCT returns true for punctuation characters [!-/:-@[-`{-~].
RuneIsSpace returns true for space characters [\t\n\f\r ].
RuneIsSPACE returns true for empty space characters [\t\n\v\f\r ].
RuneIsUPPER returns true for uppercase alphabetic characters [A-Z].
RuneIsWord returns true for word characters [_a-zA-Z0-9].
RuneIsXDIGIT returns true for hexadecimal digits [a-fA-F0-9].
S creates a Matcher equivalent to the regexp \s.
Space creates a Matcher equivalent to [:space:].
Text creates a Matcher for the plain text given.
Upper creates a Matcher equivalent to [:upper:].
W creates a Matcher equivalent to the regexp \w.
Word creates a Matcher equivalent to [:word:].
WrapMatcher creates a Matcher using MakeMatcher and wrapping a RuneMatcher.
Xdigit creates a Matcher equivalent to [:xdigit:].
Z is a Matcher equivalent to the regexp [\z].
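To make the character class range idea behind R and the negation (^) flag concrete, here is a minimal sketch of turning a spec such as "xyza-f" into a rune test. Everything in it (runeRange, parseClass, inClass, keep) is hypothetical and written against the standard library only; it is not how rxp parses or evaluates ranges.

```go
package main

import "fmt"

// runeRange is an inclusive range of runes; a single rune has lo == hi.
type runeRange struct{ lo, hi rune }

// parseClass turns a spec such as "xyza-f" into a list of inclusive ranges.
// A '-' between two runes denotes a range; anywhere else it is a literal.
func parseClass(spec string) []runeRange {
	runes := []rune(spec)
	var ranges []runeRange
	for i := 0; i < len(runes); i++ {
		if i+2 < len(runes) && runes[i+1] == '-' {
			ranges = append(ranges, runeRange{runes[i], runes[i+2]})
			i += 2 // skip the '-' and the upper bound
			continue
		}
		ranges = append(ranges, runeRange{runes[i], runes[i]})
	}
	return ranges
}

// inClass reports whether r falls within any of the ranges.
func inClass(ranges []runeRange, r rune) bool {
	for _, rr := range ranges {
		if r >= rr.lo && r <= rr.hi {
			return true
		}
	}
	return false
}

// keep returns the runes of s that the class accepts or, when negate is
// true, the runes that it rejects (the [^...] form).
func keep(spec, s string, negate bool) string {
	ranges := parseClass(spec)
	var out []rune
	for _, r := range s {
		if inClass(ranges, r) != negate {
			out = append(out, r)
		}
	}
	return string(out)
}

func main() {
	fmt.Println(keep("xyza-f", "fizzbuzz", false)) // prints: fzzbzz
	fmt.Println(keep("xyza-f", "fizzbuzz", true))  // prints: iu
}
```

Negation here is a single boolean comparison on the result of one class test, which mirrors why a single negated matcher can be cheaper than wrapping several matchers.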

# Constants

No description provided by the author

# Variables

No description provided by the author

# Structs

InputReader is an efficient rune based buffer.
Stage is one phase of a text replacement Pipeline and receives an input string from the previous stage (or the initial input text) and returns the output provided to the next stage (or is finally returned to the caller).
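As a rough illustration of what a rune based buffer offers over indexing a string directly, the sketch below stores the input as runes and returns, for a given rune index, the rune together with its encoded size. The runeBuffer type and its get method are hypothetical; the assumption that "size" means the UTF-8 byte length is mine and may not match InputReader.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeBuffer is a hypothetical rune-indexed view of a string.
type runeBuffer struct {
	runes []rune
}

// newRuneBuffer decodes s once so later lookups are simple slice accesses.
func newRuneBuffer(s string) *runeBuffer {
	return &runeBuffer{runes: []rune(s)}
}

// get returns the rune at index, its UTF-8 encoded size in bytes, and
// whether index was within range.
func (b *runeBuffer) get(index int) (r rune, size int, ok bool) {
	if index < 0 || index >= len(b.runes) {
		return 0, 0, false
	}
	r = b.runes[index]
	return r, utf8.RuneLen(r), true
}

func main() {
	r, size, ok := newRuneBuffer("héllo").get(1)
	fmt.Printf("%c %d %v\n", r, size, ok) // prints: é 2 true
}
```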

# Type aliases

No description provided by the author
Matcher is a single string matching function.

| Argument | Description |
|----------|------------------------------------|
| scope | current Flags for this iteration |
| reps | min and max repetition settings |
| input | input rune slice (do not modify!) |
| index | current input rune index to match |

| Return | Description |
|----------|------------------------------------|
| scoped | possibly modified sub-match scope |
| consumed | number of runes matched from index |
| proceed | success, keep matching for more |
Pattern is a list of Matcher functions, all of which must match, in the order present, in order to consider the Pattern to match.
Pipeline is a list of stages for transforming strings in a single procedure.
Replace is a Replacer pipeline.
Replacer is a Replace processor function. The captured argument is the result of the Pattern match process and is composed of the entire matched text as the first item in the captured list, and any Pattern capture groups following. The modified string argument is the output of the previous Replacer in the Replace process, or the original matched input text if this is the first Replacer in the process.
No description provided by the author
RuneMatcher is the signature for the basic character matching functions such as RuneIsWord. Implementations are expected to operate using the least amount of CPU instructions possible.
Transform is the function signature for non-rxp string transformation pipeline stages.
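The Replacer description (captured list first, then the running modified string) can be sketched with the standard regexp package standing in for a Pattern. The replacer type, replaceWith and annotate below are hypothetical names for illustration and are not the rxp API.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// replacer follows the described shape: captured[0] is the whole match,
// captured[1:] are the capture groups, and modified is the output of the
// previous replacer (or the matched text for the first one).
type replacer func(captured []string, modified string) string

// replaceWith applies the chain of replacers to every match of re in input.
func replaceWith(re *regexp.Regexp, input string, chain ...replacer) string {
	var out strings.Builder
	last := 0
	for _, loc := range re.FindAllStringSubmatchIndex(input, -1) {
		captured := make([]string, 0, len(loc)/2)
		for i := 0; i < len(loc); i += 2 {
			if loc[i] < 0 { // this group did not participate in the match
				captured = append(captured, "")
				continue
			}
			captured = append(captured, input[loc[i]:loc[i+1]])
		}
		modified := captured[0]
		for _, rep := range chain {
			modified = rep(captured, modified)
		}
		out.WriteString(input[last:loc[0]]) // text between matches
		out.WriteString(modified)
		last = loc[1]
	}
	out.WriteString(input[last:])
	return out.String()
}

var userAtHost = regexp.MustCompile(`(\w+)@(\w+)`)

// annotate uppercases each match, then appends the first capture group.
func annotate(input string) string {
	upper := func(_ []string, modified string) string {
		return strings.ToUpper(modified)
	}
	tag := func(captured []string, modified string) string {
		return modified + " (user " + captured[1] + ")"
	}
	return replaceWith(userAtHost, input, upper, tag)
}

func main() {
	fmt.Println(annotate("alice@example bob@test"))
	// prints: ALICE@EXAMPLE (user alice) BOB@TEST (user bob)
}
```

The second replacer still sees the original capture groups even though the first one has already changed the running text, which is the point of passing both arguments.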