Module: github.com/dolthub/go-icu-regex
Version: 0.0.0-20230524105445-af7e7991c97e
Repository: https://github.com/dolthub/go-icu-regex.git
Documentation: pkg.go.dev

# README

ICU Regular Expressions in Go

The ICU library is used in MySQL to parse regular expressions. Go's built-in regular expressions follow a different standard than ICU, and thus can cause inconsistencies when attempting to match MySQL's behavior. These inconsistencies would hopefully result in an error (prompting user intervention), but may silently return unexpected results, raising no alarm when data is being modified in unexpected ways.
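
As one concrete illustration of the difference, Go's standard `regexp` package (RE2 syntax) rejects backreferences and lookahead assertions, both of which ICU's regular expression syntax accepts. The sketch below uses only the standard library. The helper name `compilesInGo` is made up for this example and is not part of this package.

```go
package main

import (
	"fmt"
	"regexp"
)

// compilesInGo reports whether Go's standard regexp package accepts the pattern.
// It is a hypothetical helper for illustration only.
func compilesInGo(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`a+b`,        // plain pattern, valid in both Go and ICU
		`(a)\1`,      // backreference, valid in ICU but rejected by Go
		`foo(?=bar)`, // lookahead, valid in ICU but rejected by Go
	}
	for _, p := range patterns {
		fmt.Printf("%-14q accepted by Go regexp: %v\n", p, compilesInGo(p))
	}
}
```

In these two cases Go reports a compile error, which is the "hopefully an error" outcome described above. The more dangerous cases are patterns that compile in both engines but match differently.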

To get around this, we've implemented the necessary ICU functions by compiling them into a WebAssembly module, and running the module using the wazero library. Although this approach comes with a performance penalty, it lets packages that depend on this one retain cross-compilation support, since this package does not require CGo.

Building

To make modifications to the compiled WASM module, we've included a build script. The requirements are as follows:

  • Emscripten v3.1.38
  • wasm2wat
  • wat2wasm

Other Emscripten versions may compile just fine, but they have not been tested, so we restrict compilation to the tested version only. This also means that the ICU library is version 68.1, as that is the only version that our supported version of Emscripten has ported. Both wasm2wat and wat2wasm are needed to expose the global stack variable, as not all platforms expose it. None of the exposed functions require ICU's data, so it has been excluded to save on space and memory usage. MySQL, although collation-aware (and in spite of what the documentation may suggest), does not make use of any collation functionality in the context of regular expressions.

Notes

Due to the high startup cost of the WASM runtime, this package requires that every Regex object be closed before it is dereferenced. If a Regex object is dereferenced before being closed, a panic will occur at some non-deterministic point in the future.

# Functions

CreateRegex creates a Regex, with a region of memory that has been preallocated to support strings that are less than or equal to the given size.

# Constants

Enable case-insensitive matching.
Allow white space and comments within patterns.
If set, '.' matches line terminators, otherwise '.' matching stops at line end.
Error on unrecognized backslash escapes.
If set, treat the entire pattern as a literal string.
Control behavior of "$" and "^".
Enable case-insensitive matching.
Unicode word boundaries.
Unix-only line endings.
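
Flags of this kind are typically bit values that can be combined. The sketch below assumes they are combined with bitwise OR, as in ICU's C API. The numeric values mirror ICU's `URegexpFlag` enum, but the `Flags` type and the constant names are hypothetical and may not match this package's identifiers.

```go
package main

import "fmt"

// Flags is a hypothetical bit-flag type; it is not this package's RegexFlags.
type Flags uint32

// The values mirror ICU's URegexpFlag enum. The Go names are made up here.
const (
	FlagCaseInsensitive Flags = 2
	FlagMultiline       Flags = 8
	FlagDotAll          Flags = 32
)

// Has reports whether every bit in want is set in f.
func (f Flags) Has(want Flags) bool { return f&want == want }

func main() {
	// Combine two behaviors into a single flag value.
	combined := FlagCaseInsensitive | FlagMultiline
	fmt.Println("case-insensitive set:", combined.Has(FlagCaseInsensitive))
	fmt.Println("dot-all set:", combined.Has(FlagDotAll))
}
```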

# Variables

ErrInvalidRegex is returned when an invalid regex is given.
ErrMatchNotYetSet is returned when attempting to use another function before the match string has been set.
ErrRegexNotYetSet is returned when attempting to use another function before the regex has been initialized.

# Interfaces

Regex is an interface that wraps around the ICU library, exposing ICU's regular expression functionality.

# Type aliases

RegexFlags are flags to define the behavior of the regular expression.