Categorygithub.com/aaw/regrams
modulepackage
0.0.0-20230610133622-813af43b179b
Repository: https://github.com/aaw/regrams.git
Documentation: pkg.go.dev

# README

regrams

regrams is a go package that implements an engine for text search by regular expression queries. Given a collection of documents that are indexed by trigrams and a regular expression, regrams can generate a boolean query for the trigram index that retrieves good candidates that might match the regular expression.

For example, regrams.MakeQuery("a(bc)+d") returns (abc) (bcb|bcd), which means that any document that matches the regular expression a(bc)+d must contain the trigram abc as well as one of bcb or bcd.

regrams is inspired by Google's codesearch repo and Russ Cox's corresponding blog post explaining the implementation. Instead of using the structure of the parsed regular expression to generate queries, however, regrams creates an NFA from the query with nodes weighted by the size of the reachable trigram sets and repeatedly extracts disjunctions from the NFA by identifying and removing minimum-weight graph cuts. I wrote a post on regular expression search via graph cuts that gives a lot more detail about the technique.

The easiest way to try out this package is to run the command-line wrapper included in this repo:

$ go run cmd/regrams/main.go abcd
(abc) (bcd)
$ go run cmd/regrams/main.go 'Time: [0-9]+ ms'
( ms) (: 0|: 1|: 2|: 3|: 4|: 5|: 6|: 7|: 8|: 9) (Tim) (e: ) (ime) (me:)
$ go run cmd/regrams/main.go '(abc*)+de'
(aba|abc|abd) (bab|bca|bcc|bcd|bde)
$ go run cmd/regrams/main.go '(?i)abc'
(ABC|ABc|AbC|Abc|aBC|aBc|abC|abc)

Running go run cmd/regrams/main.go --help will give you more options.

Alternatively, you can try out examples in a browser at the Go Playground.

To use regrams as a package, import github.com/aaw/regrams, then call regrams.MakeQuery in your code to generate a trigram query from a string representing a regular expression. MakeQuery returns a slice-of-slices of strings, which should be interpreted as a boolean query in conjunctive normal form: the slice [["abc"], ["xyz","xyw"], ["xxx", "yyy", "zzz"]] represents the query abc AND (xyz OR xyw) AND (xxx OR yyy OR zzz).

# Packages

No description provided by the author

# Functions

No description provided by the author
No description provided by the author

# Structs

Only use this if you want to tune config values in query processing like how large a trigram set to consider, how large an NFA to consider, etc.

# Type aliases

A trigram query.