Package: github.com/sugarme/tokenizer/normalizer

Version: 0.2.2

Repository: https://github.com/sugarme/tokenizer.git

Documentation: pkg.go.dev

# Functions

BytesToChar

BytesToChar converts a given range from bytes to `char`.

CharToBytes

CharToBytes converts a given range from `char` to bytes.
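
The two conversions above can be illustrated with a standard-library sketch. It does not call this package; the helper names `charToByteRange` and `byteToCharRange`, their signatures, and the choice to reject byte offsets that fall inside a multi-byte rune are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charToByteRange maps a half-open rune range [start, end) over s to the
// matching half-open byte range. ok is false if the range is out of bounds.
func charToByteRange(s string, start, end int) (bStart, bEnd int, ok bool) {
	if start < 0 || end < start || end > utf8.RuneCountInString(s) {
		return 0, 0, false
	}
	// Defaults cover indexes equal to the rune count (end of string).
	bStart, bEnd = len(s), len(s)
	charIdx := 0
	for byteIdx := range s {
		if charIdx == start {
			bStart = byteIdx
		}
		if charIdx == end {
			bEnd = byteIdx
			break
		}
		charIdx++
	}
	return bStart, bEnd, true
}

// isBoundary reports whether byte offset i starts a rune (or is the end of s).
func isBoundary(s string, i int) bool {
	return i == len(s) || utf8.RuneStart(s[i])
}

// byteToCharRange maps a half-open byte range [start, end) over s to the
// matching half-open rune range. ok is false if the range is out of bounds
// or cuts through a multi-byte rune (an assumption of this sketch).
func byteToCharRange(s string, start, end int) (cStart, cEnd int, ok bool) {
	if start < 0 || end < start || end > len(s) {
		return 0, 0, false
	}
	if !isBoundary(s, start) || !isBoundary(s, end) {
		return 0, 0, false
	}
	return utf8.RuneCountInString(s[:start]), utf8.RuneCountInString(s[:end]), true
}

func main() {
	// "é" occupies two bytes, so rune indexes and byte indexes diverge after it.
	fmt.Println(charToByteRange("héllo", 1, 3)) // 1 4 true
	fmt.Println(byteToCharRange("héllo", 1, 4)) // 1 3 true
	fmt.Println(byteToCharRange("héllo", 2, 4)) // 0 0 false
}
```

The package's own functions may handle out-of-bounds or mid-rune offsets differently; check their signatures before relying on this behavior.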

IsBertPunctuation

IsBertPunctuation checks whether an input rune counts as punctuation under BERT's rules.

IsBertWhitespace

IsBertWhitespace checks whether an input rune is a BERT whitespace.
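
As a rough illustration of what BERT-style character classes look like, the sketch below follows the rules of the public BERT reference tokenizer: whitespace is space, tab, newline, carriage return, or any Unicode space separator (Zs); punctuation is any non-alphanumeric printable ASCII symbol or any rune in a Unicode P category. Whether this package applies exactly the same rules is an assumption, and the lowercase helper names are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// isBertWhitespace mirrors the whitespace test of the BERT reference
// tokenizer (assumed, not taken from this package's source).
func isBertWhitespace(r rune) bool {
	switch r {
	case ' ', '\t', '\n', '\r':
		return true
	}
	return unicode.Is(unicode.Zs, r)
}

// isBertPunctuation treats every non-alphanumeric printable ASCII symbol as
// punctuation, including characters such as '$' and '^' that Unicode does
// not place in a P category, and falls back to the Unicode P categories.
func isBertPunctuation(r rune) bool {
	if (r >= 33 && r <= 47) || (r >= 58 && r <= 64) ||
		(r >= 91 && r <= 96) || (r >= 123 && r <= 126) {
		return true
	}
	return unicode.IsPunct(r)
}

func main() {
	fmt.Println(isBertWhitespace('\u00a0')) // true: no-break space is in Zs
	fmt.Println(isBertPunctuation('$'))     // true: ASCII symbol range
	fmt.Println(isBertPunctuation('a'))     // false
}
```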

IsChinese

IsChinese reports whether rune c is in the CJK range according to the BERT spec.
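
For reference, the CJK check in the BERT reference tokenizer covers the CJK Unified Ideographs blocks and their compatibility blocks, and deliberately excludes Hiragana, Katakana, and Hangul. The sketch below reproduces those code point ranges; the helper name `isCJK` is hypothetical, and it is an assumption that this package uses the identical table.

```go
package main

import "fmt"

// cjkRanges lists the inclusive code point ranges that the BERT reference
// tokenizer treats as "Chinese characters".
var cjkRanges = [][2]rune{
	{0x4E00, 0x9FFF},   // CJK Unified Ideographs
	{0x3400, 0x4DBF},   // Extension A
	{0x20000, 0x2A6DF}, // Extension B
	{0x2A700, 0x2B73F}, // Extension C
	{0x2B740, 0x2B81F}, // Extension D
	{0x2B820, 0x2CEAF}, // Extension E
	{0xF900, 0xFAFF},   // CJK Compatibility Ideographs
	{0x2F800, 0x2FA1F}, // CJK Compatibility Ideographs Supplement
}

// isCJK reports whether r falls in any of the ranges above.
func isCJK(r rune) bool {
	for _, rg := range cjkRanges {
		if r >= rg[0] && r <= rg[1] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isCJK('中'), isCJK('a'), isCJK('あ')) // true false false
}
```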

IsPunctuation

IsPunctuation reports whether the input rune is punctuation.

IsWhitespace

IsWhitespace checks whether an input rune is a whitespace.

Lowercase

Lowercase creates a lowercase normalizer.

NewBertNormalizer

No description provided by the author

NewDefaultNormalizer

No description provided by the author

NewFnPattern

No description provided by the author

NewInvertPattern

No description provided by the author

NewNFC

No description provided by the author

NewNFD

No description provided by the author

NewNFKC

No description provided by the author

NewNFKD

No description provided by the author

NewNormalizedFrom

NewNormalizedFrom creates a Normalized instance from string input.

NewNormalizedString

No description provided by the author

NewNormalizer

No description provided by the author

NewPrepend

No description provided by the author

NewRange

No description provided by the author

NewRegexpPattern

No description provided by the author

NewReplace

No description provided by the author

NewRunePattern

No description provided by the author

NewSequence

No description provided by the author

NewStringPattern

No description provided by the author

NewStrip

No description provided by the author

NewStripAccents

No description provided by the author

NewUnicodeNormalizer

No description provided by the author

RangeOf

RangeOf returns a range of the normalized string. It returns an empty string if the input range is out of bounds.

WithBertNormalizer

WithBertNormalizer creates normalizer with BERT normalization features.

WithLowercase

No description provided by the author

WithStrip

No description provided by the author

WithUnicodeNormalizer

WithUnicodeNormalizer creates a normalizer with one of the Unicode normalization forms: NFD, NFC, NFKD, or NFKC.

# Constants

ContiguousBehavior

No description provided by the author

IsolatedBehavior

No description provided by the author

MergedWithNextBehavior

No description provided by the author

MergedWithPreviousBehavior

No description provided by the author

NormalizedTarget

No description provided by the author

OriginalTarget

No description provided by the author

Regex

No description provided by the author

RemovedBehavior

No description provided by the author

String

No description provided by the author

# Structs

BertNormalizer

No description provided by the author

ChangeMap

No description provided by the author

DefaultNormalizer

No description provided by the author

FnPattern

No description provided by the author

Invert

Invert inverts the `is_match` flags for the wrapped Pattern.

NFC

No description provided by the author

NFD

No description provided by the author

NFKC

No description provided by the author

NFKD

No description provided by the author

NormalizedString

A `NormalizedString` takes care of processing an "original" string to modify it and obtain a "normalized" string.

OffsetsMatch

OffsetsMatch combines an Offsets position with a boolean indicating whether it is a match.

OffsetsRemove

No description provided by the author

Precompiled

No description provided by the author

Prepend

Prepend creates a normalizer that strips the normalized string in place.

Range

Range is a slice of indexes on either the normalized string or the original string. Its start is INCLUSIVE and its end is EXCLUSIVE.
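
The inclusive-start, exclusive-end convention is the same one Go slices use. A minimal sketch, assuming rune-based indexes and an empty result for out-of-bounds input (as RangeOf is described above); the `span` type and `sliceRunes` helper are hypothetical stand-ins, not this package's types:

```go
package main

import "fmt"

// span is a half-open range: start is included, end is excluded.
type span struct {
	start, end int
}

// length is simply end - start, one benefit of the half-open convention.
func (s span) length() int { return s.end - s.start }

// sliceRunes returns the runes of text covered by s, or "" when s is
// out of bounds.
func sliceRunes(text string, s span) string {
	runes := []rune(text)
	if s.start < 0 || s.end > len(runes) || s.start > s.end {
		return ""
	}
	return string(runes[s.start:s.end])
}

func main() {
	r := span{start: 1, end: 3}
	fmt.Printf("%q %d\n", sliceRunes("héllo", r), r.length())  // "él" 2
	fmt.Printf("%q\n", sliceRunes("héllo", span{start: 2, end: 9})) // ""
}
```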

RegexpPattern

No description provided by the author

Replace

No description provided by the author

RunePattern

RunePattern is a wrapper of primitive rune so that it can implement `Pattern` interface.

Sequence

Sequence wraps a slice of normalizers to normalize string in sequence.
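
The idea of chaining normalizers can be sketched without the package itself. The interface below works on plain strings, whereas the real Normalizer presumably operates on a NormalizedString so that offsets stay aligned; the names `normalizer`, `normalizerFunc`, `sequence`, and `demoSequence` are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizer is a simplified stand-in for a normalization step.
type normalizer interface {
	normalize(s string) string
}

// normalizerFunc adapts an ordinary function to the normalizer interface.
type normalizerFunc func(string) string

func (f normalizerFunc) normalize(s string) string { return f(s) }

// sequence applies its steps in order, feeding each output to the next step.
type sequence struct {
	steps []normalizer
}

func (q sequence) normalize(s string) string {
	for _, step := range q.steps {
		s = step.normalize(s)
	}
	return s
}

// demoSequence builds a strip-then-lowercase pipeline.
func demoSequence() sequence {
	return sequence{steps: []normalizer{
		normalizerFunc(strings.TrimSpace),
		normalizerFunc(strings.ToLower),
	}}
}

func main() {
	fmt.Printf("%q\n", demoSequence().normalize("  Hello World \n")) // "hello world"
}
```

Because `sequence` itself satisfies `normalizer`, sequences can be nested inside other sequences in this sketch.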

StringPattern

StringPattern is a wrapper of primitive string so that it can implement `Pattern` interface.

Strip

No description provided by the author

StripAccents

No description provided by the author

UnicodeNormalizer

No description provided by the author

# Interfaces

Normalizer

No description provided by the author

Pattern

Pattern is used to split a NormalizedString.
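
To make the relationship between Pattern, OffsetsMatch, RunePattern, and Invert more concrete, here is a simplified sketch. It assumes a pattern reports consecutive byte spans that cover the whole input, each flagged as match or non-match, and that inverting flips only the flags. All types and the `findMatches` method name are hypothetical and are not copied from this package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// offsetsMatch is a half-open byte span plus a flag telling whether the
// span matched the pattern.
type offsetsMatch struct {
	start, end int
	match      bool
}

// pattern splits an input into consecutive matching and non-matching spans.
type pattern interface {
	findMatches(s string) []offsetsMatch
}

// runePattern matches every occurrence of a single rune.
type runePattern rune

func (p runePattern) findMatches(s string) []offsetsMatch {
	var out []offsetsMatch
	prev := 0 // start of the pending non-matching span
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == rune(p) {
			if prev < i {
				out = append(out, offsetsMatch{prev, i, false})
			}
			out = append(out, offsetsMatch{i, i + size, true})
			prev = i + size
		}
		i += size
	}
	if prev < len(s) {
		out = append(out, offsetsMatch{prev, len(s), false})
	}
	return out
}

// invert wraps another pattern and flips the match flag of every span.
type invert struct {
	inner pattern
}

func (p invert) findMatches(s string) []offsetsMatch {
	spans := p.inner.findMatches(s)
	out := make([]offsetsMatch, len(spans))
	for i, sp := range spans {
		sp.match = !sp.match
		out[i] = sp
	}
	return out
}

func main() {
	dash := runePattern('-')
	fmt.Println(dash.findMatches("a-b"))                // [{0 1 false} {1 2 true} {2 3 false}]
	fmt.Println(invert{inner: dash}.findMatches("a-b")) // [{0 1 true} {1 2 false} {2 3 true}]
}
```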

# Type aliases

DefaultOption

No description provided by the author

IndexOn

IndexOn is an enum-like type representing which string (original or normalized) the range indexes on.

NormFn

NormFn is a convenience function type applied to each `char` of the normalized string.

Option

No description provided by the author

PatternFn

PatternFn is a func type to apply pattern.

ReplacePattern

Enum of different patterns that Replace can use.

SplitDelimiterBehavior

SplitDelimiterBehavior is an enum-like type.
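
The listing does not describe the individual behaviors, so the sketch below is an assumption based on the constant names (RemovedBehavior, IsolatedBehavior, MergedWithPreviousBehavior, MergedWithNextBehavior, ContiguousBehavior) and on the similarly named split behaviors in Hugging Face's tokenizers library. The `behavior` type and `split` helper are hypothetical and work on plain strings with a single-rune delimiter.

```go
package main

import "fmt"

type behavior int

const (
	removed            behavior = iota // drop delimiters
	isolated                           // each delimiter is its own piece
	mergedWithPrevious                 // delimiter joins the piece before it
	mergedWithNext                     // delimiter joins the piece after it
	contiguous                         // runs of delimiters form one piece
)

type piece struct {
	text    string
	isDelim bool
}

// split cuts s at every delim and arranges the pieces according to b.
func split(s string, delim rune, b behavior) []string {
	var pieces []piece
	var cur []rune
	flush := func() {
		if len(cur) > 0 {
			pieces = append(pieces, piece{string(cur), false})
			cur = cur[:0]
		}
	}
	for _, r := range s {
		if r == delim {
			flush()
			pieces = append(pieces, piece{string(r), true})
		} else {
			cur = append(cur, r)
		}
	}
	flush()

	var out []string
	switch b {
	case removed:
		for _, p := range pieces {
			if !p.isDelim {
				out = append(out, p.text)
			}
		}
	case isolated:
		for _, p := range pieces {
			out = append(out, p.text)
		}
	case contiguous:
		prevDelim := false
		for _, p := range pieces {
			if p.isDelim && prevDelim {
				out[len(out)-1] += p.text
			} else {
				out = append(out, p.text)
			}
			prevDelim = p.isDelim
		}
	case mergedWithPrevious:
		prevDelim := true // a leading delimiter has nothing to attach to
		for _, p := range pieces {
			if p.isDelim && !prevDelim {
				out[len(out)-1] += p.text
			} else {
				out = append(out, p.text)
			}
			prevDelim = p.isDelim
		}
	case mergedWithNext:
		pending := "" // delimiter waiting for the next non-delimiter piece
		for _, p := range pieces {
			if p.isDelim {
				if pending != "" {
					out = append(out, pending)
				}
				pending = p.text
			} else {
				out = append(out, pending+p.text)
				pending = ""
			}
		}
		if pending != "" {
			out = append(out, pending)
		}
	}
	return out
}

func main() {
	in := "the-final--countdown"
	fmt.Printf("%q\n", split(in, '-', removed))            // ["the" "final" "countdown"]
	fmt.Printf("%q\n", split(in, '-', isolated))           // ["the" "-" "final" "-" "-" "countdown"]
	fmt.Printf("%q\n", split(in, '-', mergedWithPrevious)) // ["the-" "final-" "-" "countdown"]
	fmt.Printf("%q\n", split(in, '-', mergedWithNext))     // ["the" "-final" "-" "-countdown"]
	fmt.Printf("%q\n", split(in, '-', contiguous))         // ["the" "-" "final" "--" "countdown"]
}
```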