github.com/sugarme/tokenizer
module, package
Version: 0.2.2
Repository: https://github.com/sugarme/tokenizer.git
Documentation: pkg.go.dev

# README


Overview

tokenizer is a pure Go package that facilitates applying Natural Language Processing (NLP) models for training, testing, and inference in Go.

It is heavily inspired by and based on the popular HuggingFace Tokenizers.

tokenizer is part of an ambitious goal (together with transformer and gotch) to bring more AI/deep-learning tools to Gophers so that they can stick to the language they love and build faster software in production.

Features

tokenizer is built from modules located in sub-packages:

  1. Normalizer
  2. Pretokenizer
  3. Tokenizer
  4. Post-processing

It implements various tokenizer models:

  • Word level model
  • Wordpiece model
  • Byte Pair Encoding (BPE)
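
To make the pipeline concrete, here is a toy, self-contained sketch of how the four modules compose around the simplest of these models, a word-level vocabulary lookup. It is purely illustrative and does not use this package's API:

package main

import (
	"fmt"
	"strings"
)

// normalize is stage 1 (Normalizer): clean up the raw text.
func normalize(s string) string { return strings.ToLower(s) }

// preTokenize is stage 2 (Pretokenizer): split text into candidate words.
func preTokenize(s string) []string { return strings.Fields(s) }

// lookup is stage 3 (the model): a toy word-level vocabulary lookup.
func lookup(words []string, vocab map[string]int) []int {
	ids := make([]int, 0, len(words))
	for _, w := range words {
		id, ok := vocab[w]
		if !ok {
			id = vocab["[UNK]"]
		}
		ids = append(ids, id)
	}
	return ids
}

// postProcess is stage 4 (Post-processing): wrap ids with special tokens.
func postProcess(ids []int, cls, sep int) []int {
	out := append([]int{cls}, ids...)
	return append(out, sep)
}

func main() {
	vocab := map[string]int{"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "the": 3, "gophers": 4, "craft": 5, "code": 6}
	ids := lookup(preTokenize(normalize("The Gophers craft code")), vocab)
	fmt.Println(postProcess(ids, vocab["[CLS]"], vocab["[SEP]"]))
	// Output: [1 3 4 5 6 2]
}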

It can be used both for training new models from scratch and for fine-tuning existing models. See the examples for details.

Basic example

This tokenizer package can load pretrained models from HuggingFace. Some of them can be loaded using the pretrained subpackage.

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer"
	"github.com/sugarme/tokenizer/pretrained"
)

func main() {
	// Download and cache the pretrained tokenizer. In this case `bert-base-uncased`
	// from HuggingFace; it can be any model with a `tokenizer.json` available,
	// e.g. `tiiuae/falcon-7b`.
	configFile, err := tokenizer.CachedPath("bert-base-uncased", "tokenizer.json")
	if err != nil {
		panic(err)
	}

	tk, err := pretrained.FromFile(configFile)
	if err != nil {
		panic(err)
	}

	sentence := `The Gophers craft code using [MASK] language.`
	en, err := tk.EncodeSingle(sentence)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("tokens: %q\n", en.Tokens)
	fmt.Printf("offsets: %v\n", en.Offsets)

	// Output
	// tokens: ["the" "go" "##pher" "##s" "craft" "code" "using" "[MASK]" "language" "."]
	// offsets: [[0 3] [4 6] [6 10] [10 11] [12 17] [18 22] [23 28] [29 35] [36 44] [44 45]]
}
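
The offsets are [start, end) byte ranges into the original sentence, one pair per token, so every token can be mapped back to the exact span it came from. This small self-contained snippet replays the offsets printed above against the same sentence (hard-coding them rather than calling the library):

package main

import "fmt"

func main() {
	sentence := "The Gophers craft code using [MASK] language."
	// Offset pairs exactly as printed by the example above: [start, end)
	// byte positions into the original sentence, one per token.
	offsets := [][2]int{{0, 3}, {4, 6}, {6, 10}, {10, 11}, {12, 17}, {18, 22}, {23, 28}, {29, 35}, {36, 44}, {44, 45}}
	for _, off := range offsets {
		fmt.Printf("%q ", sentence[off[0]:off[1]])
	}
	fmt.Println()
	// Output: "The" "Go" "pher" "s" "craft" "code" "using" "[MASK]" "language" "."
}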

All models can be loaded from files manually. See pkg.go.dev for the detailed APIs.
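
For instance, a tokenizer can be instantiated straight from a local tokenizer.json via NewTokenizerFromFile (documented under Functions below). This is a minimal sketch; the (*Tokenizer, error) return shape is assumed here, so check pkg.go.dev for the exact signature:

package main

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer"
)

func main() {
	// NewTokenizerFromFile instantiates a new Tokenizer from the given file.
	// The (*Tokenizer, error) return shape is an assumption.
	tk, err := tokenizer.NewTokenizerFromFile("path/to/tokenizer.json")
	if err != nil {
		log.Fatal(err)
	}

	en, err := tk.EncodeSingle("Gophers love Go.")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("tokens: %q\n", en.Tokens)
}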

Getting Started

License

tokenizer is Apache 2.0 licensed.

Acknowledgement

# Functions

CachedPath resolves and caches data based on an input string, then returns the full path to the cached data.
CleanCache removes all files cached in the transformer cache directory `CachedDir`.
ConfigFromFile loads config from file.
DefaultAddedToken initiates a default AddedToken.
Default creates an encoding with default values.
DefaultProcess is a helper function for PostProcessor's Process method. It fast-tracks processing by simply merging an encoding and its pair.
MergeEncodings merges slice of encodings together.
NewAddedToken builds an AddedToken from the given content, specifying whether it is intended to be a special token.
NewEncoding initiates a new encoding from input data.
NewEncodingFromTokens initiates an Encoding from input tokens.
NewInputSequence creates a new InputSequence from input. A valid input can be a string (RawInput) or a slice of strings (PretokenizedInput).
NewPreTokenizedString creates a new PreTokenizedString from an input string.
NewNormalizedStringFromNS creates a PreTokenizedString from an input NormalizedString.
NewSplit creates a new Split from an input NormalizedString.
NewToken generates a new Token from input data.
NewTokenizer instantiates a new Tokenizer.
NewTokenizerFromFile instantiates a new Tokenizer from the given file.
PrepareEncodings prepares an encoding and its pairEncoding (if any) before the `ProcessEncodings` call.
WithLStrip specifies whether this token should include all the whitespace on its left, in order to strip it out.
WithNormalized specifies whether this token should be normalized and matched against its normalized version in the input text.
WithRStrip specifies whether this token should include all the whitespace on its right, in order to strip it out.
WithSingleWord specifies whether this token should only match on whole single words, and never part of a word.

# Structs

AddedToken represents a token added by the user on top of the existing model vocabulary.
AddedVocabulary is a vocabulary built on top of the Model. It provides a way to add new vocabulary to a Tokenizer that has already been trained, in a previous process, perhaps by someone else.
Config constructs the configuration for creating a Tokenizer.
Encoding represents the output of a tokenizer.
PaddingStrategy is an enum of either the string `BatchLongest` or a func type `Fixed(uint)` which returns a uint. Example:

func main() {
	var ps PaddingStrategy
	ps = NewPaddingStrategy(WithFixed(3))
	fmt.Println(ps.Value)
}
The `PreTokenizedString` is in charge of splitting an underlying string, making sure everything is fine while doing so, and providing ways to normalize and tokenize these splits.
Split contains the underlying `NormalizedString` as well as its offsets in the original string.
Tokenizer represents a tokenization pipeline.

# Interfaces

Decoder takes care of merging the given slice of tokens back into a string.
Model represents a model used during tokenization (i.e., BPE, Word, or Unigram).
PostProcessor is in charge of post-processing an encoded output of the `Tokenizer`.
PreTokenizer is in charge of doing the pre-segmentation step.
Trainer is responsible for training a model.

# Type aliases

OffsetType is an enum-like type for the possible types of offsets.
SplitFn takes a `NormalizedString` and returns an iterator over the produced `NormalizedString`.
TruncationStrategy is an int-based enum that represents the truncation strategy.