Package github.com/cohere-ai/tokenizer (module)
Version: 1.1.2
Repository: https://github.com/cohere-ai/tokenizer.git
Documentation: pkg.go.dev

# README

Tokenizers

Cohere's tokenizers library provides an interface to encode and decode text given a computed vocabulary, and includes pre-computed tokenizers that are used to train Cohere's models.

We also plan to eventually open source tools for creating new tokenizers.

Example using Go

Choose a tokenizer inside the vocab folder (each tokenizer includes both an encoder.json file and a vocab.bpe file) and create an encoder as shown below. The tokenizer used in this example is the coheretext-50k tokenizer.

import (
  ...
  "github.com/cohere-ai/tokenizer"
)

encoder := tokenizer.NewFromPrebuilt("coheretext-50k")

To encode a string of text, use the Encode method. Encode returns a slice of int64s.

encoded := encoder.Encode("this is a string to be encoded")
fmt.Printf("%v", encoded)
// [6372 329 258 3852 288 345 37754]

To decode a slice of int64s, use the Decode method. Decode returns a string.

fmt.Println(encoder.Decode(encoded))
// this is a string to be encoded
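
Putting the snippets together, a complete program looks roughly like the following. This is a minimal sketch that assumes the single-return signature of NewFromPrebuilt shown above; check the package documentation for whether an error is also returned.

package main

import (
  "fmt"

  "github.com/cohere-ai/tokenizer"
)

func main() {
  // Load the prebuilt coheretext-50k tokenizer bundled with the library.
  encoder := tokenizer.NewFromPrebuilt("coheretext-50k")

  // Encode returns the token IDs as a slice of int64s.
  encoded := encoder.Encode("this is a string to be encoded")
  fmt.Println(encoded) // [6372 329 258 3852 288 345 37754]

  // Decode maps the token IDs back to the original text.
  fmt.Println(encoder.Decode(encoded))
}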

Speed

On a 2.5 GHz CPU, encoding 1000 tokens takes approximately 6.5 milliseconds, and decoding 1000 tokens takes approximately 0.2 milliseconds.
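
To check these numbers on your own hardware, a standard Go benchmark along the following lines can be used. This is a sketch under the same assumptions as the example above; the sample text is only a rough approximation of 1000 tokens, so adjust it to your workload.

package tokenizer_test

import (
  "strings"
  "testing"

  "github.com/cohere-ai/tokenizer"
)

// Repeating a 7-token phrase ~150 times gives roughly 1000 tokens of input.
var sample = strings.Repeat("this is a string to be encoded ", 150)

func BenchmarkEncode(b *testing.B) {
  encoder := tokenizer.NewFromPrebuilt("coheretext-50k")
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    encoder.Encode(sample)
  }
}

func BenchmarkDecode(b *testing.B) {
  encoder := tokenizer.NewFromPrebuilt("coheretext-50k")
  encoded := encoder.Encode(sample)
  b.ResetTimer()
  for i := 0; i < b.N; i++ {
    encoder.Decode(encoded)
  }
}

Run go test -bench=. to get per-operation timings comparable to the figures above.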

# Functions

The exported functions have no descriptions provided by the author.

# Structs

The exported structs have no descriptions provided by the author.