github.com/gomlx/tokenizers
module, package
Version: 0.0.0-20240824032737-1de48c6f6440
Repository: https://github.com/gomlx/tokenizers.git
Documentation: pkg.go.dev

# README

Tokenizers for Go

Under Construction

This library is not functional yet. For Gemma/Gemini/T5 and other Google models, see https://github.com/eliben/go-sentencepiece/.

About

Tokenizers for Language Models - Go API for HuggingFace Tokenizers

Highlights

[!IMPORTANT]
TODO: nothing implemented yet.

  • Allow customization for various LLMs, exposing most of the functionality of the HuggingFace Tokenizers library.
  • Provide a from_pretrained API that downloads parameters for various known models, leveraging the HuggingFace Hub.

Installation

This library is a wrapper around the Rust implementation by HuggingFace, and it requires the compiled Rust code to be available as the static library libgomlx_tokenizers.a.

To make that easy, the project provides a prebuilt libgomlx_tokenizers.a in the git repository for the popular platforms, so most users need no extra setup and can import it like any other Go library. The only requirement is that CGO is enabled; for cross-compilation, set CGO_ENABLED=1.

If you want to build the underlying Rust wrapper and its dependencies yourself for any reason (for example, to add support for a different platform), the project uses the Mage build system, a Makefile-like tool that uses Go.

If you create a new rule for a different platform, please consider contributing it back :smile:

[!IMPORTANT]
TODO

Thank You

Questions

Why fork and not collaborate with an already existing tokenizers project?

I plan to revamp how the library is organized, make its "ergonomics" more aligned with GoMLX APIs, and add documentation. I will also expand the functionality to match HuggingFace's library as closely as I am able to. All this will completely break the API of the original repositories, which I felt was too much to ask of the original authors.

# Packages

No description provided by the author

# Functions

DefaultCacheDir for HuggingFace Hub, the same one used by the Python library.
Download returns file either from cache or by downloading from HuggingFace Hub.
FileExists returns true if file or directory exists.
FromPretrainedWith creates a new Tokenizer by downloading the pretrained tokenizer corresponding to the name.
GetHeaders is based on the `build_hf_headers` function defined in the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library.
GetUrl is based on the `hf_hub_url` function defined in the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library.
HttpUserAgent returns a user agent to use with HuggingFace Hub API.
RepoFolderName returns a serialized version of a hf.co repo name and type, safe for disk storage as a single non-nested folder.
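
Since GetUrl is described as based on `hf_hub_url`, the following is a minimal sketch of the URL layout that the Python function produces. It is not this package's implementation: the helper name `hubFileURL`, its parameters, and the hard-coded endpoint are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hubEndpoint is the public HuggingFace Hub host (assumed; the real
// library may allow overriding it).
const hubEndpoint = "https://huggingface.co"

// hubFileURL is a hypothetical helper mirroring the layout used by the
// Python hf_hub_url function:
//
//	{endpoint}/{prefix}{repo_id}/resolve/{revision}/{filename}
//
// where prefix is "datasets/" or "spaces/" for those repo types and
// empty for models.
func hubFileURL(repoID, repoType, revision, filename string) string {
	if revision == "" {
		revision = "main" // default branch on the Hub
	}
	prefix := ""
	switch repoType {
	case "dataset":
		prefix = "datasets/"
	case "space":
		prefix = "spaces/"
	}
	// Escape each path segment of the file name separately so that the
	// "/" separators between sub-folders are preserved.
	parts := strings.Split(filename, "/")
	for i, p := range parts {
		parts[i] = url.PathEscape(p)
	}
	// The revision is escaped as a single segment, so "refs/pr/1"
	// becomes "refs%2Fpr%2F1".
	return fmt.Sprintf("%s/%s%s/resolve/%s/%s",
		hubEndpoint, prefix, repoID, url.PathEscape(revision), strings.Join(parts, "/"))
}

func main() {
	fmt.Println(hubFileURL("google-bert/bert-base-uncased", "model", "", "tokenizer.json"))
	// prints "https://huggingface.co/google-bert/bert-base-uncased/resolve/main/tokenizer.json"
}
```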

# Constants

No description provided by the author
No description provided by the author
No description provided by the author
RepoIdSeparator is used to separate repository/model names parts when mapping to file names.
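
To illustrate how a separator like RepoIdSeparator maps a repository name to a single folder, here is a small sketch of the scheme used by the Python huggingface_hub cache. The separator value `"--"` and the helper `repoFolderName` are taken from the Python library's behavior and are assumptions here; this package's actual constant value and function signature may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// repoIDSeparator is the value used by the Python huggingface_hub
// library; assumed (not confirmed) to match RepoIdSeparator.
const repoIDSeparator = "--"

// repoFolderName is a hypothetical helper that flattens a repo ID such
// as "org/name" plus its type into one non-nested folder name:
// the pluralized repo type followed by each part of the repo ID,
// joined by the separator.
func repoFolderName(repoID, repoType string) string {
	parts := append([]string{repoType + "s"}, strings.Split(repoID, "/")...)
	return strings.Join(parts, repoIDSeparator)
}

func main() {
	fmt.Println(repoFolderName("google/gemma-2b", "model"))
	// prints "models--google--gemma-2b"
}
```

Flattening the name this way avoids nested directories for `org/name` style IDs, so each repository occupies exactly one folder in the cache.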

# Variables

DefaultDirCreationPerm is used when creating new cache subdirectories.
DefaultFileCreationPerm is used when creating files inside the cache subdirectories.
No description provided by the author
No description provided by the author
No description provided by the author
No description provided by the author

# Structs

HFFileMetadata used by HuggingFace Hub.
PretrainedConfig for how to download (or load from disk) a pretrained Tokenizer.

# Type aliases

ProgressFn is a function called while downloading a file.