# README
The `llm-go` package is a Go wrapper for llama.cpp that supports running large language models (specifically LLaMA models) in Go. It was derived from ollama's wrapper before their shift to embedding llama-server inside their own server. (If you need an easy-to-use local LLaMA server, please use ollama.ai.)

This package is not meant to be API compatible with ollama's (soon to be deprecated) wrapper, nor is its API stable yet. We are still in the middle of a refactor, and the GGML-to-GGUF shift in llama.cpp means more work must be done.
## Quickstart
Assuming you are on a Mac M1 (or M2) and have Go and the Apple SDK, the following should "just work":
```sh
wget https://huggingface.co/TheBloke/vicuna-7B-v1.5-GGML/resolve/main/vicuna-7b-v1.5.ggmlv3.q5_K_M.bin
go run github.com/swdunlop/llm-go/examples/io-worker vicuna-7b-v1.5.ggmlv3.q5_K_M.bin
```
From here, you can enter JSON prompts on stdin and get a stream of JSON predictions on stdout. Sample input:
```
// io-worker consumes JSONL, each object is processed serially and consists of a prompt for prediction.
{"prompt": "What is the softest breed of llama?"}
```
And sample output:
```
// io-worker emits JSONL, strings for incremental predictions, and a final JSON object with timing information
"\n"
" surely"
" not"
" the"
" one"
" that"
" s"
"ells"
" for"
" "
"1"
"."
"5"
" million"
" dollars"
// the final completion has the combined response and wall clock time.
{"response":"\n surely not the one that sells for 1.5 million dollars","seconds":0.942433625}
```
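The same protocol can be driven from another program. The sketch below assumes only the JSONL exchange shown above (a `{"prompt": ...}` object in, a stream of JSON strings followed by a final object out) and runs io-worker as a subprocess using nothing but the Go standard library; the model path and module path are taken from the quickstart.

```go
// Sketch: drive the io-worker example from Go by writing JSONL prompts to its
// stdin and decoding the JSON values it streams back on stdout. The protocol
// is inferred from the sample session above, not a documented contract.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command("go", "run",
		"github.com/swdunlop/llm-go/examples/io-worker",
		"vicuna-7b-v1.5.ggmlv3.q5_K_M.bin")

	stdin, err := cmd.StdinPipe()
	if err != nil {
		log.Fatal(err)
	}
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Send one prompt as a single JSON line.
	enc := json.NewEncoder(stdin)
	if err := enc.Encode(map[string]string{"prompt": "What is the softest breed of llama?"}); err != nil {
		log.Fatal(err)
	}
	stdin.Close() // end of input (assuming io-worker exits on EOF)

	// Read the stream of JSON values: strings are incremental predictions,
	// the final object carries the combined response and timing.
	dec := json.NewDecoder(stdout)
	for {
		var v any
		if err := dec.Decode(&v); err != nil {
			break // io.EOF once io-worker exits
		}
		switch t := v.(type) {
		case string:
			fmt.Print(t) // incremental prediction
		case map[string]any:
			fmt.Printf("\n-- response in %v seconds\n", t["seconds"])
		}
	}
	_ = cmd.Wait()
}
```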
## Supported Platforms

"Support" is a dirty word. This package is a wrapper around a C++ library that changes very fast. It works on Mac M1 using Metal acceleration. We would like it to work on Linux, but we also want to keep the build simple. (This is why we based `llm-go` on ollama's wrapper -- they got it working with just Go tools, no makefiles required.)
On macOS you will need the Apple SDK for the following frameworks:
- Accelerate
- MetalKit
- MetalPerformanceShaders
If you use Nix on macOS, our flake should provide all the dependencies you need. (Keep in mind that if you use Nix to build, your binary will be linked against the Nix store, which means it will not run on other Macs.)
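For the curious, cgo pulls those frameworks in with `-framework` link flags. The file below is only an illustration of the flags involved (the package name and file placement here are hypothetical, not llm-go's actual build configuration):

```go
// Illustration only: linking the Apple frameworks listed above via cgo on macOS.
package llm

/*
#cgo darwin LDFLAGS: -framework Accelerate -framework MetalKit -framework MetalPerformanceShaders
*/
import "C"
```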
## Updating GGML or LLaMA
Like the original ollama wrapper, `llm-go` currently uses a script to pull in the C++ code and headers from a llama.cpp checkout. This script also prepends Go build tags to control which features are built. (For example, if you don't have Metal acceleration, you can build without it.)
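As a rough illustration of what that prepending step looks like, a Metal-specific source might gain a constraint like the one below. The `darwin` tag here is just an example; the exact tags are whatever the script emits.

```go
//go:build darwin
// +build darwin

// Illustration only: a build constraint prepended by the vendoring script so
// that platform-specific code is compiled only where it applies. The actual
// tags used by llm-go may differ; check the script itself.
package llm
```

Because `go build` also honors build constraints placed near the top of cgo-compiled C, C++, and Objective-C sources, the same `//go:build` comment can sit at the top of the vendored non-Go files.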
## Using NATS Workers
The `llm worker` command will subscribe to a NATS subject and process prediction requests using the model specified in its environment. This can be combined with the `llm client` / `llm predict` command, or with the ./nats package, to request predictions over a NATS network. This is particularly useful for running multiple instances of a model on other hosts.
### Example Usage
The following three Bash commands will start a NATS server, start an `llm worker` that generates predictions for requests sent to `llm.worker.default`, and then connect to it with `llm predict` to generate a prediction.
```sh
gnatsd &
llm_model=vicuna-7b-v1.5.ggmlv3.q5_K_M.bin llm worker &
echo "What is the airspeed of an unladen swallow?" | llm_type=nats go run ./cmd/llm predict
```
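To request a prediction from Go rather than from the shell, the NATS request can also be made directly with the nats.go client. This is only a sketch: the subject comes from the example above, but the JSON payload shape is borrowed from the io-worker example and is an assumption here; see the ./nats package for the supported request format.

```go
// Sketch: request a prediction from a running `llm worker` over NATS using the
// nats.go client. Assumes the worker accepts a JSON {"prompt": ...} payload on
// llm.worker.default and replies with the prediction; the real wire format may
// differ.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // nats://127.0.0.1:4222
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	req, err := json.Marshal(map[string]string{
		"prompt": "What is the airspeed of an unladen swallow?",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Request/reply with a generous timeout; prediction can take a while.
	msg, err := nc.Request("llm.worker.default", req, 2*time.Minute)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(msg.Data))
}
```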