# inference-manager

The inference-manager manages inference runtimes (e.g., vLLM and Ollama) in containers, loads models, and processes requests.

## Set up Inference Server/Engine for development

Requirements:

Run the following command:

```bash
make setup-all
```

> [!TIP]
>
> - If you only need to rebuild and redeploy, run `make helm-reapply-inference-server` or `make helm-reapply-inference-engine`. This rebuilds the inference-manager container images, deploys them using the local Helm chart, and restarts the containers.
> - You can configure parameters in `.values.yaml`.

## Run vLLM on ARM macOS

To run vLLM on an ARM CPU (macOS), you'll need to build an image.

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.arm -t vllm-cpu-env --shm-size=4g .
kind load docker-image vllm-cpu-env:latest
```

Then, run `make` with the `RUNTIME` option:

```bash
make setup-all RUNTIME=vllm
```

> [!NOTE]
> See vLLM - ARM installation for details.

## Try out inference APIs

With `curl`:

```bash
curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "google-gemma-2b-it-q4_0",
  "messages": [{"role": "user", "content": "hello"}]
}'
```

With `llma`:

```bash
export LLMARINER_API_KEY=dummy
llma chat completions create \
    --model google-gemma-2b-it-q4_0 \
    --role system \
    --completion 'hi'
```
