# inference-manager
The inference-manager manages inference runtimes (e.g., vLLM and Ollama) in containers, loads models, and processes requests.
## Set up Inference Server/Engine for development

Requirements:

Run the following command:

```bash
make setup-all
```
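
Once `setup-all` finishes, a quick `kubectl` check can confirm the components came up. This is only a sketch: the `llmariner` namespace and the `inference-manager-engine` deployment name are assumptions and may differ in your cluster.

```bash
# Hypothetical check after setup-all; namespace and deployment name are assumptions.
kubectl get pods -n llmariner
kubectl logs -n llmariner deployment/inference-manager-engine --tail=20
```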
> [!TIP]
> - Running just `make helm-reapply-inference-server` or `make helm-reapply-inference-engine` will rebuild the inference-manager container images, deploy them with the local Helm chart, and restart the containers (see the sketch below).
> - You can configure parameters in `.values.yaml`.
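
As a rough illustration of that edit-and-reapply loop (the `.values.yaml` location at the repo root and the watch command are assumptions, not part of the Makefile):

```bash
# Sketch of the iteration loop; adjust paths and namespace to your environment.
"${EDITOR:-vi}" .values.yaml          # tweak Helm values
make helm-reapply-inference-engine    # rebuild images and redeploy via the local chart
kubectl get pods -w                   # watch the engine pods restart
```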
## Run vLLM on ARM macOS

To run vLLM on an ARM CPU (macOS), you need to build an image:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.arm -t vllm-cpu-env --shm-size=4g .
kind load docker-image vllm-cpu-env:latest
```
Then run `make` with the `RUNTIME` option:

```bash
make setup-all RUNTIME=vllm
```
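
To confirm the runtime picked up the locally built image, something along these lines can help; the namespace is an assumption and will likely differ in your setup.

```bash
# Hypothetical verification that a runtime pod uses the local vllm-cpu-env image.
kubectl get pods -n llmariner \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
```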
> [!NOTE]
> See vLLM - ARM installation for details.
## Try out inference APIs

With `curl`:
```bash
curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "google-gemma-2b-it-q4_0",
  "messages": [{"role": "user", "content": "hello"}]
}'
```
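
If you only want the generated text, the response can be piped through `jq`; this assumes the endpoint returns an OpenAI-style chat completion body, which is not guaranteed here.

```bash
# Assumes an OpenAI-compatible response shape: {"choices":[{"message":{"content":...}}]}.
curl --silent --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "google-gemma-2b-it-q4_0",
  "messages": [{"role": "user", "content": "hello"}]
}' | jq -r '.choices[0].message.content'
```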
With `llma`:

```bash
export LLMARINER_API_KEY=dummy
llma chat completions create \
  --model google-gemma-2b-it-q4_0 \
  --role system \
  --completion 'hi'
```