# inference-manager
The inference-manager manages inference runtimes (e.g., vLLM and Ollama) in containers, loads models, and processes requests.
## Set up Inference Server/Engine for development

Requirements:

Run the following command:

```bash
make setup-all
```
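
Once `setup-all` finishes, a quick `kubectl` check can confirm the components came up. This is only a sketch: the `llmariner` namespace and the `inference-manager-engine` deployment name are assumptions and may differ in your cluster.

```bash
# Hypothetical check after setup-all; namespace and deployment name are assumptions.
kubectl get pods -n llmariner
kubectl logs -n llmariner deployment/inference-manager-engine --tail=20
```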
> [!TIP]
> - Running just `make helm-reapply-inference-server` or `make helm-reapply-inference-engine` will rebuild the inference-manager container images, deploy them with the local Helm chart, and restart the containers (see the sketch below).
> - You can configure parameters in `.values.yaml`.
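
As a rough illustration of that edit-and-reapply loop (the `.values.yaml` location at the repo root and the watch command are assumptions, not part of the Makefile):

```bash
# Sketch of the iteration loop; adjust paths and namespace to your environment.
"${EDITOR:-vi}" .values.yaml          # tweak Helm values
make helm-reapply-inference-engine    # rebuild images and redeploy via the local chart
kubectl get pods -w                   # watch the engine pods restart
```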
## Run vLLM on ARM macOS

To run vLLM on an ARM CPU (macOS), you need to build an image:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.arm -t vllm-cpu-env --shm-size=4g .
kind load docker-image vllm-cpu-env:latest
```
Then run `make` with the `RUNTIME` option:

```bash
make setup-all RUNTIME=vllm
```
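
To confirm the runtime picked up the locally built image, something along these lines can help; the namespace is an assumption and will likely differ in your setup.

```bash
# Hypothetical verification that a runtime pod uses the local vllm-cpu-env image.
kubectl get pods -n llmariner \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
```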
> [!NOTE]
> See vLLM - ARM installation for details.
## Try out inference APIs

With `curl`:
```bash
curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "google-gemma-2b-it-q4_0",
  "messages": [{"role": "user", "content": "hello"}]
}'
```
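
If you only want the generated text, the response can be piped through `jq`; this assumes the endpoint returns an OpenAI-style chat completion body, which is not guaranteed here.

```bash
# Assumes an OpenAI-compatible response shape: {"choices":[{"message":{"content":...}}]}.
curl --silent --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "google-gemma-2b-it-q4_0",
  "messages": [{"role": "user", "content": "hello"}]
}' | jq -r '.choices[0].message.content'
```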
With `llma`:

```bash
export LLMARINER_API_KEY=dummy
llma chat completions create \
  --model google-gemma-2b-it-q4_0 \
  --role system \
  --completion 'hi'
```