The llama.cpp server now comes with router mode, allowing you to dynamically load, unload, and switch between multiple models without restarting.
Note: The llama.cpp server is a lightweight OpenAI-compatible HTTP server for running LLMs locally.
This feature was a popular request to bring Ollama-style model management to llama.cpp. We use a multi-process architecture where each model runs in its own process, so if one model crashes, other models are not affected.
Quick start
Start the server in router mode by launching it without specifying a model:
llama-server
This will auto-detect models from the llama.cpp cache (LLAMA_CACHE or ~/.cache/llama.cpp). If you have previously downloaded models via llama-server -hf user/model, they are automatically available.
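For example, a model fetched beforehand with the -hf flag lands in that cache and is picked up by the router on its next start (using the same repository as the chat example later in this post):

llama-server -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M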
You can also specify a local directory containing your GGUF files:
llama-server --models-dir ./my-models
Features
- Auto-detection: scans the llama.cpp cache (default) or a custom --models-dir folder for GGUF files
- On-demand loading: models are automatically loaded the first time they are requested
- LRU eviction: exceeding --models-max (default: 4) unloads the least recently used model (see the sketch right after this list)
- Request routing: the model field in the request determines which model handles it
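For instance, to keep memory pressure low you can lower the cap; with the sketch below, requesting a third model causes whichever of the two loaded models was used least recently to be unloaded first:

llama-server --models-max 2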
Example
Chat with a specific model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
On the first request, the server automatically loads the model into memory (load time depends on the size of the model). Subsequent requests for the same model will be instantaneous since it is already loaded.
List available models
curl http://localhost:8080/models
Returns all discovered models along with their status (loaded, loading, or unloaded).
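An illustrative response might look like the following (a sketch only; the exact field names are an assumption, not the definitive schema):

{
  "object": "list",
  "data": [
    { "id": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M", "object": "model", "status": "loaded" },
    { "id": "my-model.gguf", "object": "model", "status": "unloaded" }
  ]
}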
Load the model manually
curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'
Unload the model to free up VRAM
curl -X POST http://localhost:8080/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'
Main options
| Flag | Description |
| --- | --- |
| --models-dir PATH | Directory containing GGUF files |
| --models-max N | Maximum number of simultaneously loaded models (default: 4) |
| --no-models-autoload | Disable automatic loading; requires an explicit /models/load call |
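For example, when started with autoload disabled, a model must be loaded explicitly (via the /models/load endpoint shown above) before chat requests to it will succeed:

llama-server --no-models-autoload

curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'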
All model instances inherit the command-line settings passed to the router:
llama-server --models-dir ./models -c 8192 -ngl 99
All loaded models will use a context size of 8192 and full GPU offload. You can also define per-model settings using presets:
llama-server --model-preset config.ini
[my-model]
model = /path/to/model.gguf
ctx-size = 65536
temperature = 0.7
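If the preset's section name (my-model above) is also used as the model name in requests, which is an assumption here rather than something stated above, the preset can then be targeted like any other model:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello!"}]}'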
Also available in the web UI
The built-in web UI also supports model switching. Just select your model from the dropdown and it will load automatically.
Join the conversation
We hope this feature makes it easy to A/B test different model versions, perform multi-tenant deployments, or simply switch models during development without restarting the server.
Have questions or feedback? Drop a comment below or open an issue on GitHub.

