New in llama.cpp: Model Management

victor master avatar

The llama.cpp server now comes with router mode, allowing you to dynamically load, unload, and switch between multiple models without restarting.

Note: The llama.cpp server is a lightweight OpenAI-compatible HTTP server for running LLM locally.

This feature was a popular request to bring Ollama-style model management to llama.cpp. We use a multi-process architecture where each model runs in its own process, so if one model crashes, other models are not affected.

quick start

Starts the server in router mode without specifying a model.

llama server

This will auto-detect the model from the llama.cpp cache (LLAMA_CACHE or ~/.cache/llama.cpp). If you have previously downloaded models via llama-server -hf user/model, they are automatically available.

You can also specify a local directory for the GGUF file.

llama server –models-dir ./my-models

Features

Auto-detection: scans the llama.cpp cache (default) or a custom –models-dir folder for GGUF files On-demand loading: models are automatically loaded the first time they are requested LRU eviction: pressing –models-max (default: 4) unloads the most recently used model Request routing: the model field in the request determines which model handles it

example

Chat with specific models

curl http://localhost:8080/v1/chat/completions \ -H “Content type: application/json” \ -d ‘{
“Model”: “ggml-org/gemma-3-4b-it-GGUF:Q4_K_M”,
“Message”: ({“Role”: “User”, “Content”: “Hello!”})
}’

On the first request, the server automatically loads the model into memory (load time depends on the size of the model). Subsequent requests for the same model will be instantaneous since it is already loaded.

List available models

curl http://localhost:8080/models

Returns all discovered models along with their status (loaded, loading, or unloaded).

Load the model manually

curl -X POST http://localhost:8080/models/load \ -H “Content type: application/json” \ -d ‘{“model”: “my-model.gguf”}’

Unload the model to free up VRAM

curl -X POST http://localhost:8080/models/unload \ -H “Content type: application/json” \ -d ‘{“model”: “my-model.gguf”}’

Main options

Flag Description –models-dir PATH Directory containing GGUF files –models-max N Maximum number of simultaneously loaded models (default: 4) –no-models-autoload Disable automatic loading. Requires explicit /models/load call

All model instances inherit settings from the router.

llama server –model-directory ./model -c 8192 -ngl 99

All loaded models use 8192 contexts and full GPU offload. You can also define per-model settings using presets.

llama server –model-preset config.ini

(my model)
model = /path/to/model.gguf
ctx size = 65536
temperature = 0.7

Also available in web UI

The built-in web UI also supports model switching. Just select your model from the dropdown and it will load automatically.

join the conversation

We hope this feature makes it easy to A/B test different model versions, perform multi-tenant deployments, or simply switch models during development without restarting the server.

Have questions or feedback? Drop a comment below or open an issue on GitHub.

versatileai

See Full Bio

What's Hot

NHS AI blood test could reduce invasive uterine cancer testing

How to shrink your token budget without downsizing your team

Native-speed vLLM Transformer Modeling Backend

NHS AI blood test could reduce invasive uterine cancer testing

How to shrink your token budget without downsizing your team

Native-speed vLLM Transformer Modeling Backend

L’Oréal, Mondelez and Nestlé use AI to speed up product development

Physical AI Conference Held in San Jose as Robotics and Autonomous AI Go Mainstream

New features in Mellea 0.4.0 + Granite library release

Most Popular

L’Oréal, Mondelez and Nestlé use AI to speed up product development

Physical AI Conference Held in San Jose as Robotics and Autonomous AI Go Mainstream

New features in Mellea 0.4.0 + Granite library release

Don't Miss

NHS AI blood test could reduce invasive uterine cancer testing

How to shrink your token budget without downsizing your team

Native-speed vLLM Transformer Modeling Backend

Subscribe to Updates

What's Hot

New in llama.cpp: Model Management

quick start

Features

example

Chat with specific models

List available models

Load the model manually

Unload the model to free up VRAM

Main options

Also available in web UI

join the conversation

Related Posts