Close Menu
Versa AI hub
  • AI Ethics
  • AI Legislation
  • Business
  • Cybersecurity
  • Media and Entertainment
  • Content Creation
  • Art Generation
  • Research
  • Tools
  • Resources

Subscribe to Updates

Subscribe to our newsletter and stay updated with the latest news and exclusive offers.

What's Hot

Musk and Zuckerberg convinced Trump to repeal AI executive order

May 26, 2026

Introducing Gemini Omni

May 25, 2026

IMDA updates AI framework, OpenAI opens Singapore AI Lab

May 24, 2026
Facebook X (Twitter) Instagram
Versa AI hubVersa AI hub
Tuesday, May 26
Facebook X (Twitter) Instagram
Login
  • AI Ethics
  • AI Legislation
  • Business
  • Cybersecurity
  • Media and Entertainment
  • Content Creation
  • Art Generation
  • Research
  • Tools
  • Resources
Versa AI hub
Home»Tools»New in llama.cpp: Model Management
Tools

New in llama.cpp: Model Management

versatileaiBy versatileaiDecember 12, 2025No Comments3 Mins Read
Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
#image_title
Share
Facebook Twitter LinkedIn Pinterest Email



victor master avatar

The llama.cpp server now comes with router mode, allowing you to dynamically load, unload, and switch between multiple models without restarting.

Note: The llama.cpp server is a lightweight OpenAI-compatible HTTP server for running LLM locally.

This feature was a popular request to bring Ollama-style model management to llama.cpp. We use a multi-process architecture where each model runs in its own process, so if one model crashes, other models are not affected.

quick start

Starts the server in router mode without specifying a model.

llama server

This will auto-detect the model from the llama.cpp cache (LLAMA_CACHE or ~/.cache/llama.cpp). If you have previously downloaded models via llama-server -hf user/model, they are automatically available.

You can also specify a local directory for the GGUF file.

llama server –models-dir ./my-models

Features

Auto-detection: scans the llama.cpp cache (default) or a custom –models-dir folder for GGUF files On-demand loading: models are automatically loaded the first time they are requested LRU eviction: pressing –models-max (default: 4) unloads the most recently used model Request routing: the model field in the request determines which model handles it

example

Chat with specific models

curl http://localhost:8080/v1/chat/completions \ -H “Content type: application/json” \ -d ‘{
“Model”: “ggml-org/gemma-3-4b-it-GGUF:Q4_K_M”,
“Message”: ({“Role”: “User”, “Content”: “Hello!”})
}’

On the first request, the server automatically loads the model into memory (load time depends on the size of the model). Subsequent requests for the same model will be instantaneous since it is already loaded.

List available models

curl http://localhost:8080/models

Returns all discovered models along with their status (loaded, loading, or unloaded).

Load the model manually

curl -X POST http://localhost:8080/models/load \ -H “Content type: application/json” \ -d ‘{“model”: “my-model.gguf”}’

Unload the model to free up VRAM

curl -X POST http://localhost:8080/models/unload \ -H “Content type: application/json” \ -d ‘{“model”: “my-model.gguf”}’

Main options

Flag Description –models-dir PATH Directory containing GGUF files –models-max N Maximum number of simultaneously loaded models (default: 4) –no-models-autoload Disable automatic loading. Requires explicit /models/load call

All model instances inherit settings from the router.

llama server –model-directory ./model -c 8192 -ngl 99

All loaded models use 8192 contexts and full GPU offload. You can also define per-model settings using presets.

llama server –model-preset config.ini

(my model)
model = /path/to/model.gguf
ctx size = 65536
temperature = 0.7

Also available in web UI

The built-in web UI also supports model switching. Just select your model from the dropdown and it will load automatically.

join the conversation

We hope this feature makes it easy to A/B test different model versions, perform multi-tenant deployments, or simply switch models during development without restarting the server.

Have questions or feedback? Drop a comment below or open an issue on GitHub.

author avatar
versatileai
See Full Bio
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
Previous ArticleAI tools revolutionize social media images
Next Article Disney invests $1 billion in OpenAI, licenses over 200 characters for Sora AI tool
versatileai

Related Posts

Tools

Musk and Zuckerberg convinced Trump to repeal AI executive order

May 26, 2026
Tools

Introducing Gemini Omni

May 25, 2026
Tools

IMDA updates AI framework, OpenAI opens Singapore AI Lab

May 24, 2026
Add A Comment

Comments are closed.

Top Posts

Edimakor V4.2.0 unveils AI video tools at VEO 3

August 4, 202553 Views

Pillar Security raises $9 million to create AI security guardrails for businesses

April 18, 202537 Views

10 Best AI for PowerPoint presentations

February 13, 202536 Views
Stay In Touch
  • YouTube
  • TikTok
  • Twitter
  • Instagram
  • Threads
Latest Reviews

Subscribe to Updates

Subscribe to our newsletter and stay updated with the latest news and exclusive offers.

Most Popular

Edimakor V4.2.0 unveils AI video tools at VEO 3

August 4, 202553 Views

Pillar Security raises $9 million to create AI security guardrails for businesses

April 18, 202537 Views

10 Best AI for PowerPoint presentations

February 13, 202536 Views
Don't Miss

Musk and Zuckerberg convinced Trump to repeal AI executive order

May 26, 2026

Introducing Gemini Omni

May 25, 2026

IMDA updates AI framework, OpenAI opens Singapore AI Lab

May 24, 2026
Service Area
X (Twitter) Instagram YouTube TikTok Threads RSS
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer
© 2026 Versa AI Hub. All Rights Reserved.

Type above and press Enter to search. Press Esc to cancel.

Sign In or Register

Welcome Back!

Login to your account below.

Lost password?