Mellum2 is a 12B-parameter Mixture-of-Experts (MoE) model trained from scratch on natural language and code. It activates only 2.5 billion parameters per token, which allows efficient, high-throughput, low-latency inference. Mellum2 can be used for routing, RAG, summarization, subagents, high-throughput coding tasks, and private deployments. It is released under the Apache 2.0 License. Mellum2 delivers competitive benchmark performance while achieving over 2x faster inference than similarly sized models.
Download the model on Hugging Face: https://huggingface.co/collections/JetBrains/mellum-2
Read the full technical report for architecture details, training settings, benchmarks, and evaluation methods: https://arxiv.org/pdf/2605.31268
Today, we're releasing Mellum2, an open Mixture-of-Experts model optimized for low-latency text and code workloads. Mellum originally started as a code completion model. Mellum2 keeps that focus on efficient inference and deployability while extending its foundation to a broader range of natural language and software engineering tasks.
Modern AI systems increasingly rely on multiple model calls, including routing, retrieval, summarization, planning, validation, and tool use. Many of these operations are latency-sensitive and do not require the largest model available. Mellum2 targets these workloads.
Benchmark highlights

Our technical report evaluates Mellum2 across code generation, reasoning, science, and math benchmarks. Mellum2 delivers more than 2x faster inference while remaining competitive with similarly sized open models, making it suitable for high-throughput production workloads.
Model architecture
Mellum2 is a Mixture-of-Experts (MoE) model.
| Model | Total parameters | Active parameters per token | Modality | License |
| --- | --- | --- | --- | --- |
| Mellum2 | 12B | 2.5B | Text and code | Apache 2.0 |
The MoE architecture keeps the total capacity of the model high while activating only a subset of the parameters for each token. This makes inference more efficient and reduces processing costs for real-time workloads. Mellum2 intentionally focuses on text and code rather than multimodal tasks. This specialization keeps the model compact and efficient for software engineering workloads.
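To make the idea of activating only a subset of parameters concrete, here is a minimal sketch of generic top-k expert routing. It is not Mellum2's actual router: the expert count, the value of `k`, the scalar "experts", and the function names are all hypothetical and chosen only for illustration. The technical report describes the real architecture. The only figures taken from this post are the 12B total and 2.5B active parameter counts.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Renormalize the gate weights over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_logits, k):
    """Evaluate only the selected experts and mix their outputs."""
    out = 0.0
    for idx, weight in top_k_route(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Hypothetical toy setup: 8 scalar "experts", 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

routes = top_k_route(logits, k=2)
y = moe_forward(1.0, experts, logits, k=2)

# Share of parameters active per token, using the figures stated in this post.
active_fraction = 2.5 / 12
```

In this toy run only two of the eight experts are ever called for the token, which is the source of the inference savings. With the stated figures, roughly a fifth of Mellum2's parameters are active per token.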
Key use cases
Routing and orchestration
Mellum2 works well as a lightweight routing and orchestration model in multi-model systems, including prompt classification, tool selection, and intermediate control flow steps.
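A routing step of this kind might look like the sketch below. Everything here is an assumption made for illustration: the route labels, the prompt wording, the `route` helper, and the handlers are hypothetical, and `fake_small_model` is a keyword-based stand-in for a real call to a small model such as Mellum2, so the example runs without any model or network access.

```python
from typing import Callable, Dict

ROUTES = ("code", "search", "chat")  # hypothetical route labels

def build_prompt(user_message: str) -> str:
    """Build a classification prompt that asks for a single label."""
    labels = ", ".join(ROUTES)
    return (
        f"Classify the request into one of: {labels}.\n"
        f"Request: {user_message}\n"
        "Answer with the label only."
    )

def route(user_message: str,
          small_model: Callable[[str], str],
          handlers: Dict[str, Callable[[str], str]],
          default: str = "chat") -> str:
    """Ask the small model for a label, then dispatch to the matching handler."""
    raw = small_model(build_prompt(user_message)).strip().lower()
    # Fall back to the default route when the model returns an unknown label.
    label = raw if raw in handlers else default
    return handlers[label](user_message)

def fake_small_model(prompt: str) -> str:
    """Stand-in for a real model call; uses keywords instead of inference."""
    request = prompt.split("Request: ", 1)[1].split("\n", 1)[0].lower()
    if "function" in request or "bug" in request:
        return "code"
    if "find" in request:
        return "search"
    return "unsure"

# Hypothetical downstream components.
handlers = {
    "code": lambda m: f"[large code model] {m}",
    "search": lambda m: f"[retriever] {m}",
    "chat": lambda m: f"[small model] {m}",
}

print(route("Fix the bug in parse()", fake_small_model, handlers))
```

Passing the model in as a plain callable keeps the routing logic independent of any particular serving stack, and the fallback label guards against free-form model output.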
RAG pipelines
The model is suitable for latency-sensitive retrieval pipelines, handling steps such as context compression, summarization, and post-retrieval processing.
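As one possible shape for the context compression step, the sketch below summarizes retrieved chunks in rank order until a character budget is used up. The `compress_context` helper, the character-based budget, and the `first_sentence` stand-in summarizer are hypothetical. In a real pipeline the `summarize` callable would wrap a call to a small model, and the budget would more likely be counted in tokens.

```python
from typing import Callable, List

def compress_context(chunks: List[str],
                     summarize: Callable[[str], str],
                     max_chars: int) -> str:
    """Summarize chunks in rank order and stop once the budget is exceeded."""
    parts: List[str] = []
    used = 0
    for chunk in chunks:
        summary = summarize(chunk).strip()
        if not summary:
            continue  # skip chunks the summarizer considers empty
        if used + len(summary) > max_chars:
            break  # lower-ranked chunks are dropped once the budget is full
        parts.append(summary)
        used += len(summary)
    return "\n".join(parts)

def first_sentence(text: str) -> str:
    """Stand-in summarizer: keeps only the first sentence of a chunk."""
    return text.split(". ", 1)[0].rstrip(".") + "."

retrieved = [
    "Alpha is first. More detail here.",
    "Beta is second. Extra.",
    "Gamma is third. Extra.",
]
compressed = compress_context(retrieved, first_sentence, max_chars=40)
```

Because each chunk is summarized independently, this step is called once per retrieved chunk, which is the kind of high-frequency call where a fast model matters most.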
Subagents
Mellum2 can be used for agent subtasks such as planning, validation, transformation, and context preparation, reducing the need to invoke large models for intermediate operations.
Private deployment
Mellum2 is openly available and efficient to serve, so it can be deployed in self-hosted environments that contain proprietary code and internal data.
Why a well-scoped model is important
As AI systems mature, the most effective architectures are becoming less monolithic. While a single frontier model can be powerful, production systems often require multiple specialized components (retrievers, routers, code-aware models, validators, tool callers, and larger reasoning models) working together. We consider Mellum2 to be a "focal" model: a fast model meant to be used throughout large-scale AI systems for high-frequency tasks. The goal is not to replace every model in the stack. The goal is to make stacks faster, cheaper, and easier to control.
Get started with Mellum2
If you're building an AI system for software engineering (inside an IDE, in a RAG pipeline, as part of an agent workflow, or on private infrastructure), Mellum2 is ready for you to try.

