12B Mixture-of-Experts Model by JetBrains

Mellum2 is a 12B-parameter Mixture-of-Experts (MoE) model trained from scratch on natural language and code. It activates only 2.5 billion parameters per token, which allows efficient, high-throughput, low-latency inference. Mellum2 can be used for routing, RAG, summarization, subagents, high-throughput coding tasks, and private deployments. It is released under the Apache 2.0 License. Mellum2 delivers competitive benchmark performance while achieving over 2x faster inference than similarly sized models.

Download the model on Hugging Face: https://huggingface.co/collections/JetBrains/mellum-2

Read the full technical report for architecture details, training settings, benchmarks, and evaluation methods: https://arxiv.org/pdf/2605.31268

Today, we’re releasing Mellum2, an open Mixture-of-Experts model optimized for low-latency text and code workloads. Mellum originally started as a code completion model. Mellum2 keeps that focus on efficient inference and deployability while extending its foundation to a broader range of natural language and software engineering tasks. Modern AI systems increasingly rely on multiple model calls, including routing, retrieval, summarization, planning, validation, and tool use. Many of these operations are latency-sensitive and do not require the largest model available. Mellum2 targets these workloads.

Benchmark highlights

Our technical report evaluates Mellum2 across code generation, reasoning, science, and math benchmarks. Mellum2 delivers more than 2x faster inference while remaining competitive with similarly sized open models, making it suitable for high-throughput production workloads.

Model architecture

Mellum2 is a Mixture-of-Experts (MoE) model.

Model     Total parameters   Active parameters per token   Modality        License
Mellum2   12B                2.5B                          Text and code   Apache 2.0

The MoE architecture keeps the total capacity of the model high while activating only a subset of the parameters for each token. This makes inference more efficient and reduces processing costs for real-time workloads. Mellum2 intentionally focuses on text and code rather than multimodal tasks. This specialization keeps the model compact and efficient for software engineering workloads.
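As a rough illustration of how activating "only a subset of the parameters" works, here is a minimal sketch of top-k expert routing in plain Python. The number of experts, the value of k, and the scalar toy experts are assumptions made for this example only; the article does not state Mellum2's actual routing configuration. The last line uses the 12B and 2.5B figures quoted above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Renormalize the gate weights over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(router_logits, k)
    out = sum(w * experts[i](x) for i, w in routes)
    return out, [i for i, _ in routes]

# Toy setup: NUM_EXPERTS and TOP_K are illustrative, not Mellum2's real values.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]

output, used = moe_layer(1.0, experts, logits, TOP_K)
print("experts used for this token:", used)

# Share of parameters active per token, from the figures quoted in the article.
active_fraction = 2.5 / 12
print(f"active fraction: {active_fraction:.1%}")
```

The other six toy experts are never called for this token, which is where the inference savings come from: compute scales with the active parameters rather than the total.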

Key use cases

Routing and orchestration

Mellum2 works well as a lightweight routing and orchestration model in multi-model systems, including prompt classification, tool selection, and intermediate control flow steps.
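A prompt-classification router of this kind might look like the sketch below. The route labels, the prompt wording, and the `classify` callable are all hypothetical: `classify` stands in for whatever client you use to call the small model, and `fake_classify` is a keyword stub so the example runs without a model.

```python
from typing import Callable, Dict

# Hypothetical route labels; a real system would define its own.
ROUTES = ("code", "search", "chat")

def build_prompt(user_message: str) -> str:
    """Build a short classification prompt that asks for a single label."""
    labels = ", ".join(ROUTES)
    return (
        f"Classify the request into exactly one of: {labels}.\n"
        f"Request: {user_message}\n"
        "Label:"
    )

def route(user_message: str,
          classify: Callable[[str], str],
          handlers: Dict[str, Callable[[str], str]],
          default: str = "chat") -> str:
    """Ask the small model for a label, then dispatch to the matching handler."""
    label = classify(build_prompt(user_message)).strip().lower()
    if label not in handlers:
        # Unknown or malformed label: fall back instead of failing.
        label = default
    return handlers[label](user_message)

def fake_classify(prompt: str) -> str:
    """Stand-in for a model call; returns noisy output on purpose."""
    return " Code\n" if "traceback" in prompt.lower() else "unknown"

handlers = {
    "code": lambda m: f"[code model] {m}",
    "search": lambda m: f"[retriever] {m}",
    "chat": lambda m: f"[chat model] {m}",
}

print(route("Why does this traceback mention KeyError?", fake_classify, handlers))
print(route("Hello there", fake_classify, handlers))
```

Normalizing the label and falling back to a default keeps the control flow predictable even when the classifier returns extra whitespace or an unexpected string.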

RAG pipelines

The model is suitable for latency-sensitive retrieval pipelines, handling steps such as context compression, summarization, and post-retrieval processing.
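To make the context compression step concrete, the sketch below fits ranked passages into a fixed token budget and summarizes passages that would overflow it. Everything here is an assumption for illustration: `summarize` stands in for a call to a small model, `first_sentence` is a stub that replaces it, and token counts are approximated by whitespace splitting.

```python
from typing import Callable, List

def count_tokens(text: str) -> int:
    """Crude token estimate; a real pipeline would use the model's tokenizer."""
    return len(text.split())

def compress_context(passages: List[str],
                     summarize: Callable[[str], str],
                     budget: int) -> List[str]:
    """Keep passages in ranked order; summarize any that would exceed the budget."""
    kept: List[str] = []
    used = 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > budget:
            # Too large as-is: try a shorter version before giving up on it.
            passage = summarize(passage)
            cost = count_tokens(passage)
            if used + cost > budget:
                continue
        kept.append(passage)
        used += cost
    return kept

def first_sentence(text: str) -> str:
    """Stand-in summarizer: keep only the first sentence."""
    return text.split(". ")[0]

passages = [
    "Top-k routing selects a few experts per token.",
    "The router scores every expert. It then keeps the best ones and drops the rest.",
    "Unrelated trailing note about release dates and packaging details.",
]
for p in compress_context(passages, first_sentence, budget=14):
    print("-", p)
```

Passages are processed in retrieval order, so the highest-ranked ones keep their full text and lower-ranked ones are shortened or dropped first.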

Subagents

Mellum2 can be used for agent subtasks such as planning, validation, transformation, and context preparation, reducing the need to invoke large models for intermediate operations.
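One common way to reduce calls to a large model is a small-model-first cascade with escalation, sketched below. This is an illustrative pattern, not a method described in the article: `small_model`, `large_model`, and `is_valid` are hypothetical callables, and the stubs exist only so the example runs.

```python
from typing import Callable, Tuple

def run_with_escalation(task: str,
                        small_model: Callable[[str], str],
                        large_model: Callable[[str], str],
                        is_valid: Callable[[str], bool]) -> Tuple[str, str]:
    """Try the small model first; call the large model only if the check fails."""
    draft = small_model(task)
    if is_valid(draft):
        return draft, "small"
    return large_model(task), "large"

# Stubs: the small model "gives up" on long tasks by returning an empty string.
def small_stub(task: str) -> str:
    return "" if len(task) > 40 else f"plan: {task}"

def large_stub(task: str) -> str:
    return f"detailed plan: {task}"

def non_empty(answer: str) -> bool:
    return bool(answer.strip())

tasks = [
    "rename a variable",
    "refactor the payment module across three services and update tests",
]
for t in tasks:
    answer, tier = run_with_escalation(t, small_stub, large_stub, non_empty)
    print(f"{tier}: {answer}")
```

The validity check is the critical design choice: it has to be cheap and reliable, otherwise weak drafts slip through or too many tasks escalate.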

Private deployment

Mellum2 is openly available and efficient to run, so it can be deployed in self-hosted environments that contain proprietary code and internal data.

Why well-scoped models matter

As AI systems mature, the most effective architectures are becoming less monolithic. A single frontier model can be powerful, but production systems often need several specialized components working together: retrievers, routers, code-aware models, validators, tool callers, and larger reasoning models. We consider Mellum2 a “focal” model: a fast, practical model optimized for high-frequency tasks within large-scale AI systems. The goal is not to replace every model in the stack. The goal is to make stacks faster, cheaper, and easier to control.

Get started with Mellum2

If you’re building an AI system for software engineering (inside an IDE, in a RAG pipeline, as part of an agent workflow, or on private infrastructure), Mellum2 is ready for you to try.

versatileai
