We converted a 15B reasoning model into a Mamba hybrid and achieved 2.1x throughput with minimal quality loss. The key? A non-obvious insight about which data to distill on, and why intuition doesn’t help here.
When MiniMax published its M2 post-mortem in October, explaining why it had abandoned efficient attention at 230B scale, the narrative briefly became “efficient attention is dead.” Within days, Kimi Linear proved otherwise. The real lesson: it depends on your constraints.
Our constraints were simple. We have a strong 15B reasoning model and needed to make it more efficient without reinventing the wheel. There is no infinite compute budget for pre-training on 20T tokens, and we don’t have the luxury of co-designing the architecture from day one. So the practical question is: can you retrofit efficiency into an existing model through distillation?
Spoiler: Yes. But only if you ignore your intuition about what data to use.
what we built
The Apriel-H1 family: 7 checkpoints spanning 25-40 Mamba layers (out of 50 layers total), tracing the full efficiency-quality frontier. Our flagship, Apriel-H1-15b-Thinker-SFT, achieves 2.1x throughput with minimal quality loss. MATH500 and MTBench improve by a few points (0.90 → 0.92 and 8.30 → 8.58, respectively), while GSM8k (0.97 → 0.95), GPQA (0.59 → 0.55), and AIME24 (0.70 → 0.58) regress slightly. Total training: 76.8 billion tokens.
Apriel-H1-15b-Thinker-SFT (green) vs. the full-attention teacher (blue). Reasoning quality stays roughly flat across benchmarks, while throughput increases by a factor of 1.89 to 2.09 depending on context length.
Full details are in the Apriel-H1 paper. Here, we focus on the key insights that made it work.
non-trivial insights
Here’s what we initially thought would work: just distill on pre-training data and finish with SFT.
The reasoning seemed solid. We’re inserting brand-new Mamba layers that have never seen any data. These linear SSMs must learn to mix generic tokens from scratch. How can they become effective mixers without being exposed to the same broad distribution the original attention layers saw?
So we tried it. We also tried mixing pre-training data with SFT data. Neither worked. The distilled hybrids lost reasoning quality, sometimes dramatically.
What actually worked: high-quality reasoning traces from the teacher’s SFT dataset.
Distilling a reasoning model is not about transferring generic next-token prediction. The base model already has that; we start from a strong 15B foundation. What we need to preserve is something specific and fragile: the teacher’s multi-step reasoning patterns.
Those patterns emerge from intricate attention machinery: retrieval heads that pull context from thousands of tokens back, induction-style heads that recognize and continue logical chains, long-range dependencies that connect a premise to a conclusion many steps later. Swapping attention for Mamba’s linear recurrence disrupts these computational mechanisms. The hybrid has to discover new paths to the same reasoning results.
That discovery requires explicit examples where the reasoning structure is visible and correct:
- Multi-step mathematical proofs where each step follows from earlier results
- Coding tasks with clear logical dependencies
- Scientific analyses with detailed explanatory chains
Pre-training data, by contrast, is too noisy and too diffuse; the reasoning signal gets diluted. You need focused examples of the specific capability you want to keep.
Once the data choice is clear, the distillation recipe becomes clearer too. We used reverse KL divergence (temperature 1) instead of forward KL, and reverse KL won consistently. Why? We train on problems where the teacher is highly confident and the structure is clear. Reverse KL’s mode-seeking behavior encourages the student to commit to the teacher’s confident predictions. If the teacher is confident and correct, we want the student to be confident too.
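To make the objective concrete, here is a minimal PyTorch sketch of the reverse-KL loss at temperature 1, with forward KL alongside for contrast. The function and tensor names are illustrative, not the Fast-LLM implementation.

import torch.nn.functional as F

def reverse_kl_loss(student_logits, teacher_logits):
    """Reverse KL, KL(student || teacher): mode-seeking, temperature 1.
    Shapes are (batch, seq, vocab); teacher logits come from a no-grad forward."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits.detach(), dim=-1)
    # sum_x q(x) * (log q(x) - log p(x)), averaged over tokens
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()

def forward_kl_loss(student_logits, teacher_logits):
    """Forward KL, KL(teacher || student): mass-covering, shown for contrast."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits.detach(), dim=-1)
    return (teacher_logp.exp() * (teacher_logp - student_logp)).sum(-1).mean()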
This insight anchors the entire approach: match your distillation data to the capability you’re preserving, not the capability you’re building.
how we did it: staged distillation
You can’t just swap 40 attention layers for Mamba and hope. We learned this the hard way and ended up with a staged distillation procedure.
Stage 1: Identify the least important layers. We ran a leave-one-out (LOO) analysis on MMLU: remove each layer, replace it with an identity, and measure the drop. Sort by importance, replace the bottom 25 with Mamba-in-Llama (MIL) initialized mixers, and distill end-to-end. This produced the H-25 checkpoint. A sketch of the scoring loop follows below.
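For illustration, here is what that leave-one-out scoring loop might look like. `model.layers`, `IdentityBlock`, and `evaluate_mmlu` are placeholders, not actual Fast-LLM or Transformers APIs.

import torch.nn as nn

class IdentityBlock(nn.Module):
    """Stands in for a removed decoder block: passes hidden states through unchanged."""
    def forward(self, hidden_states, *args, **kwargs):
        return hidden_states

def leave_one_out_importance(model, evaluate_mmlu):
    """Score each decoder block by the MMLU drop observed when it is skipped."""
    baseline = evaluate_mmlu(model)
    scores = {}
    for idx in range(len(model.layers)):               # model.layers: an nn.ModuleList (placeholder name)
        original = model.layers[idx]
        model.layers[idx] = IdentityBlock()
        scores[idx] = baseline - evaluate_mmlu(model)  # bigger drop = more important layer
        model.layers[idx] = original
    return scores                                      # convert the lowest-scoring 25 to Mamba first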
Stage 2: Progressive conversion beyond 25 layers. LOO broke down past 25 layers, because layers that are unimportant on their own become important in combination. To address this, we developed a dynamic heuristic we call MIL-Mamba-Replacement (MMR): for each remaining attention layer, initialize a Mamba mixer with MIL, run 100 training steps, and record the distillation loss. Layers that converge to lower loss are “easy” to replace. This captures training dynamics rather than static importance; see the sketch below.
We progressed stepwise through 25 → 27 → 30 → 34 → 37 → 40 Mamba layers, converting layers in groups ordered by MMR score. Each checkpoint is distilled from the previous one.
Stage 3: End-to-end training on SFT data. After reaching the target number of Mamba layers, we ran a final SFT pass until reasoning performance stabilized. After 55.9B distillation tokens and 20.9B SFT tokens, this produced the final Apriel-H1-15b-Thinker-SFT model.
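A rough sketch of the MMR scoring loop, with MIL initialization and the short distillation run abstracted behind placeholder callables.

import copy

def mmr_scores(hybrid, remaining_attention_layers, mil_init_mamba, distill_steps):
    """Rank remaining attention layers by how easily a MIL-initialized Mamba mixer
    absorbs them. Both callables are placeholders:
      mil_init_mamba(attn_mixer) -> Mamba mixer initialized from the attention weights
      distill_steps(model, num_steps) -> mean distillation loss over a short run
    """
    scores = {}
    for idx in remaining_attention_layers:
        candidate = copy.deepcopy(hybrid)   # for clarity; in practice snapshot/restore weights instead
        candidate.layers[idx].mixer = mil_init_mamba(candidate.layers[idx].mixer)
        scores[idx] = distill_steps(candidate, num_steps=100)   # ~100 steps, record the loss
    # layers whose short runs converge to the lowest loss are "easy" and get converted first
    return sorted(scores, key=scores.get)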

The full efficiency-quality frontier. Each checkpoint is labeled with its cumulative training tokens. Our flagship H-30-SFT (released as Apriel-H1-15b-Thinker-SFT) used 76.8B tokens in total and achieves 2.1x throughput at an average score of 0.76. The aggressively converted H-40 variant reaches 3.4x throughput using 136.5B tokens. For reference, NVIDIA’s Nemotron-Nano-9B-v2 achieves 4.6x at a 0.77 score, but was trained from scratch with orders of magnitude more compute.
Make it reproducible: Fast-LLM
We built all of this on top of Fast-LLM, an open-source training framework. Its core architectural principle: the transformer block is modular. Attention and Mamba are just different implementations of the same “mixer” interface and are freely interchangeable.
In Fast-LLM’s configuration format, a hybrid architecture looks like this:
decoder:
  type: "pattern"
  blocks:
    attention_block:
      mixer:
        type: "attention"
        heads: 32
        head_groups: 8
        head_size: 128
      mlp:
        type: "gated"
        activation: "silu"
    mamba_block:
      mixer:
        type: "mamba"
        d_inner: 4096
        state_size: 16
        dt_rank: 16
      mlp:
        type: "gated"
        activation: "silu"
  num_blocks: 50
  pattern: ["attention_block", "attention_block", "mamba_block", ...]
The pattern field specifies the layer ordering. For Apriel-H1-15b-Thinker-SFT: 30 mamba_blocks and 20 attention_blocks, placed according to importance. That’s it.
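As an illustration, here is how such a pattern could be assembled from an importance ranking; the block names simply have to match those declared in the config above, and the helper below is hypothetical.

def build_pattern(importance_ranking, num_blocks=50, num_mamba=30):
    """Turn a ranking of layer indices (least important first) into the pattern list.
    The num_mamba least important positions become mamba_block; the rest keep attention."""
    mamba_positions = set(importance_ranking[:num_mamba])
    return ["mamba_block" if i in mamba_positions else "attention_block"
            for i in range(num_blocks)]

# e.g. build_pattern(ranking)[:4] might yield
# ["attention_block", "attention_block", "mamba_block", "mamba_block"]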
Distillation is also just configuration:
model:
  base_model:
    head:
      distillation_model: teacher
      distillation_loss_implementation: reverse_kl
  reference_models:
    teacher:
      pretrained:
        format: mistral
        path: path/to/Apriel-Nemotron-15b-Thinker
Fast-LLM handles everything needed for large-scale experiments: gradient accumulation, distributed training, tensor parallelism, and checkpointing. It is open source under Apache 2.0. You can reproduce this work because the infrastructure was designed to let you reproduce it.
FAQ
Why release all checkpoints? Because what counts as optimal depends on your constraints. H-30 strikes the best balance. H-40 maximizes throughput for latency-sensitive workloads. The intermediate checkpoints let you pick your exact tradeoff.
Why does the speedup vary with context length? Mamba’s linear-complexity advantage grows with sequence length, while attention’s cost grows quadratically.
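A rough back-of-envelope illustration, using the dimensions quoted in this post (the H-30 hybrid keeps 20 attention layers with 8 KV head groups of size 128; its 30 Mamba layers use d_inner 4096 and state size 16; 2-byte precision). It only counts per-sequence cache memory and ignores conv states and compute, so it is not a throughput model, but it shows why the gap widens with context.

def kv_cache_bytes(attn_layers, context_len, kv_heads=8, head_size=128, dtype_bytes=2):
    """Per-sequence KV cache: 2 (K and V) x layers x tokens x kv_heads x head_size."""
    return 2 * attn_layers * context_len * kv_heads * head_size * dtype_bytes

def mamba_state_bytes(mamba_layers, d_inner=4096, state_size=16, dtype_bytes=2):
    """Per-sequence SSM state, independent of context length (conv state ignored)."""
    return mamba_layers * d_inner * state_size * dtype_bytes

for ctx in (8_192, 32_768, 131_072):
    full = kv_cache_bytes(50, ctx)                              # full-attention teacher
    hybrid = kv_cache_bytes(20, ctx) + mamba_state_bytes(30)    # H-30 hybrid
    print(f"{ctx:>7} tokens: full {full / 2**30:.1f} GiB vs hybrid {hybrid / 2**30:.2f} GiB")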
Why only Mamba? We used Mamba-1 for three reasons: it has a proven distillation track record, strong empirical performance, and was straightforward to implement in the framework. We wanted to isolate the data question first.
What were the Mamba hyperparameters? State size 16, dt_rank 16, inner dimension 4096. For Apriel’s GQA setup, we expanded B (input projection) and x (state) to match the total number of attention heads, following M1.
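For orientation, a sketch of how those numbers translate into parameter shapes in a vanilla Mamba-1 mixer. The hidden size and conv kernel below are illustrative assumptions, and the GQA head-matching expansion mentioned above is not modeled.

def mamba1_mixer_param_shapes(d_model, d_inner=4096, d_state=16, dt_rank=16, d_conv=4):
    """Parameter counts for a vanilla Mamba-1 mixer (biases omitted).
    d_model and d_conv here are assumptions, not the model's actual values."""
    return {
        "in_proj":  d_model * 2 * d_inner,              # x and z branches
        "conv1d":   d_inner * d_conv,                   # depthwise causal conv
        "x_proj":   d_inner * (dt_rank + 2 * d_state),  # projects to dt, B, C
        "dt_proj":  dt_rank * d_inner,
        "A_log":    d_inner * d_state,
        "D":        d_inner,
        "out_proj": d_inner * d_model,
    }

counts = mamba1_mixer_param_shapes(d_model=4096)   # hidden size here is an assumption
print({name: f"{n / 1e6:.2f}M" for name, n in counts.items()},
      f"total ~ {sum(counts.values()) / 1e6:.1f}M per mixer")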
Why not a more sophisticated conversion method? We used Mamba-in-Llama initialization plus knowledge distillation rather than MOHAWK’s multi-stage procedure, because the latter showed no significant advantage in preliminary experiments.
Why SFT only the H-30 model? To verify that distilled hybrids can be improved through standard post-training, we applied SFT only to H-30. The other checkpoints are pure distillations, but they can be tuned the same way.
Why no RL? This was a scoping decision to isolate the distillation question: can reasoning be transferred by knowledge distillation alone? Answer: yes. RL could further close the remaining quality gaps, and we are considering it for future iterations.
Did we really show that Apriel-H1 matches full-attention reasoning at a comparable compute budget? We did not run an apples-to-apples comparison between Apriel-H1 and a hybrid trained identically from pre-training onward. That would require repeating all of the teacher’s training and post-training with the Apriel-H1 architecture, which was beyond our compute budget. What we can claim is that retrofitting efficiency through distillation is practical and effective, and that the resulting hybrids can be fine-tuned to match or exceed the teacher’s reasoning quality.
production reality
We implemented Apriel-H1 in Hugging Face Transformers and vLLM. The Transformers integration is straightforward: we ship a custom model class with interchangeable attention and Mamba layers. The vLLM integration uses the modern Mamba cache ops for continuous batching, prefix caching, and chunked prefill. The vLLM plugin is ready; we are awaiting final legal approval to open-source it.
Honest assessment: going hybrid today means rough edges. The tooling is maturing quickly, but it isn’t turnkey yet. Expect to write custom code, verify numerical behavior carefully, and work around framework limitations. For teams that can absorb that cost, the throughput gains are worth it. For those that can’t, waiting may be the right call.
takeaways
Most teams do not have infinite compute to pre-train on 20T tokens. If you have invested in a strong base model and need more efficiency, this work offers a practical path: hybrid distillation on high-quality, task-specific data matched to the capabilities you already have.
The central finding, using reasoning data to distill reasoning, seems obvious in hindsight but contradicts the initial intuition. We’ve validated it, explained why it works, and built the infrastructure to make it reproducible.
try it out
Model: Apriel-H1 collection on Hugging Face
Training framework: Fast-LLM on GitHub
Teacher model: Apriel-Nemotron-15b-Thinker
Paper: Apriel-H1: Towards Efficient Enterprise Reasoning Models
Found something broken? File an issue. Discovered a better layer-placement heuristic? Let us know. Built something interesting with Apriel-H1? We’d love to see it.
Citation:
@article{apriel-h1-2025, title={Apriel-H1: Towards Efficient Enterprise Reasoning Models}, author={SLAM Lab, ServiceNow}, journal={arXiv preprint arXiv:2511.02651}, year={2025} }
Main contributors: Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier, Torsten Scholak
Contributors: Shruthan Radhakrishna, Soham Parikh, Shambhavi Mishra
Technical co-leads: Torsten Scholak, Sathwik Tejaswi Madhusudhan

