Tools

Compact hybrid model for efficient local AI

By versatileai | March 18, 2026

We are pleased to introduce Nemotron 3 Nano 4B, the newest and most compact member of the Nemotron 3 family. Built on a hybrid Mamba-Transformer architecture, the model is designed for efficiency and accuracy across its target feature set, setting a new standard for lightweight small language models. It runs on any NVIDIA GPU-enabled platform and combines state-of-the-art instruction following with strong tool calling in a minimal VRAM footprint.

With just 4 billion parameters, Nemotron 3 Nano 4B is compact enough to run at the edge on NVIDIA Jetson platforms (Jetson Thor and Jetson Orin Nano), NVIDIA DGX Spark, and NVIDIA RTX GPUs. This enables faster response times, greater data privacy, and flexible deployment while keeping inference costs low.

Nemotron 3 Nano 4B is the first model specifically optimized for on-device deployment, designed to power local conversational agents and personas across GeForce RTX, Jetson, and DGX Spark use cases. The model delivers state-of-the-art accuracy and efficiency in several areas important for production use at the edge:

  • Instruction following (IFBench, IFEval): most advanced in its size class
  • Gaming agents (Orak): most advanced in its size class
  • VRAM efficiency (peak memory usage): lowest VRAM footprint in its size class (*1) for both low and high ISL/OSL settings
  • Latency: lowest TTFT (*1) in its size class for high ISL settings

(*1) Efficiency benchmarks were measured on RTX 4070 using Llama.cpp with Q4_K_M quantized versions of both models.
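As a rough illustration of how a time-to-first-token number can be probed locally, the sketch below streams a completion through the llama-cpp-python bindings and times the first chunk. It is a minimal sketch, not the benchmark methodology above; the GGUF path and prompt are placeholders.

    # Minimal TTFT probe with llama-cpp-python; the model path is a placeholder.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="nemotron-3-nano-4b-q4_k_m.gguf", n_ctx=8192)

    start = time.perf_counter()
    ttft = None
    for chunk in llm("Summarize the benefits of on-device LLMs.",
                     max_tokens=64, stream=True):
        if ttft is None:
            ttft = time.perf_counter() - start  # first streamed token
    print(f"time to first token: {ttft:.3f}s")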

Additionally, the Nemotron 3 Nano 4B delivers superior tool usage performance and is highly competitive in hallucination avoidance. Taken together, these features make this model a strong fit for edge use cases.

Nemotron 3 Nano 4B is pruned and distilled from Nemotron Nano 9B v2 using the Nemotron Elastic framework, allowing it to inherit strong reasoning capabilities as a hybrid reasoning model. Further post-training was performed using new recipes derived from Nemotron 3's post-training data, enabling the model to excel at task solving without explicit thinking.

Finally, as an open source model, it allows the ecosystem to customize, fine-tune, and optimize for domain-specific use cases.

For the Orak benchmark, models were evaluated on games such as Super Mario, Darkest Dungeon, and Stardew Valley.

Nemotron 3 Nano 4B Training Recipes


Compress 9B → 4B with Nemotron Elastic

Nemotron 3 Nano 4B is derived from Nemotron Nano 9B v2 using Nemotron Elastic technology. Rather than training a 4B model from scratch or performing separate pruning, candidate-search, and extraction stages, as existing LLM compression techniques do, Nemotron Elastic uses structured pruning guided by routers. The router is trained jointly with the model using an auxiliary loss that targets the student model's size budget alongside the original knowledge distillation loss. This technology yields optimal student models at a fraction of the cost of pre-training from scratch or traditional compression.
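As a loose sketch of what such a joint objective can look like, the snippet below combines a soft-target distillation term with a penalty on the router's expected parameter count. All names are illustrative; the actual formulation is in the Nemotron Elastic paper.

    # Illustrative joint loss: distillation term + size-budget penalty.
    import torch.nn.functional as F

    def joint_loss(student_logits, teacher_logits,
                   expected_params, target_params=4e9,
                   temperature=2.0, size_weight=1.0):
        # Soft-target KD between the elastic student and the frozen teacher.
        kd = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature**2
        # Push the router's expected student size toward the 4B budget.
        size = ((expected_params - target_params) / target_params) ** 2
        return kd + size_weight * size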

How the router decides what to prune

Nemotron Elastic introduces an end-to-end trained router that performs knowledge distillation as well as neural architecture search across multiple compression axes. For Nano 4B, the framework is used in a single-budget configuration (targeting only the 4B parameter count), and the router's role is to decide which axes to prune, and by how much, to reach the target budget.

The router was given four pruning axes to choose from:

  • Mamba heads: reduce the number of SSM heads
  • Hidden dimension (embedding dimension): reduce the model's overall representation width
  • FFN channels: prune intermediate neurons in MLP layers
  • Depth (layers): remove entire layers from the network

For each width axis, prior knowledge about component importance was provided to the router by sorting channels, heads, and neurons according to activation-based importance scores. For depth, a normalized MSE-based layer importance ranking was used: each layer was removed in turn and its impact on the full model's output logits was measured, giving a principled ordering of layer importance. For more information, see the Nemotron Elastic paper. Given the 4B target parameter budget, the router converged on the following pruning decision:

Axis                 | Nemotron Nano 9B v2 (Parent)               | Nemotron 3 Nano 4B
---------------------|--------------------------------------------|-------------------------------------------
Depth                | 56 layers (27 Mamba, 4 Attention, 25 MLP)  | 42 layers (21 Mamba, 4 Attention, 17 MLP)
Mamba heads          | 128                                        | 96
FFN intermediate dim | 15680                                      | 12544
Embedding dim        | 4480                                       | 3136
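The activation-based seeding described above can be approximated with a simple forward hook, as in the sketch below: rank a submodule's channels by mean absolute activation over a few calibration batches. This is a simplified stand-in for the paper's scoring, with illustrative names throughout.

    # Rank channels of one submodule by mean |activation| over calibration data.
    # Assumes the submodule emits [batch, seq, channels] activations.
    import torch

    def rank_channels(model, submodule, calib_batches):
        scores = []
        hook = submodule.register_forward_hook(
            lambda mod, inp, out: scores.append(out.detach().abs().mean(dim=(0, 1)))
        )
        with torch.no_grad():
            for batch in calib_batches:
                model(batch)  # forward pass fires the hook
        hook.remove()
        importance = torch.stack(scores).mean(dim=0)
        return torch.argsort(importance, descending=True)  # most important first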

Two-stage distillation for accuracy recovery

After the router determines the pruned architecture, the compressed model is retrained with knowledge distillation from the frozen 9B parent on Nano v2 pre- and post-training data. This accuracy recovery proceeds in two stages.

Stage 1: Short-context distillation (8K sequence length). The 4B model is trained on 63B tokens with an 8K context window, using a data blend of roughly 70% post-training and 30% pre-training data from the parent Nano v2 recipe. This step drives the initial recovery of model accuracy after compression.

Stage 2: Long-context extension (49K sequence length). The context window is extended to 49K tokens to recover performance on harder tasks that require extended reasoning chains. In this stage, the model is trained on 150B tokens.
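Summarized as configuration, and assuming hypothetical key names, the two recovery stages look roughly like this (the stage-2 data blend is not specified above, so it is omitted):

    # Illustrative stage configs for accuracy recovery; key names are hypothetical.
    RECOVERY_STAGES = [
        {  # Stage 1: initial recovery after compression
            "name": "short_context_distillation",
            "seq_len": 8_192,                      # 8K context window
            "train_tokens": 63_000_000_000,        # 63B tokens
            "data_blend": {"post_training": 0.70, "pre_training": 0.30},
        },
        {  # Stage 2: long-context extension (blend unspecified in the text)
            "name": "long_context_extension",
            "seq_len": 49_152,                     # "49K" per the text
            "train_tokens": 150_000_000_000,       # 150B tokens
        },
    ]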

Supervised fine-tuning

Megatron-LM was used to run two stages of SFT on relevant subsets of the Nemotron-Post-Training-v3 collection. The first stage trains the model on a mix of reasoning and non-reasoning data across a variety of domains, including math, coding, science, chat, instruction following, and agentic tasks. The second stage is a short, focused run that reinforces safety behaviors.
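A blend like the first SFT stage can be sketched with Hugging Face datasets as below; the dataset id, config names, and mixing ratio are placeholders rather than the actual recipe.

    # Hypothetical blend of reasoning and non-reasoning SFT subsets.
    from datasets import load_dataset, interleave_datasets

    reasoning = load_dataset("nvidia/Nemotron-Post-Training-v3",  # placeholder id
                             "math", split="train")               # hypothetical config
    general = load_dataset("nvidia/Nemotron-Post-Training-v3",
                           "chat", split="train")                 # hypothetical config

    sft_mix = interleave_datasets([reasoning, general],
                                  probabilities=[0.5, 0.5],       # illustrative ratio
                                  seed=42)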

Reinforcement learning in multiple environments

Once the model is bootstrapped with SFT, we switch to a three-stage RL pipeline built on NeMo-RL, targeting the focus areas of instruction following and tool calling/agentic behavior. The first stage uses single-turn instruction-following data. The second stage uses the NeMo-Gym environment for single- and multi-turn instruction following and structured output (JSON, XML). The third stage uses a preliminary version of Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1 for multi-turn conversational tool calling. A balanced 50:50 ratio of reasoning and non-reasoning data was used across the three RLVR stages, with the KL penalty gradually increasing at each stage.
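For the structured-output stage, a verifiable reward can be as simple as checking that a completion parses against the requested format. The sketch below is an illustrative stand-in, not the NeMo-Gym implementation.

    # Toy verifiable reward: 1.0 if the completion is valid JSON with the
    # required keys, else 0.0. The key names here are illustrative.
    import json

    def json_reward(completion: str, required=("name", "arguments")) -> float:
        try:
            obj = json.loads(completion)
        except json.JSONDecodeError:
            return 0.0
        return 1.0 if isinstance(obj, dict) and all(k in obj for k in required) else 0.0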

Improving efficiency through quantization

For edge devices, further reducing model size through quantization is essential for improving efficiency and lowering VRAM usage. Nemotron 3 Nano 4B is therefore released in FP8 and Q4_K_M GGUF variants.

For the FP8 model, we applied post-training quantization (PTQ) using the ModelOpt library. For PTQ calibration, activation statistics were estimated on a small subset of 1K samples from the post-training SFT dataset to minimize quantization-related accuracy loss. To preserve accuracy while improving efficiency, we also applied a selective quantization strategy rather than quantizing the entire network. Comparing a set of quantization configurations, we found that keeping the self-attention layers (4 of the 42 layers) and the 4 Mamba layers preceding each self-attention layer in BF16 hits a sweet spot in the trade-off between accuracy recovery and efficiency gains. Model weights, activations, and the KV cache are quantized to FP8, while the Conv1D in all Mamba layers is kept in BF16. The FP8 model achieved 100% median accuracy recovery across target benchmarks compared to the BF16 model, and improves latency and throughput by up to 1.8x over the original BF16 version on DGX Spark and Jetson Thor.
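With ModelOpt, selective FP8 PTQ of this kind follows the pattern sketched below. The wildcard pattern and calibration loader are assumptions, and `model` is presumed loaded elsewhere; consult the ModelOpt documentation for the exact configuration.

    # Selective FP8 PTQ sketch with NVIDIA ModelOpt; `model` and `calib_loader`
    # are assumed to exist, and the layer pattern is a placeholder.
    import copy
    import modelopt.torch.quantization as mtq

    cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
    # Keep selected attention/Mamba blocks in BF16 by disabling their quantizers.
    cfg["quant_cfg"]["*self_attention*"] = {"enable": False}  # placeholder pattern

    def forward_loop(m):
        # Drive ~1K SFT calibration samples through the model so activation
        # ranges can be estimated for quantization.
        for batch in calib_loader:
            m(**batch)

    model = mtq.quantize(model, cfg, forward_loop)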

Llama.cpp supports the widely adopted Q4_K_M GGUF quantization scheme, a 4-bit format that offers a good balance between efficiency and accuracy. The Q4_K_M GGUF version likewise achieved 100% median accuracy recovery across target benchmarks compared to the BF16 model.

This GGUF release is also suitable for Jetson deployments. On the Jetson Orin Nano 8GB, designed for small embedded devices, the Q4_K_M checkpoint running on Llama.cpp achieves 18 tokens/sec, up to 2x higher throughput than Nemotron Nano 9B v2. This highlights Nemotron 3 Nano 4B's efficiency for edge inference in embedded AI and robotics use cases.

Try it now!

Nemotron 3 Nano 4B is available with a range of inference engines, including Transformers, vLLM, TRT-LLM, and Llama.cpp, supporting a wide variety of edge deployment scenarios. First, visit the Hugging Face repository and download the model checkpoint; examples for Hugging Face Transformers, vLLM, TRT-LLM, and Llama.cpp are available on the model card.
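As a minimal starting point with Transformers, here is a sketch assuming a placeholder repo id; the model card has the authoritative examples.

    # Minimal Transformers sketch; the repo id below is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "nvidia/Nemotron-3-Nano-4B"  # placeholder; see the Hugging Face model card
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(
        repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )

    messages = [{"role": "user", "content": "Name three uses for on-device LLMs."}]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=128)
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))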

For Jetson, you can find step-by-step instructions and ready-to-run commands on the Jetson AI Lab model page.

Also, check out the NVIDIA In-Game Inferencing (NVIGI) SDK to speed up inference performance when running models alongside heavy graphics workloads.
