Large language models (LLMs) have become the default interface for code generation, mathematical problem solving, summarization, document understanding, and many other developer workflows. Under the hood, however, most LLMs still generate text the same way: one token at a time, with each token depending on the tokens that came before it. These models are called autoregressive because they consume their own output.
This autoregressive (AR) approach has been remarkably successful. It is stable to train, easy to serve, and responsible for many of the advances in modern language modeling. However, it also creates a hard limit. Every new token requires a complete model pass, and all weights must be loaded from memory before computation can begin. For developers building latency-sensitive applications, running small batch sizes, or trying to take full advantage of modern GPUs, token-by-token generation can perform poorly because most of the GPU's time is spent on memory operations rather than computation.
Furthermore, once an autoregressive model generates a token, that token is final: the model has essentially no way to revise earlier tokens. As a result, mistakes can propagate through the rest of the generation.
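The memory-bound behavior described above can be estimated with a back-of-the-envelope calculation. The sketch below assumes that each forward pass streams every weight from memory exactly once and ignores KV-cache traffic, activations, and compute time. The parameter count, precision, and bandwidth figures are illustrative assumptions, not measurements of any particular model or GPU.

```python
def max_tokens_per_second(num_params, bytes_per_param,
                          mem_bandwidth_bytes_per_s, tokens_per_pass=1.0):
    """Upper bound on decode throughput at batch size 1, assuming every
    forward pass must read all model weights from memory once."""
    weight_bytes = num_params * bytes_per_param
    seconds_per_pass = weight_bytes / mem_bandwidth_bytes_per_s
    return tokens_per_pass / seconds_per_pass


# Assumed, illustrative numbers: 8e9 parameters stored in 2 bytes each,
# on a hypothetical accelerator with 2e12 bytes/s of memory bandwidth.
one_token = max_tokens_per_second(8e9, 2, 2e12)
four_tokens = max_tokens_per_second(8e9, 2, 2e12, tokens_per_pass=4)

print(f"1 token per pass:  ~{one_token:.0f} tok/s upper bound")
print(f"4 tokens per pass: ~{four_tokens:.0f} tok/s upper bound")
```

Under these assumptions the bound scales linearly with the number of tokens committed per pass, which is why emitting several tokens per forward pass is attractive when decoding is limited by memory bandwidth.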
Nemotron-Labs Diffusion introduces a different approach: diffusion language models (DLMs), which generate multiple tokens in parallel and iteratively refine them over several steps. These models make better use of the compute available on modern GPUs and deliver significant runtime performance benefits. They also allow already-generated tokens to be revised, which makes them better suited to tasks such as correcting existing text and filling in the middle of a passage. This generate-and-refine property also provides a built-in way to control the inference budget: reducing the number of refinement steps reduces the compute these models require at runtime.
Quick links to models, training recipes, and technical reports
The Nemotron-Labs Diffusion family includes text models at the 3B, 8B, and 14B scales, all available under the commercially usable NVIDIA Nemotron Open Model License. An 8B vision language model (VLM) is also available under the NVIDIA Source Code License, allowing for extensive research flexibility. NVIDIA is releasing both base models and instruction-tuned chat variants across the lineup. NVIDIA has also released code for training these models with the NVIDIA Megatron Bridge framework.
One model, three generation modes
Nemotron-Labs Diffusion is designed around a simple idea: autoregressive and diffusion generation should not be separate model families. They should be capabilities of the same model. The model supports three generation modes.
Autoregressive mode works like a standard left-to-right LLM. This maintains compatibility with the generation workflows that developers already know.
Diffusion mode generates tokens block by block, refining each block over multiple steps.
Self-speculation mode uses diffusion to draft multiple candidate tokens and autoregressive decoding to verify them. This combines the potential speed of diffusion-style drafting with the reliability of AR verification.
This flexibility matters for developers who need both speed and accuracy, including workloads with unpredictable batch sizes or single queries (batch size = 1). Because the generation mode is a deployment-time setting, selecting an inference mode requires few changes at the application level. Developers can swap Nemotron-Labs Diffusion in for their current model and switch between inference modes to get faster generation.
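To make the block-by-block refinement of diffusion mode more concrete, here is a toy sketch of one common way to run it: start from a fully masked block, ask the model for a proposal at every masked position, and commit only the proposals whose confidence clears a threshold. The `toy_denoiser` function is a hypothetical stand-in that returns random confidences, and the block size and threshold are arbitrary. This illustrates the control flow only and is not the Nemotron-Labs Diffusion implementation.

```python
import random

MASK = None  # marks a position that has not been committed yet


def toy_denoiser(block, rng):
    """Hypothetical stand-in for one model forward pass: proposes a
    (token, confidence) pair for every still-masked position."""
    return {i: (f"tok{i}", rng.random())
            for i, tok in enumerate(block) if tok is MASK}


def decode_block(block_size, threshold, rng):
    """Iteratively fill one block; returns the block and the step count."""
    block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block):
        proposals = toy_denoiser(block, rng)
        steps += 1
        accepted = {i: tok for i, (tok, conf) in proposals.items()
                    if conf >= threshold}
        if not accepted:
            # Always commit the most confident proposal so that every
            # step makes progress, even with a very strict threshold.
            best = max(proposals, key=lambda i: proposals[i][1])
            accepted = {best: proposals[best][0]}
        for i, tok in accepted.items():
            block[i] = tok
    return block, steps


block, steps = decode_block(block_size=8, threshold=0.7, rng=random.Random(0))
print(f"filled {len(block)} positions in {steps} forward passes")
```

In this sketch a lower threshold commits more tokens per step and finishes in fewer passes, while a stricter threshold takes more passes. That is the kind of speed and quality knob the refinement budget provides.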
Performance highlights
Nemotron-Labs Diffusion 8B achieves 1.2% higher average accuracy than Qwen3 8B. Inference efficiency is compared in tokens per forward pass (TPF), a hardware-independent measure of token decoding efficiency. Diffusion mode reaches a 2.6x higher TPF than the AR model, and self-speculation pushes this further to 6x for linear self-speculation and 6.4x for quadratic self-speculation, with comparable accuracy across the evaluated tasks.
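As a quick illustration of the TPF metric, the sketch below divides the number of tokens committed by the number of forward passes needed to produce them. The counts are made-up values chosen only to show the arithmetic and are not taken from the reported benchmarks.

```python
def tokens_per_forward(tokens_generated, forward_passes):
    """Tokens per forward pass (TPF): committed tokens / model passes."""
    if forward_passes <= 0:
        raise ValueError("forward_passes must be positive")
    return tokens_generated / forward_passes


# Plain AR decoding commits exactly one token per pass, so its TPF is 1.
ar_tpf = tokens_per_forward(tokens_generated=120, forward_passes=120)

# Hypothetical parallel decoder that needed 40 passes for the same output.
parallel_tpf = tokens_per_forward(tokens_generated=120, forward_passes=40)

print(f"AR TPF = {ar_tpf}, parallel TPF = {parallel_tpf}")
```

Because TPF counts passes rather than seconds, it does not depend on the GPU used. Wall-clock speedups can be smaller if each multi-token pass costs more than a single-token pass.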
How we trained Nemotron-Labs Diffusion
Diffusion language models have shown promise for years, but have historically faced practical barriers: they were less accurate than strong AR models, more difficult to train, and had limited compatibility with KV caches.
Recent research has changed that picture. Efficient-DLM showed that a pre-trained AR model can be converted into a diffusion language model by continuing pre-training and switching the attention mechanism to a block-wise pattern. This design helps preserve the capabilities of the AR model while enabling parallel decoding that remains compatible with KV caching.
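One plausible reading of a block-wise attention pattern is a block-causal mask: positions attend to everything in their own block (bidirectionally) and to all earlier blocks, but never to later blocks. The sketch below builds such a mask in plain Python. It is an assumption about the general shape of the pattern, and the exact mask used by Efficient-DLM or Nemotron-Labs Diffusion may differ.

```python
def block_causal_mask(seq_len, block_size):
    """mask[q][k] is True when query position q may attend to key
    position k: same block or any earlier block, never a later one."""
    return [[(k // block_size) <= (q // block_size)
             for k in range(seq_len)]
            for q in range(seq_len)]


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
```

With a block size of 2, position 0 can see position 1 (same block) but not position 2 (next block). Because earlier blocks never attend to later ones, their keys and values do not change once the block is finished, which is what lets them be stored in a KV cache.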
Nemotron-Labs Diffusion builds on the same practical insight and adds diffusion capability to existing AR models. The model was trained with a joint AR and diffusion objective, so that diffusion adds parallel drafting capability while the model retains what it learned during its initial AR training. The model was pre-trained on 1.3 trillion tokens from the NVIDIA Nemotron pre-training dataset and then underwent a supervised fine-tuning phase on 45 billion tokens from the NVIDIA Nemotron post-training dataset.
Deployment and inference with SGLang
Support for deploying Nemotron-Labs Diffusion models is expected to land in the main branch of SGLang soon. As of this writing, inference support is available through a pending request on the project's GitHub tracker.
With this integration, the same checkpoint can be served in three different ways, selected with a single line in the algorithm settings.
Plain autoregressive – With ar_mode=true, the model behaves like any other causal LM. This is useful as an accuracy reference or when you need a sanity check against pure AR output.
Diffusion mode (FastDiffuser) – The raw-throughput headliner. The model uses iterative denoising to fill blocks of 32 tokens at a time, with a confidence threshold determining which tokens are confident enough to commit at each step.
Self-speculation (LinearSpec) – This is our favorite. The same model drafts a block bidirectionally and then verifies it causally. The matching prefix of the draft is committed. At temperature 0 the output is lossless relative to AR, yet it reached ~865 tok/s on a B200 on the speedbench dataset, approximately 4 times faster than the autoregressive baseline on the same hardware.
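The draft-then-verify idea behind self-speculation can be sketched with a generic greedy speculative-decoding loop. Everything here is a toy: `ar_next` is a hypothetical deterministic stand-in for greedy AR decoding, and `noisy_draft` is a hypothetical drafter that deliberately makes periodic mistakes. It is not the LinearSpec implementation, and a real system would verify the whole draft in one batched forward pass instead of a Python loop.

```python
def ar_next(prefix):
    """Hypothetical greedy AR 'model': next token is a fixed function
    of the prefix, so decoding is deterministic (temperature 0)."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def noisy_draft(prefix, k):
    """Hypothetical drafter: agrees with ar_next except at every third
    absolute position, where it proposes a wrong token."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 3 == 2:
            tok = (tok + 1) % 7  # injected drafting error
        draft.append(tok)
        ctx.append(tok)
    return draft


def verify(prefix, draft):
    """Keep the longest draft prefix that matches greedy AR output, then
    append the AR token at the first mismatch (if there is one)."""
    ctx = list(prefix)
    for tok in draft:
        target = ar_next(ctx)
        if tok != target:
            ctx.append(target)
            break
        ctx.append(tok)
    return ctx[len(prefix):]


def generate_speculative(prompt, n_new, k=4):
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n_new:
        seq += verify(seq, noisy_draft(seq, k))
        passes += 1  # one draft-and-verify round
    return seq[:len(prompt) + n_new], passes


def generate_ar(prompt, n_new):
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(ar_next(seq))
    return seq


spec_out, rounds = generate_speculative([1, 2], n_new=20)
print("identical to AR:", spec_out == generate_ar([1, 2], n_new=20))
print("draft-and-verify rounds:", rounds, "vs 20 AR steps")
```

Because every committed token is either a draft token that equals the greedy AR choice or the AR choice itself, the output matches plain greedy decoding exactly. The drafter's quality affects only how many rounds are needed, not what is generated.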
Get started now
Nemotron-Labs Diffusion brings diffusion-style generation to developers in a ready-to-use form: an open model with familiar AR compatibility, diffusion decoding, and self-speculative acceleration. It gives developers new ways to generate, refine, verify, and accelerate text without changing their applications.
Get started by exploring the Nemotron-Labs Diffusion model family, reading the technical reports, and trying the available training recipes.