Versa AI hub

Nemotron-Labs Diffusion: Towards light-speed text generation using diffusion language models

By versatileai · May 24, 2026 · 6 min read

Large language models (LLMs) have become the default interface for code generation, mathematical problem solving, summarization, document understanding, and many other developer workflows. Under the hood, however, most LLMs still generate text the same way: one token at a time, with each token depending on the tokens that came before it. Such models are called autoregressive because they consume their own output.

This autoregressive (AR) approach has been remarkably successful. It is stable to train, easy to serve, and has driven many of the advances in modern language modeling. However, it also imposes a hard limit. Every new token requires a full forward pass through the model, and all of the weights must be loaded from memory before computation can start. For developers building latency-sensitive applications, running small batch sizes, or trying to get the most out of modern GPUs, token-by-token generation can perform poorly because most of the GPU's time is spent on memory operations rather than computation.
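The memory-bound argument above can be made concrete with a back-of-the-envelope estimate. The sketch below is an illustration only: the function names are hypothetical, and the inputs (8 billion parameters, 2 bytes per parameter, 2,000 GB/s of memory bandwidth) are assumed round numbers, not measurements of any particular GPU or model. It ignores KV-cache traffic and compute time, so it gives a lower bound on latency rather than a prediction.

```python
def min_pass_latency_ms(num_params: float, bytes_per_param: float,
                        bandwidth_gb_s: float) -> float:
    """Lower bound on one forward pass if every weight is read from memory once."""
    weight_bytes = num_params * bytes_per_param
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


def max_tokens_per_second(pass_latency_ms: float,
                          tokens_per_pass: float = 1.0) -> float:
    """Upper bound on throughput at batch size 1 for a given tokens-per-pass."""
    return tokens_per_pass * 1000.0 / pass_latency_ms


# Assumed, illustrative inputs (not vendor specifications).
latency = min_pass_latency_ms(num_params=8e9, bytes_per_param=2,
                              bandwidth_gb_s=2000)
print(f"per-pass lower bound: {latency:.1f} ms")
print(f"AR ceiling (1 token per pass): {max_tokens_per_second(latency):.0f} tok/s")
print(f"ceiling at 4 tokens per pass: {max_tokens_per_second(latency, 4):.0f} tok/s")
```

Under these assumptions the weights alone take roughly 8 ms to stream, so a model that commits one token per pass cannot exceed about 125 tokens per second at batch size 1, no matter how much compute is idle. Committing several tokens per pass raises that ceiling proportionally.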

Furthermore, once an autoregressive model generates a token, that token is final; the model has essentially no way to revise earlier tokens. As a result, mistakes can propagate through the rest of the generation.

Nemotron-Labs Diffusion introduces a different path: diffusion language models (DLMs), which generate multiple tokens in parallel and iteratively refine them over several steps. These models make better use of the compute available on modern GPUs and deliver significant runtime performance gains. Because generated tokens can be revised, they are also better suited to tasks such as correcting existing text and filling in the middle of a passage. This generate-and-refine property also provides a built-in way to control the inference budget: reducing the number of refinement steps reduces the compute the model needs at runtime.

Quick links to models, training recipes, and technical reports

The Nemotron-Labs Diffusion family includes text models at the 3B, 8B, and 14B scales, all available under the commercially friendly NVIDIA Nemotron Open Model License. An 8B vision language model (VLM) is also available under the NVIDIA Source Code License, which allows broad flexibility for research. NVIDIA is releasing both base models and instruction-tuned chat variants across the lineup, along with code to train these models through the NVIDIA Megatron Bridge framework.

One model, three generation modes

Nemotron-Labs Diffusion is designed around a simple idea: autoregressive and diffusion generation should not be separate model families. They should be capabilities of the same model. The model supports three generation modes.

Autoregressive mode works like a standard left-to-right LLM. This maintains compatibility with the generation workflows developers already know.

Diffusion mode generates tokens block by block, refining each block over multiple steps.

Self-speculation mode uses diffusion to draft multiple candidate tokens and autoregressive decoding to validate them. This combines the potential speed of diffusion-style drafting with the reliability of AR validation.
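The draft-then-validate idea behind self-speculation can be sketched with toy stand-ins for the model. Everything here is hypothetical and for illustration: `toy_next_token` plays the role of the AR verifier, `toy_draft` plays the role of the diffusion drafter (with one deliberate mistake per block), and committing the verifier's token at the first mismatch is the usual speculative-decoding convention, assumed here rather than taken from the article. A real model would score all draft positions in a single causal forward pass; the toy calls the verifier once per position for clarity.

```python
from typing import Callable, List, Tuple


def accept_matching_prefix(draft: List[int], verified: List[int]) -> List[int]:
    """Keep draft tokens until the first disagreement, then take the verifier's token."""
    accepted = []
    for drafted, checked in zip(draft, verified):
        if drafted != checked:
            accepted.append(checked)  # assumed convention: commit the correction
            break
        accepted.append(drafted)
    return accepted


def self_speculative_generate(
    prompt: List[int],
    draft_fn: Callable[[List[int], int], List[int]],
    next_token_fn: Callable[[List[int]], int],
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of draft/verify rounds used."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = draft_fn(tokens, block_size)
        # Verifier output at each draft position, given the draft tokens before it.
        verified = [next_token_fn(tokens + draft[:i]) for i in range(len(draft))]
        tokens.extend(accept_matching_prefix(draft, verified))
        rounds += 1
    return tokens[: len(prompt) + max_new_tokens], rounds


def toy_next_token(context: List[int]) -> int:
    """Toy AR verifier: the sequence simply counts upward."""
    return (context[-1] + 1) % 100


def toy_draft(context: List[int], block_size: int) -> List[int]:
    """Toy drafter: correct except for a deliberately wrong last token."""
    draft = [(context[-1] + 1 + i) % 100 for i in range(block_size)]
    draft[-1] = 0
    return draft


output, rounds = self_speculative_generate([0], toy_draft, toy_next_token,
                                           block_size=4, max_new_tokens=12)
print(output, rounds)
```

In this toy run the output matches what token-by-token AR decoding would produce, because only tokens the verifier agrees with (or supplies itself) are committed, yet 12 tokens are generated in 3 rounds instead of 12.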

This flexibility matters for developers who need both speed and accuracy, including workloads with unpredictable batch sizes or single queries (batch size = 1). Because the mode is a deployment-time setting, selecting an inference mode requires few changes at the application level. Developers can switch from their current model to Nemotron-Labs Diffusion, and between its inference modes, to achieve much faster generation.

Performance highlights

Nemotron-Labs Diffusion 8B achieves 1.2% higher accuracy on average than Qwen3 8B. Inference speed is compared in tokens per forward pass (TPF), a hardware-independent measure of decoding efficiency. Diffusion mode reaches a 2.6x higher TPF than the AR model, and self-speculation pushes this further, to 6x for linear self-speculation and 6.4x for quadratic self-speculation, with comparable accuracy across the tasks evaluated.
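To make the TPF figures concrete, the sketch below computes the metric and shows how many forward passes a 1,000-token response would need at each reported ratio. The function names are hypothetical, and the 1,000-token response length is an arbitrary example. It treats the AR baseline as a TPF of 1, since an AR model commits exactly one token per forward pass.

```python
import math


def tokens_per_forward(tokens_generated: int, forward_passes: int) -> float:
    """TPF: committed tokens divided by the forward passes used to produce them."""
    if forward_passes <= 0:
        raise ValueError("forward_passes must be positive")
    return tokens_generated / forward_passes


def forward_passes_needed(num_tokens: int, tpf: float) -> int:
    """Forward passes required to emit num_tokens at a given average TPF."""
    return math.ceil(num_tokens / tpf)


# TPF ratios reported in the article, relative to an AR baseline of 1.
reported_tpf = {
    "autoregressive": 1.0,
    "diffusion": 2.6,
    "linear self-speculation": 6.0,
    "quadratic self-speculation": 6.4,
}

for mode, tpf in reported_tpf.items():
    print(f"{mode}: {forward_passes_needed(1000, tpf)} passes for 1000 tokens")
```

TPF counts passes, not seconds. A pass that drafts and verifies a whole block can cost more than a single AR step, so the wall-clock speedup on real hardware can be smaller than the TPF ratio.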

How we trained Nemotron-Labs Diffusion

Diffusion language models have shown promise for years, but they have historically faced practical barriers: they were less accurate than strong AR models, harder to train, and had limited compatibility with KV caches.

Recent research has changed that picture. Efficient-DLM showed that a pre-trained AR model can be converted into a diffusion language model by continuing pre-training and switching the attention mechanism to a block-wise pattern. This design helps preserve the capabilities of the AR model while enabling parallel decoding that is compatible with KV caching.
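One common way to read "block-wise" attention is a block-causal mask: tokens attend to every position in their own block (bidirectionally) and to all earlier blocks, but never to later blocks. The sketch below builds such a mask in plain Python. This is an assumed interpretation for illustration, with a hypothetical function name; the article does not spell out the exact masking pattern Efficient-DLM or Nemotron-Labs Diffusion uses.

```python
from typing import List


def block_attention_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        # Visible: every key in the same block or in any earlier block.
        mask.append([(k // block_size) <= q_block for k in range(seq_len)])
    return mask


# Six positions in blocks of two; rows are queries, columns are keys.
for row in block_attention_mask(seq_len=6, block_size=2):
    print("".join("1" if visible else "0" for visible in row))
```

With a block size of 1 the mask reduces to the ordinary causal (lower-triangular) mask of an AR model. Because earlier blocks are never affected by later ones under this pattern, their keys and values can be cached once a block is finished, which is why this kind of layout fits KV caching.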

Nemotron-Labs Diffusion builds on the same practical insight and adds diffusion capabilities to an existing AR model. The model was trained with a joint AR and diffusion objective, so that diffusion adds parallel drafting capabilities while the model retains what it learned during its initial AR training. It was pre-trained on 1.3T tokens from the NVIDIA Nemotron pre-training dataset and then went through a supervised fine-tuning phase on 45B tokens from the NVIDIA Nemotron post-training dataset.

Deployment and inference with SGLang

Deployment of Nemotron-Labs Diffusion models will soon be supported in the main branch of SGLang. As of this writing, inference support is available through this issue tracker request on GitHub.

The advantage of this integration is that you can serve the same checkpoint in three different ways, selected with a single line in the algorithm settings.

Plain autoregression – With ar_mode=true, the model behaves like any other causal LM. This is useful as an accuracy reference or when you need a sanity check against pure AR output.

Diffusion mode (FastDiffuser) – The raw-throughput headliner. The model uses iterative denoising to fill blocks of 32 tokens at a time, with a confidence threshold determining which tokens are confident enough to commit at each step.

Self-speculation (LinearSpec) – This is our favorite. The same model drafts blocks bidirectionally and validates them causally. Any matching prefix is committed. The output at temperature 0 is lossless compared to AR, yet it reached ~865 tok/s on a B200 on the speedbench dataset, approximately 4 times faster than the autoregressive baseline on the same hardware.
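The confidence-threshold commit rule described for diffusion mode can be sketched with a toy predictor. This is a simplified illustration, not the SGLang implementation: `denoise_block`, `toy_predict`, the fallback of always committing the single most confident token, and the choice to commit everything on the last allowed step are all assumptions made so the example terminates. The demo uses a block of 8 rather than 32 to keep the trace short.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been committed yet
Proposal = Tuple[int, float]  # (proposed token, confidence in [0, 1])


def denoise_block(
    predict_fn: Callable[[List[Optional[int]]], List[Proposal]],
    block_size: int,
    threshold: float,
    max_steps: int,
) -> Tuple[List[Optional[int]], int]:
    """Fill one block by repeatedly committing proposals that clear the threshold."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    block: List[Optional[int]] = [MASK] * block_size
    for step in range(1, max_steps + 1):
        proposals = predict_fn(block)
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if step == max_steps:
            chosen = masked  # step budget exhausted: commit whatever remains
        else:
            chosen = [i for i in masked if proposals[i][1] >= threshold]
            if not chosen:  # guarantee progress on every step
                chosen = [max(masked, key=lambda i: proposals[i][1])]
        for i in chosen:
            block[i] = proposals[i][0]
        if MASK not in block:
            return block, step
    return block, max_steps


def toy_predict(block: List[Optional[int]]) -> List[Proposal]:
    """Toy model: confidence drops with the number of masked positions to the left."""
    proposals = []
    masked_to_left = 0
    for i, tok in enumerate(block):
        proposals.append((i, 1.0 / (1 + masked_to_left)))
        if tok is MASK:
            masked_to_left += 1
    return proposals


for threshold in (0.9, 0.5, 0.3):
    block, steps = denoise_block(toy_predict, block_size=8,
                                 threshold=threshold, max_steps=32)
    print(f"threshold={threshold}: {steps} steps")
```

In the toy, lowering the threshold commits more tokens per step and finishes the block in fewer passes, and capping `max_steps` bounds the compute spent on a block. With a real model, both knobs trade speed against output quality.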

Get started now

Nemotron-Labs Diffusion brings diffusion-style generation to developers in a ready-to-use form: open models, familiar AR compatibility, diffusion decoding, and self-speculative acceleration. It gives developers new ways to generate, refine, validate, and accelerate text without changing their applications.

Get started by exploring the Nemotron-Labs Diffusion model family, reading the technical report, and trying the available training recipes.
