Authors: Dhruv Nathwani, Shuoyang Ding, Vitaly Lavrukhin, Jane Polak Scowcroft, Oleksii Kuchaiev
NVIDIA continues to release permissively licensed datasets in support of the open ecosystem, this time with a multilingual reasoning dataset of 6 million examples.
Building on the recent Nemotron Post-Training Dataset v1 release, which was used in the Llama Nemotron Super model, and on the Llama Nemotron post-training dataset released earlier this year, we are pleased to release a reasoning dataset translated into five target languages: French, Spanish, German, Italian, and Japanese.
The newly released NVIDIA Nemotron Nano 2 9B brings these capabilities to the edge with leading accuracy and efficiency, thanks to its hybrid Transformer-Mamba architecture and configurable thinking budget, so you can tune accuracy, throughput, and cost to your actual needs.
Model highlights (TL;DR)
- Model size: 9B parameters
- Architecture: Hybrid Transformer-Mamba (Mamba-2 plus a small number of attention layers) for higher throughput at accuracy similar to Transformer-only peers
- Throughput: Up to 6x higher token generation than other leading models in the same size class
- Cost: The thinking budget lets you control how many "thinking" tokens are used, reducing inference costs by up to 60%
- Target: Customer service agents, support chatbots, analytics copilots, and edge/RTX deployments
- Availability: The model weights are available on Hugging Face, and you can try the endpoint at build.nvidia.com. The model will also be available as a high-throughput, low-latency NVIDIA NIM
- License: nvidia-open-model-license
This release is a significant step forward in our continued commitment to openness and transparency in model development and improvement. NVIDIA supports the continuous improvement of open-weight models by releasing training data in addition to training tools and final model weights.
What is in the dataset and how was it built?
At a high level, Nemotron Post-Training Dataset v2 takes previously released English reasoning data and translates it into five target languages: French, German, Italian, Japanese, and Spanish. To make the most of the English knowledge acquired during pre-training, we translate the user prompts and model responses while preserving the original English reasoning chain.
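As a minimal sketch of that idea, the snippet below translates only the prompt and the response of a sample and copies the reasoning chain through unchanged. The `Sample` class, its field names, and the `translate` callable are hypothetical names introduced for illustration; they are not the dataset's actual schema or the pipeline's actual code.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Sample:
    # Hypothetical field names, chosen for illustration only.
    prompt: str
    reasoning: str
    response: str


def translate_sample(sample: Sample, translate: Callable[[str], str]) -> Sample:
    """Translate the prompt and response; keep the English reasoning as-is."""
    return replace(
        sample,
        prompt=translate(sample.prompt),
        response=translate(sample.response),
    )


# Toy stand-in for a real translation model: a fixed lookup table.
toy_fr = {
    "What is 2 + 2?": "Combien font 2 + 2 ?",
    "The answer is 4.": "La réponse est 4.",
}
en = Sample("What is 2 + 2?", "2 + 2 = 4.", "The answer is 4.")
fr = translate_sample(en, toy_fr.__getitem__)
```

Here `fr.prompt` and `fr.response` are French, while `fr.reasoning` is still the original English string.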
According to the results of the WMT 2024 general translation shared task, LLMs achieve state-of-the-art results on machine translation. However, for synthetically generating post-training data, our preliminary studies revealed the following:
- LLMs are more prone to hallucination when translating SFT datasets than when translating common machine translation test sets (such as FLORES).
- With open-source LLMs, translation quality degrades significantly and hallucinations become more frequent as the input length increases.
We therefore built in several mechanisms to keep translation quality high and make hallucinations easy to detect. In summary:
- We split the text on line breaks and translate it line by line. A line that is not translatable (for example, one containing only a tab) or that is part of a code block is left untranslated.
- We enforce a specific output format ("wrap the translated text in the brackets 〘〙") and use these matching brackets to extract the translation. Examples that do not follow the format are discarded (see Table 1).
- We run fastText language ID on the translated prompt and filter out off-target data points. This discarded another 55,567 examples (a further 1.1% of all multilingual examples).
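The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: `translate_line` stands in for a call to the translation LLM, `detect_lang` stands in for a fastText language-ID call, and treating triple-backtick fences as the code-block marker is our own simplification.

```python
import re
from typing import Callable, Optional

FENCE = "`" * 3                      # assumed code-block delimiter
BRACKETED = re.compile(r"〘(.*?)〙")   # text wrapped in the special brackets


def extract_translation(model_output: str) -> Optional[str]:
    """Return the bracketed text, or None if the format was not followed."""
    matches = BRACKETED.findall(model_output)
    # Require exactly one bracketed span; anything else counts as a failure.
    return matches[0] if len(matches) == 1 else None


def translate_document(
    text: str, translate_line: Callable[[str], str]
) -> Optional[str]:
    """Translate line by line; return None if the example must be discarded."""
    out = []
    in_code = False
    for line in text.split("\n"):
        is_fence = line.strip().startswith(FENCE)
        if is_fence:
            in_code = not in_code
        # Fence lines, code lines, and whitespace-only lines pass through.
        if is_fence or in_code or not line.strip():
            out.append(line)
            continue
        translated = extract_translation(translate_line(line))
        if translated is None:
            return None  # output format violated: discard the whole example
        out.append(translated)
    return "\n".join(out)


def is_on_target(
    prompt: str, target_lang: str, detect_lang: Callable[[str], str]
) -> bool:
    """Keep a sample only if the translated prompt is in the target language."""
    return detect_lang(prompt) == target_lang


# Toy stand-ins so the sketch runs without a model.
def well_behaved(line: str) -> str:
    return "〘" + line.upper() + "〙"


def misbehaving(line: str) -> str:
    return line.upper()  # forgets the brackets


doc = "\n".join(["hello", "\t", FENCE, "x = 1", FENCE, "bye"])
translated_doc = translate_document(doc, well_behaved)
discarded_doc = translate_document(doc, misbehaving)
```

With the toy translator, `translated_doc` keeps the tab-only line and the code block untouched, while `discarded_doc` is `None` because the output never contained the brackets.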
Table 1: Percentage of data discarded due to the enforced output format (by bytes)
| Language | code | qa | math |
|---|---|---|---|
| de | 2.28% | 1.11% | 2.47% |
| es | 26.14% | 5.15% | 6.38% |
| fr | 11.01% | 1.37% | 1.96% |
| it | 4.94% | 1.36% | 0.75% |
| ja | 7.68% | 2.51% | 3.86% |
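To make "by bytes" concrete, here is one way such a ratio could be computed: the discarded share is weighted by the size of each example rather than by the number of examples. Measuring size as the UTF-8 byte length of the example text is an assumption on our part; the post does not say exactly which bytes were counted.

```python
from typing import Iterable, Tuple


def discarded_byte_ratio(examples: Iterable[Tuple[str, bool]]) -> float:
    """Share of bytes belonging to discarded examples.

    Each item is (text, kept). Size is measured as UTF-8 byte length,
    which is an assumption made for this sketch.
    """
    total = 0
    dropped = 0
    for text, kept in examples:
        size = len(text.encode("utf-8"))
        total += size
        if not kept:
            dropped += size
    return dropped / total if total else 0.0


# Two of three examples are discarded, but they hold only half of the bytes.
ratio = discarded_byte_ratio([("abcd", True), ("ab", False), ("é", False)])
```

In this toy case two of the three examples are dropped, yet the byte-weighted ratio is 0.5, because the kept example is as large as the two discarded ones combined.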
After benchmarking, we selected Qwen2.5-32B-Instruct-AWQ (for German) and Qwen2.5-14B-Instruct (for the other languages) to perform the translation. We considered the following when choosing these models:
- Robust translation quality
- Fits on a single A100 GPU for inference
- Training data that covers a wide range of domains
- Open license (Apache 2.0)
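A rough back-of-the-envelope check shows why both models are plausible fits for one A100, which ships with 40 GB or 80 GB of memory. The numbers below are our own estimate, not figures from the authors: we take the nominal parameter counts from the model names, assume 4-bit weights for the AWQ checkpoint and 16-bit weights for the unquantized one, and ignore the KV cache, activations, and quantization metadata.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumed: nominal 32B parameters at 4 bits per weight (AWQ).
awq_32b_gb = weight_memory_gb(32, 4)

# Assumed: nominal 14B parameters at 16 bits per weight (unquantized).
bf16_14b_gb = weight_memory_gb(14, 16)
```

Under these assumptions the weights take roughly 16 GB and 28 GB respectively, which leaves headroom on a single card for the KV cache during inference.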
How to use
from datasets import load_dataset
ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v2")
👉 Explore the dataset here: Hugging Face dataset page

