Versa AI hub

NVIDIA releases a multilingual reasoning dataset with 6 million samples

By versatileai, May 18, 2026

Authors: Dhruv Nathwani, Shuoyang Ding, Vitaly Lavrukhin, Jane Polak Scowcroft, Oleksii Kuchaiev

NVIDIA continues to release permissively licensed datasets in support of an open ecosystem, this time with a multilingual reasoning dataset of 6 million samples.

Following the recent Nemotron Post-Training Dataset v1 release, which was used in the Llama Nemotron Super model, and the Llama Nemotron Post-Training Dataset release earlier this year, we are pleased to release a reasoning dataset translated into five target languages: French, Spanish, German, Italian, and Japanese.

The newly released NVIDIA Nemotron Nano 2 9B brings these capabilities to the edge with leading accuracy and efficiency through its hybrid Transformer-Mamba architecture and a configurable thinking budget, so you can trade off accuracy, throughput, and cost to suit your actual needs.

Model highlights (TL;DR)

  • Model size: 9B parameters
  • Architecture: Hybrid Transformer-Mamba (Mamba-2 plus a small number of attention layers) for higher throughput with accuracy similar to Transformer-only peers
  • Throughput: Up to 6x higher token generation than other leading models in the same size class
  • Cost: The thinking budget lets you control the number of "thinking" tokens used, reducing inference costs by up to 60%
  • Target: Customer service, support chatbots, agent copilots for analytics, and edge/RTX deployments
  • Availability: Model weights are available on Hugging Face, and you can try out the endpoints at build.nvidia.com. The model will also be available as a high-throughput, low-latency NVIDIA NIM.
  • License: nvidia-open-model-license

This release is a significant step forward in our continued commitment to openness and transparency in model development and improvement. NVIDIA supports the continuous improvement of open-weight models by releasing training data in addition to training tools and final model weights.

What is in the dataset and how was it built?

In short, Nemotron Post-Training Dataset V2 takes previously released English reasoning data and translates it into five target languages: French, German, Italian, Japanese, and Spanish. To make the most of the English knowledge acquired during pre-training, we translate the user's prompts and the model's responses while preserving the original English reasoning chain.
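
As a minimal sketch of that idea, the snippet below translates only the prompt and the response of a record and copies the reasoning chain through unchanged. The `Example` fields and the `translate` callable are hypothetical names chosen for illustration; the source does not describe the actual schema or translation interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    # Hypothetical field names; the real dataset schema may differ.
    prompt: str
    reasoning: str
    response: str


def translate_example(ex: Example, translate: Callable[[str], str]) -> Example:
    """Translate the prompt and response, keeping the English reasoning chain."""
    return Example(
        prompt=translate(ex.prompt),
        reasoning=ex.reasoning,  # deliberately left in English
        response=translate(ex.response),
    )
```

Any function that maps a string to its translation can be passed as `translate`, which keeps the sketch independent of a particular model.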

According to the results of the WMT 2024 general translation shared task, LLMs achieve state-of-the-art results on machine translation tasks. However, for the synthetic generation of post-training data, our preliminary research revealed the following:

  • LLMs are more prone to hallucinations when translating SFT datasets than when translating common machine translation test sets (such as FLORES).
  • The translation quality and hallucination rate of open-source LLMs both worsen significantly as the input length increases.

We therefore built in several mechanisms to maintain high translation quality and make hallucinations easy to detect. In summary:

  • Translate the text line by line, splitting on line breaks. A line that is not translatable (for example, one containing only a tab) or that is part of a code block is left untranslated.
  • Enforce a specific output format ("wrap the translated text in 〘〙 brackets") and use these matching brackets to extract the translation. Examples whose output does not follow the format are discarded (see Table 1).
  • Run fastText language ID on the translated prompt and filter out off-target data points. This discarded an additional 55,567 examples (another 1.1% of all multilingual examples).
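
The three mechanisms above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline NVIDIA used: `translate_line` and `predict_lang` are hypothetical callables standing in for the translation model and a fastText language-ID wrapper, code blocks are assumed to be delimited by Markdown fences, and an output is assumed valid only if it contains exactly one 〘〙 pair.

```python
import re
from typing import Callable, Optional

BRACKETED = re.compile(r"〘(.*?)〙")
FENCE = "`" * 3  # assumption: code blocks are marked by Markdown fences


def extract_translation(model_output: str) -> Optional[str]:
    """Return the text inside exactly one 〘〙 pair, otherwise None."""
    matches = BRACKETED.findall(model_output)
    return matches[0] if len(matches) == 1 else None


def translate_text(text: str, translate_line: Callable[[str], str]) -> Optional[str]:
    """Translate line by line; return None if the example should be discarded."""
    out = []
    in_code = False
    for line in text.split("\n"):
        if line.strip().startswith(FENCE):
            in_code = not in_code
            out.append(line)  # fence markers are copied through
            continue
        if in_code or not line.strip():
            out.append(line)  # code and blank/whitespace-only lines stay as-is
            continue
        translated = extract_translation(translate_line(line))
        if translated is None:
            return None  # format violation: discard the whole example
        out.append(translated)
    return "\n".join(out)


def keep_on_target(translation: str, target_lang: str,
                   predict_lang: Callable[[str], str]) -> bool:
    """Keep a translation only if the detected language matches the target."""
    return predict_lang(translation) == target_lang
```

Returning `None` for the whole example, rather than for a single line, mirrors the description that examples failing the format check are discarded.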

Table 1: Percentage of data discarded due to the enforced output format (measured in bytes)

Language   code     qa       math
de         2.28%    1.11%    2.47%
es         26.14%   5.15%    6.38%
fr         11.01%   1.37%    1.96%
it         4.94%    1.36%    0.75%
ja         7.68%    2.51%    3.86%
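
A byte-based discard percentage like the ones in Table 1 could be computed along these lines. The source does not say which text is counted or which encoding is used, so UTF-8 and the `discarded_byte_share` helper are assumptions made for illustration.

```python
from typing import Sequence


def discarded_byte_share(examples: Sequence[str], kept: Sequence[bool]) -> float:
    """Percentage of total bytes (UTF-8 assumed) that belong to discarded examples."""
    total = sum(len(text.encode("utf-8")) for text in examples)
    dropped = sum(
        len(text.encode("utf-8"))
        for text, is_kept in zip(examples, kept)
        if not is_kept
    )
    return 100.0 * dropped / total if total else 0.0
```

Measuring in bytes rather than in example counts weights long examples more heavily, so a few long discarded examples can account for a large share.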

After benchmarking, we selected Qwen2.5-32B-Instruct-AWQ (for German) and Qwen2.5-14B-Instruct (for the other languages) to perform the translation. We chose these models for the following reasons:

  • Robust translation quality
  • Fits on a single A100 GPU for inference
  • Training data covers a wide range of domains
  • Open license (Apache 2.0)
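
The per-language model choice described above could be captured in a small lookup, sketched here. The `model_for` helper and the constant names are hypothetical; only the model names and language codes come from the text.

```python
TARGET_LANGS = ("de", "es", "fr", "it", "ja")

# German uses the larger quantized model; the other languages use the 14B model.
MODEL_OVERRIDES = {"de": "Qwen2.5-32B-Instruct-AWQ"}
DEFAULT_MODEL = "Qwen2.5-14B-Instruct"


def model_for(lang: str) -> str:
    """Return the translation model name for a target language code."""
    if lang not in TARGET_LANGS:
        raise ValueError(f"unsupported target language: {lang}")
    return MODEL_OVERRIDES.get(lang, DEFAULT_MODEL)
```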

How to use

    # Requires the Hugging Face `datasets` package
    from datasets import load_dataset
    ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v2")

👉 Explore the dataset here: Hugging Face dataset page
