A collaborative approach to scaling trustworthy AI systems and agents.
Advances in AI are often framed in terms of model capability and efficiency. In reality, every training pipeline ultimately rests on a data layer that determines how the model behaves.
As agent systems become more autonomous, what they know, how they reason, and what they can safely do will increasingly be determined by the data on which they are trained. Yet much of today’s training data remains opaque, fragmented, or siloed across teams.
Open data access changes that equation. It gives developers a faster, more cost-effective path to building high-quality models while making evaluation and improvement easier across the ecosystem. That is why NVIDIA releases open datasets alongside open models, tools, and training techniques.
The AI data bottleneck
Building high-quality datasets remains one of the biggest bottlenecks in AI development. Organizations often spend millions of dollars and months, sometimes more than a year, collecting, annotating, and validating data before starting a single model training run. Even after models are deployed, access to domain expertise and evaluation frameworks remains a perennial challenge.
NVIDIA aims to alleviate this friction by publishing permissively licensed datasets on Hugging Face, along with training recipes and evaluation frameworks on GitHub, that developers can quickly build on. To date, we have shared over 2 petabytes of AI training data across more than 180 datasets and over 650 open models. And we’re just getting started.
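For developers who want to explore these releases, a minimal sketch of pulling one of the Hub datasets into a Python session might look like the following. The dataset ID shown is illustrative, not a specific recommendation; browse the NVIDIA organization on Hugging Face for the full catalog.

```python
# Minimal sketch: stream an NVIDIA open dataset from the Hugging Face Hub.
# The dataset ID below is illustrative; substitute any dataset from
# https://huggingface.co/nvidia that fits your task.
from datasets import load_dataset

# streaming=True lets you inspect records without downloading the full corpus.
ds = load_dataset("nvidia/Nemotron-CC-v2", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example)   # each record is a plain Python dict
    if i >= 2:       # peek at the first few rows only
        break
```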
Real-world open datasets
NVIDIA’s open data releases span multiple domains, from robotics and autonomous systems to sovereign AI, biology, and evaluation benchmarks. Built by teams across NVIDIA, these datasets demonstrate how shared data can accelerate real-world AI development.
Here are some examples from across the ecosystem.
Physics AI collection
Robotics systems require structured, multimodal data. This collection includes 15 TB of multimodal data: more than 500,000 robot trajectories, 57 million grasps, and the assets used to develop the NVIDIA GR00T vision-language-action reasoning models across multiple gripper types and sensor configurations. The collection has been downloaded more than 10 million times, including by companies such as Runway, which used the open GR00T dataset to develop its recently released GWM-Robotics world model, and Lightwheel, a robotics simulation company that uses it to refine its robot policies.
The collection also includes one of the most geographically diverse AV datasets available, with over 1,700 hours of multi-sensor data, including seven-camera configurations plus LiDAR and radar, collected across 25 countries and more than 2,500 cities. Its breadth supports perception benchmarking across a wide variety of driving environments and complements academic datasets with broader commercial availability.
Nemotron Persona Collection
Nemotron Personas is a family of fully synthetic persona datasets grounded in real-world population distributions, generating culturally authentic and diverse individuals across regions and languages at scale.
This collection supports sovereign AI development and currently includes the following population-scale datasets:
- US – 6 million personas
- Japan – 6 million personas
- India – 21 million personas
- Brazil – 6 million personas (developed with WideLabs)
- Singapore – 888,000 personas (developed with AI Singapore)
These datasets are already being used in real-world deployments around the world. CrowdStrike improved natural-language-to-CQL (CrowdStrike Query Language) conversion accuracy from 50.7% to 90.4% using 2 million personas. In Japan, NTT Data and APTO used the dataset to bootstrap domain-specific intelligence with minimal proprietary data, increasing legal QA accuracy from 15.3% to 79.3% and reducing attack success rates from 7% to 0%.
The dataset also supported the development of NVIDIA Nemotron-Nano-9B-v2-Japanese, a cutting-edge sub-10B model that reached the top of the Nejumi leaderboard.
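As a sketch of how these persona records can seed synthetic-data pipelines, the snippet below samples personas and turns each into a generation prompt. The Hub ID and the field name are assumptions; check the collection page for the exact per-region dataset names and schema.

```python
# Hedged sketch: sample persona records to seed synthetic-data prompts.
# "nvidia/Nemotron-Personas" and the "persona" field are assumptions;
# inspect the dataset card and record.keys() before relying on them.
from datasets import load_dataset

personas = load_dataset("nvidia/Nemotron-Personas", split="train", streaming=True)

def persona_prompt(record: dict) -> str:
    # Fall back gracefully if the field name differs in the actual schema.
    persona = record.get("persona", "a synthetic persona")
    return f"You are {persona}. Answer the user's question in character."

for i, record in enumerate(personas):
    print(persona_prompt(record))
    if i >= 1:
        break
```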
La Proteina
La Proteina is a fully synthetic, atomistic protein dataset designed for biological modeling and drug discovery workflows. With 455,000 structures and 73% greater structural diversity than prior state-of-the-art baselines, it provides ready-to-design molecular representations free of PII and licensing constraints. Its scientific results were made possible through open collaboration with researchers at Oxford, Mila, and CIFAR.
SPEED-Bench
SPEED-Bench is a standardized benchmark for evaluating speculative decoding performance. It features two splits: a quality split that maximizes semantic diversity across 11 text categories, and a throughput split organized into input sequence length buckets (1K to 32K) for building accurate throughput Pareto curves from real semantic data rather than random tokens. Already adopted internally as the primary benchmark for Nemotron MTP performance, SPEED-Bench gives teams a consistent methodology for evaluating draft-model performance across prompt complexity and context length.
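The throughput-split idea is easy to picture in code. The sketch below groups real prompts into input-sequence-length buckets; the bucket edges and the `measure_throughput` helper are assumptions for illustration, not part of SPEED-Bench itself.

```python
# Sketch: bucket real prompts by input sequence length (ISL) so throughput
# can be measured per bucket and traced as a Pareto curve.
from collections import defaultdict

BUCKET_EDGES = [1024, 2048, 4096, 8192, 16384, 32768]  # 1K to 32K tokens

def bucket_for(n_tokens: int) -> int:
    """Return the smallest bucket edge that fits the prompt length."""
    for edge in BUCKET_EDGES:
        if n_tokens <= edge:
            return edge
    return BUCKET_EDGES[-1]  # clamp anything longer into the top bucket

def build_throughput_splits(prompts, tokenizer):
    """Group prompts by tokenized input sequence length."""
    buckets = defaultdict(list)
    for p in prompts:
        buckets[bucket_for(len(tokenizer.encode(p)))].append(p)
    return buckets

# Usage sketch; measure_throughput() is a hypothetical benchmark harness:
# for edge, split in sorted(build_throughput_splits(prompts, tok).items()):
#     tps = measure_throughput(model, split)
#     print(f"ISL <= {edge}: {tps:.1f} tok/s")
```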
Acquisition-Synthesis-NVDocs-v1
This synthetic retrieval dataset provides 110,000 triplets of queries, passages, and answers generated from 15,000 files of NVIDIA public documentation. Designed to train and evaluate embedding and RAG systems, it features semantically rich QA pairs spanning multiple reasoning types (factual, relational, procedural, inferential, temporal, causal, and visual) as well as diverse query types such as structural, multi-hop, and contextual queries. In-domain fine-tuning of an embedding model shows significant improvement: fine-tuning nvidia/llama-nemotron-embed-1b-v2 on this dataset yielded an 11% increase in NDCG@10. The dataset can be generated in about 3 to 4 days, and fine-tuning takes about 2 hours on 8×A100 GPUs, allowing rapid iteration from dataset to deployed model.
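For readers unfamiliar with the metric behind that 11% figure, here is a small, self-contained NDCG@10 implementation; `ranked_ids` is the retriever's output ordering for one query and `relevant_ids` its gold set.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k for a single query."""
    relevant = set(relevant_ids)
    # DCG: gains of 1 for relevant hits, discounted by log2 of rank position.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # Ideal DCG: every relevant doc packed at the top of the ranking.
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# One of two relevant docs retrieved, at rank 2: NDCG@10 is roughly 0.39.
print(ndcg_at_k(["d3", "d1", "d9"], {"d1", "d2"}))
```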
Nemotron-ClimbMix
ClimbMix is a 400B-token pre-training dataset built with the CLIMB algorithm, which uses embedding-based clustering and iterative refinement to identify high-quality data mixtures for training language models. The dataset has already received strong attention in the community: Andrej Karpathy highlighted that Nemotron-ClimbMix brought the biggest improvement on the time-to-GPT-2 leaderboard, and it was adopted as the default data recipe for the NanoChat Speedrun, reducing H100 compute time by about 33% compared to the previous FineWeb-Edu setup. ClimbMix is released under the CC-BY-NC-4.0 license.
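To make the clustering step concrete, here is a hedged sketch of the embedding-cluster setup a CLIMB-style mixture search starts from. It uses random vectors in place of real document embeddings and elides the iterative refinement loop (training proxy models and re-scoring the weights).

```python
# Sketch of the embedding-cluster step behind a CLIMB-style mixture search:
# cluster document embeddings, then treat per-cluster sampling weights as the
# mixture to optimize. Refinement against proxy-model performance is elided.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 384))  # stand-in for document embeddings

n_clusters = 16
labels = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(embeddings)

# Start from the empirical cluster proportions; a CLIMB-style search would
# iteratively re-weight these based on downstream proxy evaluations.
weights = np.bincount(labels, minlength=n_clusters).astype(float)
weights /= weights.sum()
print({f"cluster_{i}": round(w, 3) for i, w in enumerate(weights)})
```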
These releases reflect continued investment in the shared reference layer that AI developers rely on across modalities and model lifecycle stages.
Nemotron training datasets
One of the key components of NVIDIA’s open data work is the set of datasets used to train and tune the Nemotron family of models. Over the past year, these datasets have evolved to better support reasoning, coding, and multilingual capabilities in frontier language models.
Evolution of Nemotron Pre-Training
[Figure: evolution of the Nemotron pre-training data mix]
While previous releases relied heavily on general web corpora, the new release shifts weight toward higher-value domains such as math, code, and STEM knowledge. This increase in signal density lets the model learn stronger reasoning and problem-solving abilities.
The Nemotron pre-training stack includes several carefully curated datasets, each designed for a different function (a sketch of combining them follows the list):
- Nemotron-CC – Globally deduplicated web data, rewritten to increase signal density
- Nemotron-CC-Math and Nemotron-CC-Code – Math and code reasoning data that preserve LaTeX and code formatting
- Nemotron-Pretraining-Code – Curated programming datasets drawn from large code repositories
- Nemotron-Pretraining-Specialized – Synthetic datasets spanning algorithms, economics, logic, and STEM that power key capabilities such as reasoning
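As an illustration of how such a stack can be combined, the sketch below interleaves several sources into one weighted stream. The Hub IDs and mixture probabilities are placeholders, not the published Nemotron recipe, and in practice each source usually needs normalizing to a shared text column first.

```python
from datasets import interleave_datasets, load_dataset

# Dataset IDs are placeholders based on the list above; verify the exact
# Hub names on https://huggingface.co/nvidia before use.
web = load_dataset("nvidia/Nemotron-CC-v2", split="train", streaming=True)
math_ds = load_dataset("nvidia/Nemotron-CC-Math", split="train", streaming=True)
code = load_dataset("nvidia/Nemotron-Pretraining-Code", split="train", streaming=True)

# Probabilities control how often each source is sampled during training.
# In practice, map each source to a shared {"text": ...} schema first, since
# interleaving expects compatible columns across sources.
mix = interleave_datasets([web, math_ds, code], probabilities=[0.6, 0.2, 0.2], seed=42)

for i, example in enumerate(mix):
    print(sorted(example.keys()))
    if i >= 2:
        break
```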
Together, these datasets form the basis for general-purpose language models capable of reasoning, coding, and multilingual understanding. They power not only Nemotron but also frontier models from partners, such as Primus-Labor-70B from AI security company Trend Micro.
Evolution of Nemotron post-training

[Figure: evolution of the Nemotron post-training data mix]
As a model’s capabilities improve, post-training data plays a larger role in shaping its behavior. The new release emphasizes multilingual versatility, structured reasoning supervision, and agent-style interaction data.
Key datasets in the Nemotron post-training stack include:
- Nemotron-Instruction-Following-Chat – Structured conversational supervision
- Nemotron-Science – Comprehensive scientific reasoning dataset
- Nemotron-Math-Proofs – Formal mathematical reasoning dataset
- Nemotron-Agentic – Dataset supporting multi-step planning and tool use
- Nemotron-SWE – Instruction-tuning dataset for software engineering tasks
These datasets provide structured supervision that helps models follow complex instructions, produce reasoning traces, and reliably perform multi-step tasks. Early iterations were blended with domain data to develop ServiceNow’s Apriel Nemotron 15B / Apriel 1.6 Thinker, which outperforms Gemini 2.5 Flash and Qwen3 at the 15B-parameter scale, and Hugging Face’s SmolLM3, a popular small language model.
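To illustrate what structured supervision with reasoning traces can look like, here is a generic record shape; the field names are assumptions for illustration, not the exact Nemotron schema.

```python
# Illustrative post-training record with a reasoning trace. Field names are
# generic assumptions, not the exact Nemotron schema.
sample = {
    "messages": [
        {"role": "system", "content": "You are a careful assistant. Think step by step."},
        {"role": "user", "content": "Is 391 prime?"},
        {
            "role": "assistant",
            # Structured supervision pairs the final answer with the
            # reasoning trace the model is trained to produce first.
            "reasoning": "391 = 17 * 23, so it has nontrivial factors.",
            "content": "No. 391 factors as 17 x 23, so it is not prime.",
        },
    ],
    "category": "math",
}
print(sample["messages"][-1]["content"])
```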
NVIDIA is also extending this work with open safety and reinforcement learning datasets, including Nemotron-Agentic-Safety, 11K labeled telemetry traces from tool-use workflows, and Nemotron-RL, a 900K-task corpus spanning math, coding, tools, puzzles, and reasoning that provides a true training “gym” for models.
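The “gym” framing suggests tasks paired with programmatic verifiers that score model outputs. The sketch below shows that pattern under assumed names (`Task`, `rollout`); it is not the Nemotron-RL schema.

```python
# Hedged sketch of a verifier-based RL "gym": each task pairs a prompt with a
# programmatic check that returns a scalar reward for the model's output.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    verify: Callable[[str], float]  # programmatic check returning a reward

# Two toy tasks with string-match verifiers; real verifiers would execute
# code, check proofs, or validate tool calls.
tasks = [
    Task("Compute 12 * 13.", lambda out: 1.0 if "156" in out else 0.0),
    Task("Name the capital of France.", lambda out: 1.0 if "Paris" in out else 0.0),
]

def rollout(policy, tasks):
    """Score a policy (any str -> str callable) against every task."""
    return [t.verify(policy(t.prompt)) for t in tasks]

print(rollout(lambda p: "156" if "12" in p else "Paris", tasks))  # [1.0, 1.0]
```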
Extreme co-design
Designing high-quality datasets at this scale is a team sport. It requires close collaboration between data strategists, AI researchers, infrastructure engineers, and policy experts.
At NVIDIA, we approach data the same way we approach software and hardware engineering problems: through what we call extreme co-design, where all components are designed together to eliminate bottlenecks at scale.
Whenever possible, we release each dataset along with the methods behind it. The open community and partners then stress-test it, uncover edge cases, and extend it to new domains. These insights feed directly into the next iteration, improving both our internal systems and the broader AI ecosystem.

[Video: NVIDIA CES 2026 keynote]
NVIDIA also works with partners through initiatives such as ViDoRe and CVDP. The two consortia bring together industry and academic partners to develop open benchmarking and evaluation frameworks for emerging AI systems.
Start cooking in the open kitchen
At NVIDIA, we think of open data like an open kitchen: the ingredients are on display, the recipes are shared, and everyone can see how the dish is made.
If you’re passionate about data science and model building, explore NVIDIA’s open datasets on Hugging Face, try our tutorials and Nemotron labs, and join the Nemotron community on Discord to collaborate on future datasets.
The next generation of trusted AI models and agent systems will be built on a shared foundation. Open data is a core part of that foundation.

