Close Menu
Versa AI hub
  • AI Ethics
  • AI Legislation
  • Business
  • Cybersecurity
  • Media and Entertainment
  • Content Creation
  • Art Generation
  • Research
  • Tools
  • Resources

Subscribe to Updates

Subscribe to our newsletter and stay updated with the latest news and exclusive offers.

What's Hot

Gemini 2.0 Flash Native Image Generation Experiment

April 2, 2026

Inside the AI ​​agent strategy that helps companies improve their profitability

April 1, 2026

Storage bucket now available on Hug Face Hub

March 30, 2026
Facebook X (Twitter) Instagram
Versa AI hubVersa AI hub
Friday, April 3
Facebook X (Twitter) Instagram
Login
  • AI Ethics
  • AI Legislation
  • Business
  • Cybersecurity
  • Media and Entertainment
  • Content Creation
  • Art Generation
  • Research
  • Tools
  • Resources
Versa AI hub
Home»Tools»Co-design data for sovereign AI
Tools

Co-design data for sovereign AI

versatileaiBy versatileaiJanuary 28, 2026No Comments5 Mins Read
Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
#image_title
Share
Facebook Twitter LinkedIn Pinterest Email

brazil 2
A composite AI approach to Brazilian-Portuguese personas based on real-world distribution

Grounding Brazil’s AI with real data

Building AI systems that serve citizens requires data that reflects local languages, demographics, and cultural backgrounds. For Brazil, a country with more than 200 million people living in different regions, this issue remains a persistent challenge as much of today’s high-quality training data is English-centric or not commercially available.

Nemotron-Persona-Brazil helps bridge that gap. This is an open dataset (CC BY 4.0) of 6 million fully synthetic personas statistically based on official census and labor data from the Brazilian Institute of Geography and Statistics (IBGE). All personas are aligned to real-world demographics, geographic distribution, and occupational distribution, but do not represent real people.

This release expands on NVIDIA’s ever-expanding Nemotron-Persona collection, which already includes the US, Japan, India, and Singapore. Similar to other datasets in the collection, the Brazil dataset covers attributes such as age, gender, education, occupation, and location.

This dataset is designed for Brazilian developers and researchers building sovereign AI, and is locally rooted, culturally informed, and uses commercially available data (CC BY 4.0). It was built in collaboration with WideLabs, an NVIDIA Inception member with extensive experience supporting AI adoption in government and regulatory sectors across Latin America.

What does the dataset contain?

Screenshot 2026-01-27 PM 4.05.28

overview:

6 million Brazilian personas (1 million records each x 6 personas) Total approximately 1.4 billion tokens (including approximately 450 million persona tokens) 20 fields per record: 6 persona fields + 14 context fields based on official statistics Total geographic coverage: All 26 Brazilian states + Federal District Up to 457,000 Unique Portuguese names for over 1,500 occupational categories reflecting Brazil’s workforce and multiple persona types: professional, sports, arts, and travel among others.

Each persona is written in natural Brazilian Portuguese and includes their cultural background, skills, goals, hobbies, and interests.

Construction method

data generation pipeline

Nemotron-Personas-Brazil was built using NeMo Data Designer, NVIDIA’s complex AI system for synthetic data generation. This pipeline supports the structured generation, validation, and retry mechanisms required to generate large-scale, population-aware datasets.

Key components include:

Probabilistic Graphical Models for Statistical Reasoning (Apache-2.0) GPT-OSS-120B for Narrative Generation in Brazilian Portuguese (Apache-2.0)

An enhanced version of Nemotron-Personalas-Brazil is now available directly within NeMo Data Designer, allowing developers to generate, refine, and extend Brazilian-Portuguese personas as part of their own synthetic data pipelines.

enhanced cultural background

To understand the sociodemographic and geographic diversity and complexity of Brazil’s population, Nemotron-Persona-Brazil leveraged census and labor data published by the Brazilian Institute of Geography and Statistics (IBGE).

Geography – Personas are anchored at the state and local government level, reflecting regional differences across Brazil’s five macro-regions. Occupation – Go beyond job titles to include skills, expertise, and career trajectories such as micro-entrepreneurship or regional trade. Life Stage – Incorporates student status, unemployment, and retirement to reflect real-world demographics. Cultural characteristics – Natural language personas capture lifestyle aspects such as Brazilian social norms, interests, arts, sports, and travel. Language fidelity – All personas are written in natural Brazilian Portuguese, reflecting local naming conventions and communication styles.

The result is a dataset that is statistically grounded, culturally representative, and fully synthetic by design.

private by design

This dataset does not contain any personally identifiable information. We use actual age, name, and occupation distributions from official public sources, but are not associated with any real person, living or dead. All personas are fully synthetic, so they can be trained on authentic cultural patterns without compromising privacy.

Who is this data for?

Nemotron-Personalas-Brazil is primarily designed for Brazilian developers and researchers building sovereign AI systems. This dataset addresses the gap left by predominantly English training corpora by providing high-quality, population-representative data in Brazilian Portuguese.

Developers around the world can also leverage this dataset to improve the performance of their models and make adjustments in the Brazilian cultural and linguistic context.

Practical applications of AI

Multi-turn conversations: Use personas as seeds to generate authentic interaction datasets Domain-specific training: Build culturally aware AI assistants Bias testing and fairness: Evaluate model performance across rural and urban populations, age groups, and education levels to ensure AI works fairly across all strata of Brazilian society

why is it important

AI model builders have long struggled to access diverse, high-quality training data that reflects real-world populations. Proprietary datasets dominate enterprise AI, creating a barrier for researchers, startups, and developers from underrepresented regions.

Data diversity: Prevents narrow training and model collapse by reflecting Brazil’s entire population spectrum Cultural authenticity: Reduces reliance on Western-centric datasets and supports sovereign AI development Privacy protection: Designed to meet Brazil’s data protection requirements and emerging AI governance standards

By releasing Nemotron-Personas-Brazil under CC BY 4.0, we are democratizing access to enterprise-grade synthetic data and empowering anyone to build culturally authentic AI without cost, privacy concerns, or geographic barriers.

Nemotron – Persona – Start building in Brazil

You can load datasets directly from Hugging Face.

from dataset import Load DatasetDataset = LoadDataset(“nvidia/nemotron-personas-brazil”)

Interested in learning more about NVIDIA’s open data products or co-designing future datasets? Join the conversation on NVIDIA’s Discord.

author avatar
versatileai
See Full Bio
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
Previous ArticleTurkiye Parliamentary Committee calls for new world-leading AI law
Next Article This country forced labels on AI content — who’s next?
versatileai

Related Posts

Tools

Gemini 2.0 Flash Native Image Generation Experiment

April 2, 2026
Tools

Inside the AI ​​agent strategy that helps companies improve their profitability

April 1, 2026
Tools

Storage bucket now available on Hug Face Hub

March 30, 2026
Add A Comment

Comments are closed.

Top Posts

We had Claude fine-tune our open source LLM

December 5, 202513 Views

Build a great dataset for video generation

February 12, 202513 Views

Faster Text Generation with Self-Speculative Decoding

February 13, 202512 Views
Stay In Touch
  • YouTube
  • TikTok
  • Twitter
  • Instagram
  • Threads
Latest Reviews

Subscribe to Updates

Subscribe to our newsletter and stay updated with the latest news and exclusive offers.

Most Popular

We had Claude fine-tune our open source LLM

December 5, 202513 Views

Build a great dataset for video generation

February 12, 202513 Views

Faster Text Generation with Self-Speculative Decoding

February 13, 202512 Views
Don't Miss

Gemini 2.0 Flash Native Image Generation Experiment

April 2, 2026

Inside the AI ​​agent strategy that helps companies improve their profitability

April 1, 2026

Storage bucket now available on Hug Face Hub

March 30, 2026
Service Area
X (Twitter) Instagram YouTube TikTok Threads RSS
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer
© 2026 Versa AI Hub. All Rights Reserved.

Type above and press Enter to search. Press Esc to cancel.

Sign In or Register

Welcome Back!

Login to your account below.

Lost password?