Co-design data for sovereign AI

A composite AI approach to Brazilian-Portuguese personas based on real-world distribution

Grounding Brazil’s AI with real data

Building AI systems that serve citizens requires data that reflects local languages, demographics, and cultural backgrounds. For Brazil, a country with more than 200 million people living in different regions, this issue remains a persistent challenge as much of today’s high-quality training data is English-centric or not commercially available.

Nemotron-Persona-Brazil helps bridge that gap. This is an open dataset (CC BY 4.0) of 6 million fully synthetic personas statistically based on official census and labor data from the Brazilian Institute of Geography and Statistics (IBGE). All personas are aligned to real-world demographics, geographic distribution, and occupational distribution, but do not represent real people.

This release expands on NVIDIA’s ever-expanding Nemotron-Persona collection, which already includes the US, Japan, India, and Singapore. Similar to other datasets in the collection, the Brazil dataset covers attributes such as age, gender, education, occupation, and location.

This dataset is designed for Brazilian developers and researchers building sovereign AI, and is locally rooted, culturally informed, and uses commercially available data (CC BY 4.0). It was built in collaboration with WideLabs, an NVIDIA Inception member with extensive experience supporting AI adoption in government and regulatory sectors across Latin America.

What does the dataset contain?

overview:

6 million Brazilian personas (1 million records each x 6 personas) Total approximately 1.4 billion tokens (including approximately 450 million persona tokens) 20 fields per record: 6 persona fields + 14 context fields based on official statistics Total geographic coverage: All 26 Brazilian states + Federal District Up to 457,000 Unique Portuguese names for over 1,500 occupational categories reflecting Brazil’s workforce and multiple persona types: professional, sports, arts, and travel among others.

Each persona is written in natural Brazilian Portuguese and includes their cultural background, skills, goals, hobbies, and interests.

Construction method

data generation pipeline

Nemotron-Personas-Brazil was built using NeMo Data Designer, NVIDIA’s complex AI system for synthetic data generation. This pipeline supports the structured generation, validation, and retry mechanisms required to generate large-scale, population-aware datasets.

Key components include:

Probabilistic Graphical Models for Statistical Reasoning (Apache-2.0) GPT-OSS-120B for Narrative Generation in Brazilian Portuguese (Apache-2.0)

An enhanced version of Nemotron-Personalas-Brazil is now available directly within NeMo Data Designer, allowing developers to generate, refine, and extend Brazilian-Portuguese personas as part of their own synthetic data pipelines.

enhanced cultural background

To understand the sociodemographic and geographic diversity and complexity of Brazil’s population, Nemotron-Persona-Brazil leveraged census and labor data published by the Brazilian Institute of Geography and Statistics (IBGE).

Geography – Personas are anchored at the state and local government level, reflecting regional differences across Brazil’s five macro-regions. Occupation – Go beyond job titles to include skills, expertise, and career trajectories such as micro-entrepreneurship or regional trade. Life Stage – Incorporates student status, unemployment, and retirement to reflect real-world demographics. Cultural characteristics – Natural language personas capture lifestyle aspects such as Brazilian social norms, interests, arts, sports, and travel. Language fidelity – All personas are written in natural Brazilian Portuguese, reflecting local naming conventions and communication styles.

The result is a dataset that is statistically grounded, culturally representative, and fully synthetic by design.

private by design

This dataset does not contain any personally identifiable information. We use actual age, name, and occupation distributions from official public sources, but are not associated with any real person, living or dead. All personas are fully synthetic, so they can be trained on authentic cultural patterns without compromising privacy.

Who is this data for?

Nemotron-Personalas-Brazil is primarily designed for Brazilian developers and researchers building sovereign AI systems. This dataset addresses the gap left by predominantly English training corpora by providing high-quality, population-representative data in Brazilian Portuguese.

Developers around the world can also leverage this dataset to improve the performance of their models and make adjustments in the Brazilian cultural and linguistic context.

Practical applications of AI

Multi-turn conversations: Use personas as seeds to generate authentic interaction datasets Domain-specific training: Build culturally aware AI assistants Bias testing and fairness: Evaluate model performance across rural and urban populations, age groups, and education levels to ensure AI works fairly across all strata of Brazilian society

why is it important

AI model builders have long struggled to access diverse, high-quality training data that reflects real-world populations. Proprietary datasets dominate enterprise AI, creating a barrier for researchers, startups, and developers from underrepresented regions.

Data diversity: Prevents narrow training and model collapse by reflecting Brazil’s entire population spectrum Cultural authenticity: Reduces reliance on Western-centric datasets and supports sovereign AI development Privacy protection: Designed to meet Brazil’s data protection requirements and emerging AI governance standards

By releasing Nemotron-Personas-Brazil under CC BY 4.0, we are democratizing access to enterprise-grade synthetic data and empowering anyone to build culturally authentic AI without cost, privacy concerns, or geographic barriers.

Nemotron – Persona – Start building in Brazil

You can load datasets directly from Hugging Face.

from dataset import Load DatasetDataset = LoadDataset(“nvidia/nemotron-personas-brazil”)

Interested in learning more about NVIDIA’s open data products or co-designing future datasets? Join the conversation on NVIDIA’s Discord.

versatileai

See Full Bio

What's Hot

Musk and Zuckerberg convinced Trump to repeal AI executive order

Introducing Gemini Omni

IMDA updates AI framework, OpenAI opens Singapore AI Lab

Musk and Zuckerberg convinced Trump to repeal AI executive order

Introducing Gemini Omni

IMDA updates AI framework, OpenAI opens Singapore AI Lab

Edimakor V4.2.0 unveils AI video tools at VEO 3

Pillar Security raises $9 million to create AI security guardrails for businesses

10 Best AI for PowerPoint presentations

Most Popular

Edimakor V4.2.0 unveils AI video tools at VEO 3

Pillar Security raises $9 million to create AI security guardrails for businesses

10 Best AI for PowerPoint presentations

Don't Miss

Musk and Zuckerberg convinced Trump to repeal AI executive order

Introducing Gemini Omni

IMDA updates AI framework, OpenAI opens Singapore AI Lab

Subscribe to Updates

What's Hot

Co-design data for sovereign AI

Grounding Brazil’s AI with real data

What does the dataset contain?

overview:

Construction method

data generation pipeline

enhanced cultural background

private by design

Who is this data for?

Practical applications of AI

why is it important

Nemotron – Persona – Start building in Brazil

Related Posts