Co-design data for sovereign AI

By versatileai, January 28, 2026
A compound AI approach to Brazilian Portuguese personas grounded in real-world distributions

Grounding Brazil’s AI with real data

Building AI systems that serve citizens requires data that reflects local languages, demographics, and cultural backgrounds. For Brazil, a country of more than 200 million people spread across diverse regions, this remains a persistent challenge, as much of today’s high-quality training data is English-centric or not available for commercial use.

Nemotron-Personas-Brazil helps bridge that gap. It is an open dataset (CC BY 4.0) of 6 million fully synthetic personas statistically grounded in official census and labor data from the Brazilian Institute of Geography and Statistics (IBGE). All personas are aligned to real-world demographic, geographic, and occupational distributions, but do not represent real people.

This release extends NVIDIA’s growing Nemotron-Personas collection, which already includes the US, Japan, India, and Singapore. Like the other datasets in the collection, the Brazil dataset covers attributes such as age, gender, education, occupation, and location.

This dataset is designed for Brazilian developers and researchers building sovereign AI: it is locally rooted, culturally informed, and licensed for commercial use (CC BY 4.0). It was built in collaboration with WideLabs, an NVIDIA Inception member with extensive experience supporting AI adoption in government and regulatory sectors across Latin America.

What does the dataset contain?

Overview:

  • 6 million Brazilian personas (1 million records x 6 personas each)
  • Approximately 1.4 billion tokens in total (including approximately 450 million persona tokens)
  • 20 fields per record: 6 persona fields + 14 context fields based on official statistics
  • Full geographic coverage: all 26 Brazilian states + the Federal District
  • Up to 457,000 unique Portuguese names
  • Over 1,500 occupational categories reflecting Brazil’s workforce
  • Multiple persona types: professional, sports, arts, and travel, among others

Each persona is written in natural Brazilian Portuguese and includes their cultural background, skills, goals, hobbies, and interests.

Construction method

Data generation pipeline

Nemotron-Personas-Brazil was built using NeMo Data Designer, NVIDIA’s compound AI system for synthetic data generation. The pipeline provides the structured generation, validation, and retry mechanisms required to produce large-scale, population-aware datasets.
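
The generate, validate, and retry pattern mentioned above can be illustrated with a minimal sketch. This is not NeMo Data Designer's API: `generate_with_retry`, `toy_generate`, and `toy_validate` are hypothetical names, and the toy generator merely stands in for a model call.

```python
import random
from typing import Callable, Optional


def generate_with_retry(
    generate: Callable[[], dict],
    validate: Callable[[dict], bool],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call `generate` until `validate` accepts the record or attempts run out."""
    for _ in range(max_attempts):
        record = generate()
        if validate(record):
            return record
    return None  # the caller decides whether to drop or log the failure


# Hypothetical stand-ins for a model call and a schema check.
def toy_generate(rng: random.Random) -> dict:
    # Sometimes emits an invalid age so the retry path is exercised.
    return {"age": rng.choice([-1, 34, 52]), "persona": "Texto de exemplo."}


def toy_validate(record: dict) -> bool:
    # Accept only plausible ages and non-empty persona text.
    return 0 <= record["age"] <= 120 and bool(record["persona"].strip())


rng = random.Random(0)
result = generate_with_retry(lambda: toy_generate(rng), toy_validate, max_attempts=5)
```

Returning `None` after the last failed attempt keeps the failure policy (skip, log, or raise) with the caller rather than inside the loop.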

Key components include:

  • Probabilistic Graphical Models for Statistical Reasoning (Apache-2.0)
  • GPT-OSS-120B for Narrative Generation in Brazilian Portuguese (Apache-2.0)
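
To give a feel for how a probabilistic graphical model can ground attributes statistically, here is a minimal sketch of ancestral sampling from a two-node model (region, then occupation given region). The structure, the category names, and every probability below are made up for illustration; they are not IBGE statistics and not the model used to build the dataset.

```python
import random

# Illustrative numbers only; NOT real census or labor statistics.
P_REGION = {"Sudeste": 0.5, "Nordeste": 0.3, "Sul": 0.2}
P_OCCUPATION_GIVEN_REGION = {
    "Sudeste": {"professor": 0.4, "comerciante": 0.6},
    "Nordeste": {"professor": 0.5, "agricultor": 0.5},
    "Sul": {"agricultor": 0.7, "comerciante": 0.3},
}


def sample_categorical(dist: dict, rng: random.Random) -> str:
    """Draw one value from a {value: probability} table."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_attributes(rng: random.Random) -> dict:
    """Sample the parent first, then the child conditioned on it."""
    region = sample_categorical(P_REGION, rng)
    occupation = sample_categorical(P_OCCUPATION_GIVEN_REGION[region], rng)
    return {"region": region, "occupation": occupation}


rng = random.Random(42)
samples = [sample_attributes(rng) for _ in range(10_000)]
# With enough samples, the empirical share approaches the table value.
share_sudeste = sum(s["region"] == "Sudeste" for s in samples) / len(samples)
```

In a setup like this, sampled attributes would then be handed to a language model to write the narrative persona, which keeps the statistics and the prose generation separate.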

An enhanced version of Nemotron-Personas-Brazil is now available directly within NeMo Data Designer, allowing developers to generate, refine, and extend Brazilian Portuguese personas as part of their own synthetic data pipelines.

Enhanced cultural background

To capture the sociodemographic and geographic diversity and complexity of Brazil’s population, Nemotron-Personas-Brazil draws on census and labor data published by the Brazilian Institute of Geography and Statistics (IBGE).

  • Geography: Personas are anchored at the state and municipal level, reflecting regional differences across Brazil’s five macro-regions.
  • Occupation: Goes beyond job titles to include skills, expertise, and career trajectories such as micro-entrepreneurship or regional trade.
  • Life stage: Incorporates student status, unemployment, and retirement to reflect real-world demographics.
  • Cultural characteristics: Natural-language personas capture lifestyle aspects such as Brazilian social norms, interests, arts, sports, and travel.
  • Language fidelity: All personas are written in natural Brazilian Portuguese, reflecting local naming conventions and communication styles.

The result is a dataset that is statistically grounded, culturally representative, and fully synthetic by design.

Private by design

This dataset does not contain any personally identifiable information. It uses actual age, name, and occupation distributions from official public sources, but no persona is associated with any real person, living or dead. Because all personas are fully synthetic, models can be trained on authentic cultural patterns without compromising privacy.

Who is this data for?

Nemotron-Personas-Brazil is primarily designed for Brazilian developers and researchers building sovereign AI systems. This dataset addresses the gap left by predominantly English training corpora by providing high-quality, population-representative data in Brazilian Portuguese.

Developers around the world can also use this dataset to improve how their models perform in, and adapt to, the Brazilian cultural and linguistic context.

Practical applications of AI

  • Multi-turn conversations: Use personas as seeds to generate authentic interaction datasets
  • Domain-specific training: Build culturally aware AI assistants
  • Bias testing and fairness: Evaluate model performance across rural and urban populations, age groups, and education levels to ensure AI works fairly across all strata of Brazilian society
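
As one possible shape for the bias-testing use case, the sketch below groups evaluation results by a demographic attribute and reports per-group accuracy plus the largest gap between groups. The field names `area` and `correct` and the toy results are assumptions for illustration; in practice each row would pair a persona attribute with a model's graded response.

```python
from collections import defaultdict


def accuracy_by_group(results: list, group_key: str) -> dict:
    """Return {group: accuracy} for rows carrying a boolean 'correct' flag."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in results:
        group = row[group_key]
        totals[group] += 1
        correct[group] += int(row["correct"])
    return {group: correct[group] / totals[group] for group in totals}


def max_gap(scores: dict) -> float:
    """Largest accuracy difference between any two groups."""
    return max(scores.values()) - min(scores.values())


# Toy, hand-written results; not measurements from any real model.
results = [
    {"area": "urban", "correct": True},
    {"area": "urban", "correct": True},
    {"area": "rural", "correct": True},
    {"area": "rural", "correct": False},
]
scores = accuracy_by_group(results, "area")
gap = max_gap(scores)
```

The same function can be reused with other grouping keys, such as an age band or education level, to compare strata one attribute at a time.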

Why is it important?

AI model builders have long struggled to access diverse, high-quality training data that reflects real-world populations. Proprietary datasets dominate enterprise AI, creating a barrier for researchers, startups, and developers from underrepresented regions.

  • Data diversity: Prevents narrow training and model collapse by reflecting Brazil’s entire population spectrum
  • Cultural authenticity: Reduces reliance on Western-centric datasets and supports sovereign AI development
  • Privacy protection: Designed to meet Brazil’s data protection requirements and emerging AI governance standards

By releasing Nemotron-Personas-Brazil under CC BY 4.0, we are democratizing access to enterprise-grade synthetic data and empowering anyone to build culturally authentic AI without cost, privacy concerns, or geographic barriers.

Start building with Nemotron-Personas-Brazil

You can load the dataset directly from Hugging Face.

    from datasets import load_dataset

    dataset = load_dataset("nvidia/nemotron-personas-brazil")
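
Once records are loaded, a persona can serve as the seed for a conversation prompt. The sketch below works on plain dictionaries so it runs without downloading anything. The field names `state`, `occupation`, and `persona` and the two sample records are hypothetical; check the dataset's actual column names (for example via its `column_names` attribute) before adapting it.

```python
def filter_by(records: list, field: str, value: str) -> list:
    """Keep the records whose `field` equals `value`."""
    return [r for r in records if r.get(field) == value]


def build_seed_prompt(record: dict, persona_field: str = "persona") -> str:
    """Wrap a persona description in a short Portuguese instruction."""
    return (
        "Você é a seguinte pessoa:\n"  # "You are the following person:"
        f"{record[persona_field]}\n\n"
        # "Start a natural conversation with a virtual assistant."
        "Inicie uma conversa natural com um assistente virtual."
    )


# Hand-written placeholder records; not taken from the dataset.
sample_records = [
    {"state": "Bahia", "occupation": "professora", "persona": "Persona de exemplo A."},
    {"state": "Paraná", "occupation": "agricultor", "persona": "Persona de exemplo B."},
]
bahia = filter_by(sample_records, "state", "Bahia")
prompt = build_seed_prompt(bahia[0])
```

Keeping the field name as a parameter makes it easy to switch between persona types if the dataset stores each one in its own column.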

Interested in learning more about NVIDIA’s open data products or co-designing future datasets? Join the conversation on NVIDIA’s Discord.
