Synthetic data for sovereign AI

A compound AI approach to Indian personas grounded in real-world distributions.

Open data for India’s AI future

With over 700 million internet users, numerous languages, and a rapidly growing developer ecosystem, India represents one of the world’s greatest opportunities for AI. However, most open datasets reflect Western norms and English-only contexts, creating data gaps that limit AI adoption in India’s multilingual, multiscript environment.

Today, we are releasing Nemotron-Personas-India, the first open synthetic dataset of Indian personas aligned to India’s real-world demographic, geographic, and cultural distributions. Licensed under CC BY 4.0, this dataset provides a privacy-preserving, regulation-friendly foundation for scaling AI systems that reflect Indian society without relying on sensitive personal data.

Built with NeMo Data Designer, NVIDIA’s enterprise-grade synthetic data generation microservice, Nemotron-Personas-India expands the global collection of sovereign AI datasets. It builds on the success of the US and Japan persona datasets and adds new features designed specifically for India’s culturally rich landscape.

This dataset integrates seamlessly with Nemotron models and other open source LLMs, making it easy to fine-tune AI systems for Indian use cases, from multilingual chatbots to culturally grounded expert co-pilots.

This release complements our previous suite of Hindi evaluation datasets, including ChatRAG-Hi, IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, and BFCL-Hi, supporting a complete pipeline from synthetic data generation to rigorous model evaluation for Indian AI systems.

What does the dataset contain?

21 million total personas (3 million records x 7 personas each)
Multilingual support: English and Hindi, with Hindi in both Devanagari and Latin scripts
27 fields per record: persona characteristics plus contextual attributes grounded in official census and labor statistics, including age, gender, education, occupation, state, and district
7.7 billion tokens in total, of which 2.9 billion are persona tokens
  English: 1 billion total tokens, 394 million persona tokens
  Hindi (Devanagari): 4.7 billion total tokens, 1.8 billion persona tokens
  Hindi (Latin): 2 billion total tokens, 746 million persona tokens
Approximately 560,000 unique full names reflecting India’s vast linguistic diversity
29,000 occupational categories spanning informal, formal, and traditional fields
All 36 states and union territories and 640 districts of India represented
Natural language fields: cultural background, linguistic background, skills and expertise, hobbies and interests
Persona types: general, professional, linguistic, culinary, sports, arts, and travel personas
Licensed under CC BY 4.0 for commercial and non-commercial use
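The headline figures above can be cross-checked with a few lines of arithmetic. The snippet below is only a sanity check on the numbers quoted in this post, expressed in millions of tokens; the variable names are illustrative and are not part of the dataset.

```python
# Figures quoted in this post, expressed in millions of tokens.
total_tokens = {"en": 1_000, "hi_deva": 4_700, "hi_latn": 2_000}
persona_tokens = {"en": 394, "hi_deva": 1_800, "hi_latn": 746}

records = 3_000_000
personas_per_record = 7

# 3 million records x 7 personas each.
total_personas = records * personas_per_record

# Sum the per-language figures and compare them with the stated totals.
total_m = sum(total_tokens.values())
persona_m = sum(persona_tokens.values())
persona_share = persona_m / total_m

print(f"personas: {total_personas:,}")
print(f"total tokens: {total_m / 1000:.1f}B")
print(f"persona tokens: {persona_m / 1000:.1f}B ({persona_share:.0%} of total)")
```

The per-language figures add up to the stated totals, and persona text accounts for a little over a third of all tokens.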

Construction method

Data generation pipeline

The dataset was created using NeMo Data Designer, NVIDIA’s synthetic data generation microservice. This compound AI system provides Jinja templating for complex generation, Pydantic validation, structured outputs, automatic retries, and support for multiple generation backends: the tools needed to produce synthetic datasets at this scale. We also used the following models:

Probabilistic graphical models (Apache-2.0) for statistical reasoning
GPT-OSS-120B (Apache-2.0) for narrative generation in English, Hindi (Devanagari), and Hindi (Latin)
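The template, validate, and retry loop described above can be pictured roughly as follows. This is a minimal sketch and not NeMo Data Designer’s API: to stay within the standard library, string.Template stands in for Jinja and a small dataclass stands in for a Pydantic model. The Persona class, generate_with_retries, and the flaky backend are all hypothetical names invented for illustration.

```python
import json
from dataclasses import dataclass
from string import Template

# Stand-in for a Jinja prompt template (stdlib only, for illustration).
PROMPT = Template("Write a short persona for a $age-year-old $occupation from $state.")


@dataclass
class Persona:
    """Stand-in for a Pydantic model describing the structured output."""
    name: str
    summary: str

    @classmethod
    def parse(cls, raw: str) -> "Persona":
        data = json.loads(raw)  # malformed JSON raises a ValueError subclass
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        for key in ("name", "summary"):
            value = data.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"field {key!r} must be a non-empty string")
        return cls(name=data["name"], summary=data["summary"])


def generate_with_retries(backend, attributes, max_attempts=3):
    """Render the prompt, call the backend, and retry until the output validates."""
    prompt = PROMPT.substitute(attributes)
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = backend(prompt)
        try:
            return Persona.parse(raw), attempt
        except ValueError as err:
            last_error = err  # invalid output: try again
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


def make_flaky_backend():
    """Hypothetical backend: returns malformed output once, then valid JSON."""
    calls = {"n": 0}

    def backend(prompt):
        calls["n"] += 1
        if calls["n"] == 1:
            return "not json"
        return json.dumps({"name": "Example Persona", "summary": f"Generated for: {prompt}"})

    return backend


persona, attempts = generate_with_retries(
    make_flaky_backend(),
    {"age": 34, "occupation": "tailor", "state": "Gujarat"},
)
print(attempts, persona.name)
```

Because the backend is passed in as a plain callable, the same loop could sit in front of different generation backends, which is the property the paragraph above attributes to the microservice.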

Embedded cultural background

This dataset is aligned with India’s official demographic distributions from the 2011 Census and has been extended with attributes essential for reliable AI training.

Education: Expanded degree levels to reflect India’s diverse educational pathways
Occupation: Formal, informal, and traditional sectors such as farming, tailoring, and street vending
Life stages: Student, homemaker, retiree, and unemployed categories
Cultural characteristics: Family structure, local festivals, marriage traditions, and norms
Digital divide: Usage patterns modeled across urban/rural, age, and income differences
Language diversity: Each synthetic persona’s spoken languages cover a wide range of first, second, and third languages
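One way to picture how attributes such as these can be sampled jointly, so that each depends on the others and the aggregate follows a reference distribution, is ancestral sampling from a small probabilistic graphical model. The sketch below is illustrative only: the graph structure (area, then education, then digital usage) and every probability in it are made-up assumptions, not census figures and not the model used to build the dataset.

```python
import random
from collections import Counter

# Hypothetical probabilities for illustration only; NOT census figures.
P_AREA = {"rural": 0.7, "urban": 0.3}
P_EDUCATION_GIVEN_AREA = {
    "rural": {"primary": 0.5, "secondary": 0.4, "graduate": 0.1},
    "urban": {"primary": 0.2, "secondary": 0.5, "graduate": 0.3},
}
P_USAGE_GIVEN_EDUCATION = {
    "primary": {"low": 0.7, "high": 0.3},
    "secondary": {"low": 0.4, "high": 0.6},
    "graduate": {"low": 0.1, "high": 0.9},
}


def draw(dist, rng):
    """Draw one value from a {value: probability} table."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


def sample_attributes(rng):
    """Ancestral sampling: sample each node after its parent in the graph."""
    area = draw(P_AREA, rng)
    education = draw(P_EDUCATION_GIVEN_AREA[area], rng)
    usage = draw(P_USAGE_GIVEN_EDUCATION[education], rng)
    return {"area": area, "education": education, "digital_usage": usage}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_attributes(rng) for _ in range(20_000)]

# With enough samples, the marginal should track the reference table.
area_counts = Counter(s["area"] for s in samples)
rural_fraction = area_counts["rural"] / len(samples)
print(f"rural fraction in sample: {rural_fraction:.3f} (reference {P_AREA['rural']})")
```

Attribute records sampled this way could then be passed to a language model to write the narrative fields, which keeps the text grounded in the sampled demographics rather than in the model’s own priors.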

Private by design

No real names. No risk of re-identification.

All personas are fully synthetic. Although based on real-world distributions from the 2011 Census and parsed Indian electoral roll data, the data is not linked to any individual, living or dead. This allows developers to safely train AI systems without exposing themselves to privacy risks or regulatory barriers.

Who is this for?

Built for India, ready for the world

Nemotron-Personas-India is designed for developers building sovereign AI systems for the Indian market, as well as global teams looking to adapt models to India’s unique language, culture, and social context.

Today, most open datasets reflect English-speaking, Western norms, which limits AI performance in India’s multilingual, multiscript, and demographically complex environment.

Practical applications of AI

Nemotron-Personas-India allows teams to:

Generate diverse and realistic training data in Indian languages and scripts
Fine-tune models to capture local social, professional, and cultural nuances
Build region-aware AI agents that generalize to India’s many communities
Develop domain-specific co-pilots tailored to India’s professional and civic workflows
Create multilingual systems capable of handling complex multi-turn conversations and varying levels of digital fluency

Why is it important?

India’s 1.4 billion people speak hundreds of languages and live across vast cultural, economic and geographic divides. India’s National AI Portal estimates that more than 7,000 AI startups and research institutes are working on building regionally relevant AI systems, and government programs such as the Digital India Initiative and IndiaAI are accelerating adoption.

However, progress is constrained by a fundamental gap: culturally grounded, high-quality training data that reflects India’s demographic realities. Without representative datasets, AI systems struggle to code-switch between English and Hindi, fail to understand local occupational categories, and miss the cultural context essential to trust and adoption.

By reflecting India’s actual geographic and population distributions, this dataset improves the diversity of synthetically generated data, reduces bias, and helps prevent model collapse (the degradation that occurs when models are trained, without safeguards, on model-generated output).

Nemotron-Personas-India supports model builders in India in developing sovereign AI systems that incorporate important region-specific demographics and cultural backgrounds.

Start building with Nemotron-Personas-India

Do you want to build an AI system that understands India’s culture, language, and people?

To start experimenting today:

    from datasets import load_dataset

    # Each configuration selects one language/script variant of the dataset.
    nemotron_personas_en = load_dataset("nvidia/Nemotron-Personas-India", "en_IN")
    nemotron_personas_hi_deva = load_dataset("nvidia/Nemotron-Personas-India", "hi_Deva_IN")
    nemotron_personas_hi_latn = load_dataset("nvidia/Nemotron-Personas-India", "hi_Latn_IN")
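Once a record is loaded, a common next step is to turn it into a persona-conditioned prompt for generating training data. The helper below is a hypothetical sketch that works on a plain dict. The field names professional_persona, occupation, and state are assumptions made for illustration, so check the loaded dataset’s actual column names before relying on them. The example record is hand-written and not taken from the dataset.

```python
def build_system_prompt(record, persona_field="professional_persona"):
    """Turn one persona record (a plain dict) into a system prompt.

    The field names used here are assumed for illustration only.
    """
    persona_text = str(record.get(persona_field, "")).strip()
    if not persona_text:
        raise KeyError(f"record has no usable {persona_field!r} field")

    # Add whichever contextual attributes are present in the record.
    context = ", ".join(
        f"{key}: {record[key]}" for key in ("occupation", "state") if record.get(key)
    )

    prompt = f"You are role-playing the following persona.\n{persona_text}"
    if context:
        prompt += f"\nContext: {context}"
    return prompt


# Hand-written placeholder record, not taken from the dataset.
example_record = {
    "professional_persona": "A tailor who runs a small shop and trains apprentices.",
    "occupation": "tailor",
    "state": "Gujarat",
}

system_prompt = build_system_prompt(example_record)
print(system_prompt)
```

Keeping the persona field name as a parameter makes it straightforward to switch between the persona types listed earlier, for example a travel or culinary persona, once the real column names are known.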

Whether you are a model builder in India developing sovereign AI or a global developer seeking better regional adoption, Nemotron-Personas-India provides a trusted, privacy-preserving foundation for your applications.

Download it. Fine-tune with it. Build AI that understands India. If you’re ready to dig deeper, an extended version of Nemotron-Personas-India (including first names, last names, religion, synthetic addresses, and more) is available in NeMo Data Designer.
