Nemotron-personas-japan: Synthetic dataset for sovereign AI

Composite AI approach for Japanese individuals based on real-world distribution

Open data for the future of AI in Japan

Until now, it has been nearly possible to build an AI that truly understands Japanese culture without high quality and diverse training data. We provide a platform that balances privacy protection and regulatory compliance for system construction.

Created using NVIDIA’s NeMo Data Designer, an enterprise synthetic data generation system, Nemotron-Persona-Japan was developed as a Japanese version following the success of the nearly universally used US Persona dataset. This release is the first in a global collection of synthetic persona datasets and data construction methods to support sovereign AI development in countries and regions.

This dataset is designed to preferentially work with open source large-scale language models (LLMs), including the Nemotron model, and is easily amenable to fine-tuning for Japanese AI applications, from enterprise chatbots to AI agents in various domains.

Dataset contents

A total of 6 million personas (6 personas for each record, 1 million records) written in natural Japanese 22 items per record: 6 persona-related items and 16 context items based on official demographic and labor statistics Total number of entries: Approximately 1.4 billion: Approximately 850 million persona-related unique names: Approximately 950,000 persona-related unique names: Generated synthetic data to intensively cover 1,500 regional, regional, demographic, and personality trait axes reflecting unprecedented diversity in the domestic workforce Diverse persona types: Occupation, Sports, Arts, Travel, Culinary Arts Natural language persona attributes: cultural background, skills and expertise, career goals and aspirations, hobbies and interests Established under CC BY 4.0 license, available for commercial and non-commercial use

How to build Nemotron-Persona-Japan

data generation pipeline

It is built using NeMo Data Designer, NVIDIA’s microservice for synthetic data generation. This complex AI system enables complex Jinja templates, validation with Pydantic, structured automatic output, retries, and support for multiple generation backends. These are the tools needed to generate such large synthetic datasets. In addition, we also utilize the following models:

Probabilistic cal model for statistics-based graffita generation (Apache-2.0) GPT-OSS-120B for Japanese sentence generation (Apache-2.0)

Looking back at the background of Japanese culture

Nemotron-Personas-Japan was created at the moment designed to align with Japan’s official demographic and labor statistics, taking into account the following important points regarding AI training:

Education: Where degree levels are grouped together in national statistics, we have introduced finer distinctions to allow models to reflect different educational pathways. Occupations: We’ve included additional categories (such as Employer and Specialization) to expand the range of occupations used in training. The system could better reflect region-specific rights. Digital Divide: Considering the comparison of digital literacy by age group, reflecting the actual state of technology usage in Japan.

Designed to protect your privacy

This dataset does not contain any personally identifiable information (PII).

Assumed user

Nemotron-Persona-Japan is designed for Japanese model developers developing Japanese sovereign AI systems. Currently, most training data used by LLM developers is in English, and developers in regions such as Japan and India struggle to obtain high-quality data in their native languages.

Including this dataset, NVIDIA’s Nemotron-Persona continuation solution directly addresses the problem at hand. We help developers generate diverse and complex data in local languages while capturing local nuances.

Therefore, we hope that it will be helpful for all AI model developers to expand the adoption of their models in Japan and understand the Japanese cultural context.

Utilization for practical AI applications

You can use the synthetic personas included in this dataset to:

Multi-turn conversation synthesis: Utilizing personas as “seeds” to create human-like dialogue datasets Developing domain-specific AI assistants: Bias testing and fairness to create datasets for building culturally sensitive AI assistants: Assessing how models and AI agent systems function across overlaps, such as rural and urban, different age groups, or diverse educational levels, to achieve AI that works fairly for all segments of Japanese society

The importance of synthetic persona data

Access to diverse, high-quality training data that reflects people in the real world has long been a challenge for AI development. Enterprise AI development has been dominated by private data, creating a barrier for researchers, startups, and especially AI developers in regions with less available data.

Data diversity: Reflecting the entire Japanese population prevents biased learning and model collapse. Cultural authenticity: Reduce dependence on Western-centric datasets and support the development of sovereign AI systems. Privacy and compliance: Meet Japan’s Personal Information Protection Act (PIPA) requirements and future AI governance.

By publishing Nemotron-Personas-Japan under CC BY 4.0, we’re giving anyone access to high-quality, enterprise-grade synthetic data to build AI systems that accurately reflect their cultural backgrounds, without traditional barriers such as cost, privacy strategies, or geographic barriers.

Use it now

You can download this dataset using the following command. We are currently developing an AI that truly understands Japanese culture and language.

from dataset import load dataset ds = load dataset(“nvidia/Nemotron-Persona-Japan”)

Examples of usage for building production applications:

Leverage personas as seeds for conversation generation Fine-tune models with data that reflects cultural context Build a personalized engine that reflects the entire national demographic Develop domain-specific AI agents with national context

From model developers building Sovereign AI in Japan to global developers looking to reach a broader region, the Nemotron-Personas-Japan dataset provides the professional, privacy-friendly foundation your applications need.

versatileai

See Full Bio

What's Hot

Pocket FM and OpenAI partner on content production: Rediff Moneynews

Gemini 2.5 Pro Preview: Even better coding performance

Build physical AI using virtual simulation data

Gemini 2.5 Pro Preview: Even better coding performance

Build physical AI using virtual simulation data

How NVIDIA builds open data for AI

Gemini’s Security Safeguard Advance – Google DeepMind

Wix Get 1 hour to expand generative AI capabilities and accelerate product innovation – TradingView News

Competitive programming with AlphaCode-Google Deepmind

Most Popular

Gemini’s Security Safeguard Advance – Google DeepMind

Wix Get 1 hour to expand generative AI capabilities and accelerate product innovation – TradingView News

Competitive programming with AlphaCode-Google Deepmind

Don't Miss

Pocket FM and OpenAI partner on content production: Rediff Moneynews

Gemini 2.5 Pro Preview: Even better coding performance

Build physical AI using virtual simulation data

Subscribe to Updates

What's Hot

Nemotron-personas-japan: Synthetic dataset for sovereign AI

Open data for the future of AI in Japan

Dataset contents

How to build Nemotron-Persona-Japan

data generation pipeline

Assumed user

Utilization for practical AI applications

The importance of synthetic persona data

Use it now

Related Posts