Versa AI hub
Tools

Introducing synthetic data generators

December 17, 2024

The Synthetic Data Generator is a user-friendly application that takes a no-code approach to creating custom datasets with large language models (LLMs). The best part: a simple step-by-step process makes dataset creation technically straightforward, so anyone can build datasets and models in minutes without writing any code. A short demo video walks through the process.

What is synthetic data and why is it useful?

Synthetic data is information that is artificially generated to mimic real-world data. It lets you overcome data limitations by expanding or enriching your dataset.

From prompts to datasets to models

The Synthetic Data Generator takes a description of the data you need (a custom prompt) and runs a synthetic data pipeline that returns a dataset tailored to your use case. Behind the scenes, this is powered by distilabel and the free Hugging Face text-generation API, but you don’t need to worry about those details and can focus on the UI.
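The prompt-to-dataset flow can be sketched as a function that sends your dataset description to an LLM and collects the responses as rows. This is a minimal illustration, not the actual distilabel pipeline; the `llm` callable stands in for the Hugging Face text-generation API, and a stub is used here so the sketch runs without an API token.

```python
def generate_dataset(description: str, num_samples: int, llm) -> list[dict]:
    """Turn a dataset description (custom prompt) into rows by querying an LLM."""
    prompt = (
        "You are a synthetic data generator.\n"
        f"Dataset description: {description}\n"
        "Produce one new, realistic sample."
    )
    # One LLM call per sample; the real pipeline batches and deduplicates.
    return [{"id": i, "text": llm(prompt)} for i in range(num_samples)]

# Stub LLM so the sketch runs offline; swap in a real client in practice.
stub_llm = lambda prompt: "Example synthetic sample."
rows = generate_dataset("short news headlines about technology", 3, stub_llm)
print(len(rows))  # 3
```

In practice the real generation step replaces `stub_llm`; the surrounding structure (prompt template in, rows out) is the part the UI hides from you.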

Supported tasks

The tool currently supports text classification and chat datasets. The task determines the kind of dataset you generate: classification requires categories, while chat data requires conversations. Tasks such as evaluation and RAG will be added over time based on demand.

Text classification

Text classification is commonly used to categorize text such as customer reviews, social media posts, and news articles. Generating a classification dataset relies on two distinct LLM steps: first generate diverse texts, then add labels to them. A good example of a synthetic text classification dataset is argilla/synthetic-text-classification-news, which classifies synthetic news articles into eight different classes.
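The two LLM steps can be sketched as follows. The prompts and the stub models are illustrative stand-ins for what the tool does internally, so the example runs offline.

```python
import random

def generate_texts(topic: str, n: int, llm) -> list[str]:
    # Step 1: generate diverse candidate texts for the topic.
    return [llm(f"Write a short {topic} text, sample {i}.") for i in range(n)]

def label_texts(texts: list[str], labels: list[str], llm) -> list[dict]:
    # Step 2: ask the LLM to pick one of the allowed labels for each text.
    return [{"text": t, "label": llm(f"Classify into {labels}: {t}")} for t in texts]

# Stub LLMs standing in for real text-generation calls.
gen_llm = lambda prompt: f"news article ({prompt})"
cls_llm = lambda prompt: random.choice(["sports", "politics", "science"])

texts = generate_texts("news", 4, gen_llm)
dataset = label_texts(texts, ["sports", "politics", "science"], cls_llm)
```

Splitting generation and labeling into separate calls is what lets the texts stay diverse while the labels stay constrained to your category list.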

Chat dataset

This type of dataset can be used for supervised fine-tuning (SFT). SFT is the technique that teaches an LLM to handle conversational data, allowing users to interact with it through a chat interface. A good example of a synthetic chat dataset is argilla/synthetic-sft-customer-support-single-turn, which shows an LLM designed to handle customer support. In this example, the customer support topic is the Synthetic Data Generator itself.
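A chat dataset for SFT is typically stored as lists of messages with roles, in the common `messages` convention used on the Hub. A sketch of one single-turn record (the exact contents are illustrative):

```python
# One single-turn SFT record in the common "messages" format.
record = {
    "messages": [
        {"role": "system", "content": "You are a customer support assistant for the Synthetic Data Generator."},
        {"role": "user", "content": "How do I upload my generated dataset?"},
        {"role": "assistant", "content": "Press the Generate button; the output is pushed to the Hub automatically."},
    ]
}

# A dataset is simply a list of such records.
chat_dataset = [record]
roles = [m["role"] for m in record["messages"]]
print(roles)  # ['system', 'user', 'assistant']
```

Multi-turn datasets extend the same structure with alternating user/assistant messages.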

Typically, text classification and chat generation produce around 50 and 20 samples per minute, respectively. All of this is powered by the free Hugging Face API, but you can scale up by using your own account and choosing custom models, API providers, or generation configurations. We’ll discuss this later; first, let’s look at the basics.
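At those rates you can estimate generation time up front. A quick back-of-the-envelope helper, using the figures above:

```python
# Approximate free-API generation rates, in samples per minute (from the text above).
RATE_PER_MINUTE = {"text-classification": 50, "chat": 20}

def estimated_minutes(task: str, num_samples: int) -> float:
    """Rough generation time for a dataset of the given size."""
    return num_samples / RATE_PER_MINUTE[task]

print(estimated_minutes("text-classification", 500))  # 10.0
print(estimated_minutes("chat", 100))                 # 5.0
```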

Let’s generate our first dataset

Let’s create a basic chat dataset. When you access the generator, you must log in and grant the tool access to the organization where the dataset will be generated; this allows the tool to upload the finished dataset. If authentication fails, you can reset the connection at any time.

After logging in, the UI guides you through a simple three-step process.

1. Dataset description

First, enter a description of the dataset you want to create, including its use case, to help the generator understand your needs. Describe your goal and the type of assistant in as much detail as possible. Pressing the “Create” button generates a sample dataset, and you can proceed to step 2.

2. Configuration and adjustment

Adjust the generated system prompt, tune the task-specific settings, and review the generated sample dataset. This helps you get the specific results you are looking for. You can iterate on these configurations by clicking the Save button and regenerating the sample dataset. Once you are satisfied with your settings, proceed to step 3.

3. Generate and push

Enter a name for your dataset and the organization to push it to. You can also define the number of samples to generate and the temperature used for generation; the temperature controls the creativity of the output. Press the “Generate” button to start the full generation. The output is saved directly to Argilla and the Hugging Face Hub.
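The step-3 settings boil down to a handful of values. A small sketch of what such a configuration might look like, with a sanity check on the temperature (the field names are illustrative, not the tool’s actual schema):

```python
def make_generation_config(dataset_name: str, org: str, num_samples: int, temperature: float) -> dict:
    """Bundle the step-3 settings; a higher temperature means more creative output."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature should be between 0 and 2")
    return {
        "repo_id": f"{org}/{dataset_name}",   # where the dataset is pushed
        "num_samples": num_samples,
        "temperature": temperature,
    }

cfg = make_generation_config("my-chat-data", "my-org", 100, temperature=0.9)
print(cfg["repo_id"])  # my-org/my-chat-data
```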

You can now directly access the generated dataset by clicking the “Open in Argilla” button.

Check the dataset

Even when working with synthetic data, it is important to understand and view the data. That’s why we created a direct integration with Argilla, a collaboration tool for AI engineers and domain experts to build high-quality datasets. This enables you to effectively explore and evaluate synthetic datasets through powerful features such as semantic search and configurable filters. You can learn more about them in this guide. You can then export your selected dataset to Hugging Face Hub and use it to continue fine-tuning your model.

Training the model

Don’t worry; you can also train a powerful AI model without any code, using AutoTrain. To understand AutoTrain, refer to its documentation. Create your own AutoTrain deployment and log in, just as you did for the Synthetic Data Generator.

Remember the argilla/synthetic-text-classification-news dataset from earlier? Let’s train a model that can correctly classify those examples. Select the task “Text Classification” and specify the correct “Dataset Source”. Next, choose a suitable project name and press play. You can ignore the cost warning pop-up: the free Hugging Face CPU hardware is enough for this text classification example.

Now, in a few minutes, you’ll have your own model. All you need to do now is deploy it as a live service or use it as a text classification pipeline with minimal Python code.
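Using the trained model as a text classification pipeline takes only a few lines with the transformers library. The model id below is a placeholder for your own trained repository, and the classifier is injectable so the helper can be exercised without downloading a model.

```python
def classify(texts, classifier=None, model_id="your-username/synthetic-news-classifier"):
    """Return predicted labels; loads a transformers pipeline unless one is injected."""
    if classifier is None:
        # Imported lazily; requires `pip install transformers` and a trained model repo.
        from transformers import pipeline
        classifier = pipeline("text-classification", model=model_id)
    return [result["label"] for result in classifier(texts)]

# With a stub classifier the helper runs offline:
stub = lambda texts: [{"label": "technology", "score": 0.98} for _ in texts]
print(classify(["New GPU breaks training records"], classifier=stub))  # ['technology']
```

Swap the stub for the real pipeline (or deploy the model as a live endpoint) once training finishes.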

Advanced features

Although you can move from a prompt to a dedicated model without knowing anything about coding, you may prefer the option to customize and extend your deployment with more advanced technical features.

Increased speed and accuracy

You can improve speed and accuracy by creating your own deployment of the tool and configuring it to use different parameters or models. First, clone the Synthetic Data Generator. Be sure to create it as a private Space so that others cannot access it. Next, you can change the default values of some environment variables. Let’s look at a few scenarios.

  • Use another free Hugging Face model. Change MODEL from its default meta-llama/Llama-3.1-8B-Instruct to another model, such as meta-llama/Llama-3.1-70B-Instruct.
  • Use OpenAI models. Set BASE_URL to https://api.openai.com/v1/ and MODEL to gpt-4o.
  • Generate more samples per minute. Increase BATCH_SIZE from its default of 5 to a higher value, such as 10. Keep in mind that your API provider may limit the number of requests per minute.
  • Use a private Argilla instance. Set ARGILLA_URL and ARGILLA_API_KEY to the URL and API key of your own Argilla instance.
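A deployment could read these environment variables with sensible defaults along the following lines. The variable names match the scenarios above; the helper itself is a sketch, not the tool’s actual configuration code.

```python
import os

def load_settings(env=os.environ) -> dict:
    """Collect the deployment configuration from environment variables."""
    return {
        "model": env.get("MODEL", "meta-llama/Llama-3.1-8B-Instruct"),
        "base_url": env.get("BASE_URL"),            # e.g. https://api.openai.com/v1/
        "batch_size": int(env.get("BATCH_SIZE", "5")),
        "argilla_url": env.get("ARGILLA_URL"),
        "argilla_api_key": env.get("ARGILLA_API_KEY"),
    }

# Overriding the defaults, as in the OpenAI scenario above:
settings = load_settings({"MODEL": "gpt-4o", "BASE_URL": "https://api.openai.com/v1/", "BATCH_SIZE": "10"})
print(settings["batch_size"])  # 10
```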

Local deployment

In addition to the hosted version on Hugging Face Spaces, the tool is available as open source under the Apache 2 license. This means you can find it on GitHub and use, modify, and adapt it as you see fit. It can be installed as a Python package with a simple pip install synthetic-dataset-generator. Set the appropriate environment variables when launching it.

Pipeline customization

Each synthetic data pipeline is based on distilabel, an open-source framework for synthetic data and AI feedback. The great thing about pipeline code is that it is shareable and reproducible. For example, you can find the pipeline for the argilla/synthetic-text-classification-news dataset in its repository on the Hub, along with many other distilabel datasets and their pipelines.

What’s next?

The Synthetic Data Generator already offers many features that data and model enthusiasts will find useful. Still, there are some interesting directions for improvement tracked on GitHub, so feel free to contribute, star the repository, or open an issue. Here’s what we’re working on:

  • Retrieval-Augmented Generation (RAG)
  • Custom evaluations with LLMs as judges

Start synthesizing!
