Introducing the Synthetic Data Generator

December 17, 2024

The Synthetic Data Generator is a user-friendly application that takes a no-code approach to creating custom datasets with large language models (LLMs). The best part: a simple step-by-step process makes dataset creation non-technical, so anyone can build datasets and models in minutes without writing any code. A short demo video shows the process in action.

What is synthetic data and why is it useful?

Synthetic data is information that is artificially generated to mimic real-world data. It lets you overcome data limitations by expanding or enriching existing datasets.

From prompts to datasets to models

The Synthetic Data Generator takes a description of the data you need (a custom prompt) and returns a dataset that fits your use case, using a synthetic data pipeline. Behind the scenes this is powered by distilabel and the free Hugging Face text generation API, but you don’t have to worry about these complexities and can focus on using the UI.
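
For a sense of what that underlying API looks like, here is a minimal sketch that calls the free serverless text generation API directly via huggingface_hub (the model id is just an example, and you may need to be logged in with a Hugging Face token; the generator handles all of this for you):

```python
from huggingface_hub import InferenceClient

# Call the free serverless text generation API directly
# (the generator does this behind the scenes)
client = InferenceClient("meta-llama/Llama-3.1-8B-Instruct")
headline = client.text_generation(
    "Write a one-sentence tech news headline.",
    max_new_tokens=40,
)
print(headline)
```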

Supported tasks

The tool currently supports text classification and chat datasets. The task determines the type of dataset you generate: classification requires categories, while chat data requires conversations. Tasks such as evaluation and retrieval-augmented generation (RAG) will be added over time based on demand.

Text classification

Text classification is commonly used to categorize text such as customer reviews, social media posts, and news articles. Generating a classification dataset relies on two distinct LLM steps: first generating diverse texts, and then adding labels to them. A good example of a synthetic text classification dataset is argilla/synthetic-text-classification-news, which classifies synthetic news articles into eight different classes.
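
If you want to inspect that example dataset yourself, it can be loaded with the datasets library (a minimal sketch; the "train" split is an assumption about how the dataset is organized):

```python
from datasets import load_dataset

# Load the public example dataset from the Hugging Face Hub
ds = load_dataset("argilla/synthetic-text-classification-news", split="train")

# Look at one synthetic news article and its label
print(ds[0])
```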

Chat dataset

This type of dataset can be used for supervised fine-tuning (SFT). SFT is the technique used to tune LLMs on conversational data, which is what allows users to interact with them through a chat interface. A good example of a synthetic chat dataset is argilla/synthetic-sft-customer-support-single-turn, which shows data for an LLM designed to handle customer support; in this example, the customer support topic is the Synthetic Data Generator itself.
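
For illustration, a single-turn SFT record commonly looks like the following (a hypothetical example in the widely used "messages" chat format; the dataset's actual schema may differ):

```python
# A hypothetical single-turn record in the common "messages" chat format
example = {
    "messages": [
        {"role": "user", "content": "How do I change the generation temperature?"},
        {
            "role": "assistant",
            "content": "In step 3, 'Generate and push', adjust the temperature before pressing 'Generate'.",
        },
    ]
}
```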

Typically, text classification and chat generation produce around 50 and 20 samples per minute, respectively. All of this runs on the free Hugging Face API, but you can scale it up by using your own account and choosing custom models, API providers, or generation configurations. We’ll discuss this later; first, let’s look at the basics.

Let’s generate our first dataset

We’ll create a basic chat dataset. When you access the generator, you must log in and give the tool access to the organization under which the dataset will be created; this allows the tool to upload the generated dataset on your behalf. If authentication fails, you can reset the connection at any time.

After logging in, the UI guides you through a simple three-step process.

1. Dataset description

First, enter a description of the dataset you want to create, including its use case, to help the generator understand your needs. Describe your goal and the type of assistant in as much detail as possible. Pressing the “Create” button generates a sample dataset, and you can proceed to step 2.

2. Configuration and adjustment

Here you can adjust the generated system prompt, tweak task-specific settings, and inspect the generated sample dataset. This helps you get exactly the results you are looking for. You can iterate on these configurations by clicking the “Save” button and regenerating the sample dataset. Once you are satisfied with your settings, proceed to step 3.

3. Generate and push

Enter a name for your dataset and the organization to push it to. Additionally, you can define the number of samples to generate and the temperature used for generation; the temperature controls the creativity of the generations. Press the “Generate” button to start the full generation. The output is saved directly to Argilla and the Hugging Face Hub.

You can now directly access the generated dataset by clicking the “Open in Argilla” button.

Check the dataset

Even when working with synthetic data, it is important to understand and look at your data. That’s why we created a direct integration with Argilla, a collaboration tool for AI engineers and domain experts to build high-quality datasets. It lets you effectively explore and evaluate the synthetic dataset through powerful features such as semantic search and configurable filters; you can learn more about them in this guide. Afterwards, you can export the curated dataset to the Hugging Face Hub and use it to fine-tune your model.
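
As a rough sketch of that workflow with the Argilla Python SDK (the URL, API key, and dataset names below are placeholders, and the exact export call may differ between Argilla SDK versions):

```python
import argilla as rg

# Connect to your Argilla instance (placeholder URL and API key)
client = rg.Argilla(
    api_url="https://your-argilla-instance.example",
    api_key="YOUR_API_KEY",
)

# Fetch the synthetic dataset that the generator pushed (placeholder name)
dataset = client.datasets(name="my-synthetic-dataset")

# After reviewing records in the UI, export the curated dataset to the Hub
dataset.to_hub(repo_id="my-org/my-synthetic-dataset")
```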

Training the model

Don’t worry: with AutoTrain you can now train powerful AI models without any code, too. To get familiar with AutoTrain, refer to its documentation. Then create your own AutoTrain deployment and log in, just as you did for the Synthetic Data Generator.

Remember the argilla/synthetic-text-classification-news dataset from earlier? Let’s train a model that can correctly classify those examples. Select the task “Text Classification” and specify the correct “Dataset Source”. Next, choose a suitable project name and press play. You can ignore the pop-up warning about estimated costs, because we will train on the free Hugging Face CPU hardware, which is enough for this text classification example.

In a few minutes, you’ll have your own model. All you need to do now is deploy it as a live service or use it in a text classification pipeline with minimal Python code, as sketched below.
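
For example, running the trained classifier locally with the transformers library might look like this (the model id is a placeholder for your AutoTrain output repository, and the labels depend on your dataset):

```python
from transformers import pipeline

# "my-org/my-news-classifier" is a placeholder for your AutoTrain output repo
classifier = pipeline("text-classification", model="my-org/my-news-classifier")

print(classifier("The central bank announced a surprise rate cut on Tuesday."))
# e.g. [{'label': 'finance', 'score': 0.97}]  (labels depend on your dataset)
```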

Advanced features

Although you can go from a prompt to a dedicated model without knowing anything about coding, you may prefer to customize and extend your deployment with more advanced technical features.

Increased speed and accuracy

You can improve speed and accuracy by creating your own deployment of the tool and configuring it to use different parameters or models. First, clone the Synthetic Data Generator; be sure to create it as a private Space so that others cannot access it. Next, you can change the default values of some environment variables. Let’s look at some scenarios (a code sketch of these overrides follows the list):

• Use another free Hugging Face model: change MODEL from its default value meta-llama/Llama-3.1-8B-Instruct to another model, such as meta-llama/Llama-3.1-70B-Instruct.
• Use an OpenAI model: set BASE_URL to https://api.openai.com/v1/ and MODEL to gpt-4o.
• Generate more samples per minute: increase BATCH_SIZE from its default value of 5 to a higher value, such as 10. Keep in mind that your API provider may limit the number of requests per minute.
• Use a private Argilla instance: set ARGILLA_URL and ARGILLA_API_KEY to the URL and API key of your Argilla instance.
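
If you run the tool yourself (see the next section), these overrides can be set as environment variables before launch; a minimal sketch with illustrative values:

```python
import os

# Illustrative values for the environment variables described above
os.environ["MODEL"] = "meta-llama/Llama-3.1-70B-Instruct"  # larger free model
os.environ["BATCH_SIZE"] = "10"                            # more samples per minute

# For an OpenAI model instead:
# os.environ["BASE_URL"] = "https://api.openai.com/v1/"
# os.environ["MODEL"] = "gpt-4o"

# For a private Argilla instance (placeholder values):
# os.environ["ARGILLA_URL"] = "https://your-argilla-instance.example"
# os.environ["ARGILLA_API_KEY"] = "YOUR_API_KEY"
```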

Local deployment

In addition to hosting this tool on Hugging Face Spaces, we also offer it as an open source tool under the Apache 2 license, which means you can head over to GitHub and use, modify, and adapt it as you see fit. It can be installed as a Python package with a simple pip install synthetic-dataset-generator; set the appropriate environment variables before launching it.
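
A minimal local setup might look like the following sketch (it assumes the package exposes a launch() entry point and that a Hugging Face token is required; both are assumptions worth checking against the project README):

```python
import os

# A Hugging Face token is assumed to be needed for the free generation API
os.environ["HF_TOKEN"] = "hf_..."  # placeholder token

# Assumed entry point; check the project README for the exact API
from synthetic_dataset_generator import launch

launch()  # starts the app locally in your browser
```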

Pipeline customization

Each synthetic data pipeline is based on distilabel, an open source framework for synthetic data and AI feedback. The great thing about the pipeline code is that it is shareable and reproducible. For example, you can find the pipeline for the argilla/synthetic-text-classification-news dataset in its repository on the Hub, alongside many other distilabel datasets and their pipelines.
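
To give a flavor of pipeline code, here is a minimal, illustrative distilabel pipeline (not the actual pipeline behind the tool; class names follow distilabel’s 1.x API and the repo id is a placeholder):

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="tiny-synthetic-demo") as pipeline:
    # Seed the pipeline with a single instruction
    load = LoadDataFromDicts(
        data=[{"instruction": "Write a short tech news headline."}]
    )
    # Generate a completion with a free serverless model
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Llama-3.1-8B-Instruct")
    )
    load >> generate

if __name__ == "__main__":
    distiset = pipeline.run()
    distiset.push_to_hub("my-org/tiny-synthetic-demo")  # placeholder repo id
```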

What’s next?

The Synthetic Data Generator already offers many great features for data and model enthusiasts. Still, there are some interesting directions for improvement tracked on GitHub, so feel free to contribute, star the repo, or open an issue. Here’s what we’re working on:

• Retrieval-augmented generation (RAG)
• Custom evaluations with LLMs as judges

Start synthesizing!
