Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with large language models (LLMs). The best part: a simple step-by-step process makes dataset creation accessible to everyone, allowing anyone to build datasets and models in minutes without writing any code. For a quick overview, check out the short demo video.
What is synthetic data and why is it useful?
Synthetic data is information that is artificially generated to mimic real-world data. It allows you to overcome data limitations by expanding or enriching your dataset.
From prompts to datasets to models
The Synthetic Data Generator takes a description of the data you need (a custom prompt) and returns a dataset tailored to your use case, using a synthetic data pipeline. Behind the scenes, this is powered by distilabel and the free Hugging Face text generation API, but you don’t need to worry about these complexities and can focus on using the UI.
Supported tasks
The tool currently supports text classification and chat datasets. The chosen task determines the type of dataset you generate: classification requires categories, while chat data requires conversations. More tasks, such as evaluation and RAG, will be added over time based on demand.
Text classification
Text classification is commonly used to categorize text such as customer reviews, social media posts, and news articles. Generating a classification dataset relies on two different LLM steps: first generating diverse texts, and then adding labels to them. A good example of a synthetic text classification dataset is argilla/synthetic-text-classification-news, which classifies synthetic news articles into eight different classes.
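To make this concrete, here is a minimal sketch of peeking at such a dataset with the datasets library; the "text" and "label" column names shown in the comment are assumptions about a typical classification layout and may differ in the actual dataset.

```python
# Minimal sketch: load the synthetic classification dataset and inspect a record.
# The "text"/"label" column names are an assumption about a typical layout.
from datasets import load_dataset

ds = load_dataset("argilla/synthetic-text-classification-news", split="train")
print(ds[0])  # e.g. {"text": "A generated news article ...", "label": "sports"}
```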
Chat dataset
Chat datasets can be used for supervised fine-tuning (SFT). SFT is the technique that teaches LLMs to handle conversational data, allowing users to interact with them through a chat interface. A good example of a synthetic chat dataset is argilla/synthetic-sft-customer-support-single-turn, which shows what an LLM designed for customer support might be trained on. In this example, the customer support topic is the Synthetic Data Generator itself.
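To illustrate the shape of such data, here is a small sketch of the conversational format SFT frameworks commonly expect; the field names are illustrative assumptions, and the actual columns in the dataset may differ.

```python
# A sketch of the messages format commonly used for SFT: each sample is a short
# conversation with explicit roles. Field names are illustrative assumptions.
sample = {
    "messages": [
        {
            "role": "user",
            "content": "How do I push my generated dataset to the Hugging Face Hub?",
        },
        {
            "role": "assistant",
            "content": "In step 3, enter a dataset name and organization, then press Generate.",
        },
    ]
}

print(sample["messages"][0]["content"])
```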
Typically, text classification and chat can generate 50 and 20 samples per minute, respectively. All of this is powered by the free Hugging Face API, but you can scale this up by using your own account and choosing custom models, API providers, or generation configurations. We’ll discuss this later, but first let’s look at the basics.
Let’s generate our first dataset
Let’s create a basic chat dataset. When you access the generator, you first need to log in and grant the tool access to the organizations you want to push generated datasets to. This allows the tool to upload datasets on your behalf. If something goes wrong with authentication, you can reset the connection at any time.
After logging in, the UI guides you through a simple three-step process.
1. Dataset description
First, enter a description of the dataset you want to create, including its use case, so the generator understands your needs. Describe your goal and the type of assistant in as much detail as possible. Pressing the “Create” button generates a sample dataset, and you can then proceed to step 2.
2. Configuration and adjustment
Next, review the generated system prompt, adjust it along with the task-specific settings, and refine the generated sample dataset. This helps you steer toward the specific results you are looking for. You can iterate on these configurations by clicking the Save button and regenerating the sample dataset. Once you are satisfied with your settings, proceed to step 3.
3. Generate and push
Enter a name for your dataset and the organization to push it to. You can also define the number of samples to generate and the temperature used for generation; the temperature controls the creativity of the generations. Press the “Generate” button to start the full generation. The output is saved directly to Argilla and the Hugging Face Hub.
You can now directly access the generated dataset by clicking the “Open in Argilla” button.
Check the dataset
Even when working with synthetic data, it is important to look at and understand your data. That’s why we created a direct integration with Argilla, a collaboration tool for AI engineers and domain experts to build high-quality datasets. It enables you to effectively explore and evaluate synthetic datasets through powerful features such as semantic search and configurable filters. You can learn more about them in this guide. Afterwards, you can export your curated dataset to the Hugging Face Hub and continue with fine-tuning your model.
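As an illustration, here is a hedged sketch of exploring the generated records with the Argilla Python SDK rather than the UI; the URL, API key, and dataset name are placeholders, and the calls follow the Argilla 2.x SDK as we understand it, so verify them against the Argilla documentation.

```python
# Hedged sketch: connect to your Argilla instance and search the generated
# records from Python. URL, API key, and dataset name are placeholders.
import argilla as rg

client = rg.Argilla(
    api_url="https://<your-argilla-space>.hf.space",
    api_key="<your-api-key>",
)

dataset = client.datasets(name="synthetic-sft-customer-support")  # hypothetical name

# Keyword search over the records; swap in your own query.
for record in dataset.records(query="refund"):
    print(record.fields)
```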
Training the model
Don’t worry if you’ve never trained a model before: you can now create powerful AI models without any code using AutoTrain. To learn more about AutoTrain, refer to its documentation. Now, create your own AutoTrain deployment and log in, just as you did previously for the Synthetic Data Generator.
Remember the argilla/synthetic-text-classification-news dataset from earlier? Let’s train a model that can correctly classify these examples. Select the task “Text Classification” and specify the correct “Dataset Source”. Next, choose a suitable project name and press play. You can ignore the pop-up warning about hardware costs, because the basic Hugging Face CPU hardware is free and is enough for this text classification example.
Within a few minutes, you’ll have your own model. All you need to do now is deploy it as a live service or use it as a text classification pipeline with minimal Python code.
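For example, here is a minimal sketch of using the trained classifier as a text classification pipeline with transformers; the model ID is a placeholder for the repository AutoTrain created for your project.

```python
# Minimal sketch: run the AutoTrain-produced classifier locally.
# Replace the model ID with the repository created for your project.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="<your-username>/<your-autotrain-project>",  # placeholder model ID
)

print(classifier("The central bank unexpectedly cut interest rates on Tuesday."))
# e.g. [{"label": "business", "score": 0.97}]  (labels depend on your dataset)
```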
Advanced features
Although you can go from a prompt to a dedicated model without writing any code, you may prefer the option to customize and extend your deployment with more advanced technical features.
Increased speed and accuracy
You can improve speed and accuracy by creating your own deployment of the tool and configuring it to use different models or generation parameters. First, duplicate the Synthetic Data Generator Space. Be sure to create it as a private Space so that others cannot access it. Next, you can change the default values of some environment variables. Let’s look at a few scenarios.
Use another free Hugging Face model: change MODEL from its default value meta-llama/Llama-3.1-8B-Instruct to another model, such as meta-llama/Llama-3.1-70B-Instruct.
Use OpenAI models: set BASE_URL to https://api.openai.com/v1/ and MODEL to gpt-4o.
Generate more samples per minute: increase BATCH_SIZE from its default value of 5 to a higher value, such as 10. Keep in mind that your API provider may limit the number of requests per minute.
Use a private Argilla instance: set ARGILLA_URL and ARGILLA_API_KEY to the URL and API key of your own free Argilla instance.
Local deployment
In addition to hosting this tool on Hugging Face Spaces, we also offer it as an open source tool under the Apache 2 license. This means you can head over to GitHub and use, modify, and adapt it as you see fit. It can be installed as a Python package with a simple pip install synthetic-dataset-generator, and configured by setting the appropriate environment variables.
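For instance, here is a minimal local-launch sketch in Python; the launch() entry point is assumed from the project README as we understand it, and the environment variables mirror the scenarios above, so treat the exact names as assumptions to verify against the repository.

```python
# Hedged sketch of a local launch. The environment variables mirror the
# scenarios above; the launch() entry point is assumed from the project README
# and may differ between versions.
import os

# Configure the generator before importing it so the settings are picked up.
os.environ["MODEL"] = "meta-llama/Llama-3.1-8B-Instruct"   # generation model
os.environ["BATCH_SIZE"] = "5"                              # samples per batch
# os.environ["ARGILLA_URL"] = "https://<your-argilla-instance>"   # optional
# os.environ["ARGILLA_API_KEY"] = "<your-api-key>"                # optional

from synthetic_dataset_generator import launch

launch()  # starts the web UI locally
```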
Pipeline customization
Each synthetic data pipeline is based on distilabel, an open source framework for synthetic data and AI feedback. The great thing about the pipeline code is that it is shareable and reproducible. For example, you can find the pipeline for the argilla/synthetic-text-classification-news dataset within its repository on the Hub, along with many other distilabel datasets and their pipelines.
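To give a feel for what such pipeline code looks like, here is a small, hedged distilabel sketch; it is not the exact pipeline behind argilla/synthetic-text-classification-news (that one lives in the dataset’s Hub repository), and module paths may vary across distilabel versions.

```python
# Hedged sketch of a tiny distilabel pipeline: generate text from a seed
# instruction with a hosted model. Not the exact pipeline used for the
# argilla/synthetic-text-classification-news dataset.
from distilabel.llms import InferenceEndpointsLLM  # distilabel.models in newer versions
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="toy-synthetic-news") as pipeline:
    load = LoadDataFromDicts(
        data=[{"instruction": "Write a short synthetic news article about technology."}]
    )
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Llama-3.1-8B-Instruct")
    )
    load >> generate  # connect the steps

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    distiset.push_to_hub("<your-username>/toy-synthetic-news")  # placeholder repo
```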
What’s next?
The Synthetic Data Generator already offers many great features for data and model enthusiasts. Still, there are some interesting directions for improvement tracked on GitHub, so feel free to contribute, star the repo, or open an issue. Here’s what we’re working on:
Retrieval-Augmented Generation (RAG)
Custom evaluations with LLMs as judges
Get started synthesizing!