The Data is Better Together community is releasing yet another important dataset for open source development. Because open preference datasets for text-to-image generation are scarce, we set out to release one under an Apache 2.0 license. The dataset focuses on text-to-image preference pairs across common image generation categories, mixing different model families and varying prompt complexities.
TL;DR? All results can be found in this collection on the Hugging Face Hub, and the pre- and post-processing code can be found in this GitHub repository. Most importantly, there is a ready-to-use preference dataset and a flux-dev-lora-finetune. If you already want to show your support, don’t forget to like, subscribe, and follow us before you continue reading.
Not familiar with the Data is Better Together community?
[Data is Better Together](https://huggingface.co/data-is-better-together) is a collaboration between 🤗 Hugging Face and the open source AI community. We aim to empower the open source community to build impactful datasets collaboratively. Follow the organization to stay up to date on the latest datasets, models, and community sprints.
Similar efforts
Although there have been several previous efforts to create open image preference datasets, ours is unique because of the openness of both the dataset and the code used to create it, as well as the varying complexity and categories of its prompts. Some of these efforts are listed below:
- [yuvalkirstain/pickapic_v2](https://huggingface.co/datasets/yuvalkirstain/pickapic_v2)
- [fal.ai/imgsys](https://imgsys.org/)
- [TIGER-Lab/GenAI-Arena](https://huggingface.co/spaces/TIGER-Lab/GenAI-Arena)
- [Artificial Analysis image arena](https://artificialanalysis.ai/text-to-image/arena)
Input dataset
To get a suitable input dataset for this sprint, we started with some base prompts, which we cleaned, filtered for harmfulness, and enriched with categories and complexities using synthetic data generation with distilabel. Finally, images were generated with a Stable Diffusion model and a Flux model. This produced the open-image-preferences-v1 dataset.
Input prompts
Imgsys is a generative image model arena hosted by fal.ai, where users provide prompts and choose their preference between two model generations. Unfortunately, the generated images are not publicly available, but the associated prompts are hosted on Hugging Face. These prompts represent real-world usage of image generation, including good examples focused on everyday generation, but that real-world usage also includes duplicate and harmful prompts, which meant we had to look at the data and do some filtering.
Reducing toxicity
We aimed to remove all NSFW prompts and images from the dataset before launching the community sprint. We settled on a multi-model approach that used two text-based classifiers and two image-based classifiers as filters. After filtering, we manually checked each image to ensure that no harmful content remained; luckily, our approach proved to be successful.
We used the following pipeline:
- Classify images as NSFW
- Remove all positive samples
- The Argilla team manually reviews the dataset
- Iterate based on the review
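For illustration, a single image-based filter of this kind can be built with a transformers image-classification pipeline. This is a minimal sketch; the specific checkpoint and threshold below are examples, not necessarily the classifiers used in the sprint (those are documented in the GitHub repository).

```python
from PIL import Image
from transformers import pipeline

# Example image-based NSFW classifier from the Hub (an assumption, see above).
nsfw_classifier = pipeline(
    "image-classification", model="Falconsai/nsfw_image_detection"
)

def is_nsfw(image: Image.Image, threshold: float = 0.5) -> bool:
    """Flag an image when the 'nsfw' label scores above the threshold."""
    scores = {pred["label"]: pred["score"] for pred in nsfw_classifier(image)}
    return scores.get("nsfw", 0.0) >= threshold

# Remove all positive samples before the manual review step:
# kept_images = [img for img in images if not is_nsfw(img)]
```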
Synthetic prompt enhancement
Because data diversity is important for data quality, we decided to enrich the dataset by synthetically rewriting prompts with different categories and complexities. This was done with a distilabel pipeline, which produced variants like the ones in the table below.
| Prompt version | Prompt |
| --- | --- |
| Default | a harp without strings |
| Style-enhanced | An anime-style harp without strings, with intricate details and flowing lines, set against a dreamy pastel background. |
| Complexity-enhanced | An anime-style harp without strings, with intricate details and flowing lines, set against a dreamy pastel background and illuminated by soft golden light, with a gentle mood and rich textures, high resolution and photorealistic. |

(The images that accompanied each prompt in the original table are omitted here.)
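A minimal distilabel sketch of such a rewriting step might look as follows. The LLM choice and the rewrite instruction are assumptions for illustration, not the sprint's exact setup; the real pipeline lives in the GitHub repository.

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="prompt-enhancement") as pipeline:
    # Hypothetical rewrite instruction wrapping a base prompt.
    load_prompts = LoadDataFromDicts(
        data=[{
            "instruction": (
                "Rewrite this image-generation prompt in an anime style and "
                "increase its complexity, keeping the core subject intact: "
                "'a harp without strings'"
            )
        }]
    )
    # TextGeneration reads the 'instruction' column and emits a 'generation'.
    enhance = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"  # example model
        )
    )
    load_prompts >> enhance

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
```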
Prompt categories
InstructGPT describes basic task categories for text-to-text generation, but there is no equivalent, clear-cut set of task categories for text-to-image generation. To alleviate this, we used two main sources as input for our categories: google/sdxl and Microsoft. This produced the following main categories: "Movies", "Photography", "Anime", "Manga", "Digital Art", "Pixel Art", "Fantasy Art", "Neon Punk", "3D Model", "Painting", "Animation", "Illustration". On top of that, we also chose some mutually exclusive subcategories to allow further diversification of the prompts. Categories and subcategories are randomly sampled, so they are roughly evenly distributed across the dataset.
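As a rough illustration of how uniform random sampling yields that even distribution, consider the sketch below. The subcategory pools shown are placeholders, not the sprint's actual ones.

```python
import random

# Main categories from the post; subcategory pools are illustrative placeholders.
CATEGORIES = {
    "Anime": ["chibi", "mecha"],
    "Photography": ["portrait", "landscape"],
    "Painting": ["oil", "watercolor"],
}

def sample_category() -> tuple[str, str]:
    category = random.choice(list(CATEGORIES))         # uniform over categories
    subcategory = random.choice(CATEGORIES[category])  # uniform within a category
    return category, subcategory
```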
Prompt complexity
The Deita paper showed that increased complexity and diversity of prompts leads to better model generations and fine-tunes, but humans don’t always take the time to write extensive prompts. We therefore decided to use a complex and a simplified version of the same prompt as two data points for the preference generations.
Image generation
The ArtificialAnalysis/Text-to-Image-Leaderboard gives an overview of the best-performing image models. We chose the two best-performing models based on their licenses and availability on the Hub. Additionally, we made sure the two models belonged to different model families, so that preference pairs would compare generations across model families for the different categories. We therefore chose stabilityai/stable-diffusion-3.5-large and black-forest-labs/FLUX.1-dev. Each of these models was then used to generate images for both the simplified and the complex version of the same prompt within the same style category.
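A minimal sketch of how such a preference pair can be generated with diffusers, assuming access to both gated models on the Hub; the sampler settings below are illustrative, not the sprint's exact configuration.

```python
import torch
from diffusers import FluxPipeline, StableDiffusion3Pipeline

# Both models are gated on the Hub; accept their licenses and log in first.
flux = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
sd35 = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "An anime-style harp without strings, set against a dreamy pastel background."

# One generation per model family for the same prompt forms a preference pair.
image_flux = flux(prompt, num_inference_steps=28, guidance_scale=3.5).images[0]
image_sd35 = sd35(prompt, num_inference_steps=28, guidance_scale=4.5).images[0]
```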
Results
The raw export of all annotated data includes responses to a multiple-choice question, where each annotator decided whether one of the models performed better, both performed well, or both performed badly. Based on this, we could look at annotator alignment, check model performance across categories, and even fine-tune a model, which you can already try on the Hub. The annotated dataset is shown below.
Annotator alignment
Annotator agreement is a way to check the validity of a task. Whenever a task is too hard, annotators may be underaligned, and whenever a task is too easy, they may be overaligned. Striking this balance is rare, but we managed to do so during this sprint. We ran this analysis using the Hugging Face Datasets SQL console. Overall, SD3.5-XL was slightly more likely to win within our test setup.
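The same agreement check can be approximated in Python. The dataset id and column names below are assumptions for illustration, so consult the actual schema of the results dataset in the collection before running it.

```python
from collections import Counter
from datasets import load_dataset

# Hypothetical dataset id and columns ('pair_id', 'choice'); check the Hub.
responses = load_dataset(
    "data-is-better-together/open-image-preferences-v1-results", split="train"
)

# Tally how the annotators of each preference pair voted.
votes_per_pair: dict[str, Counter] = {}
for row in responses:
    votes_per_pair.setdefault(row["pair_id"], Counter())[row["choice"]] += 1

# A pair with a single distinct choice means all its annotators agreed.
unanimous = sum(len(votes) == 1 for votes in votes_per_pair.values())
print(f"pairs with unanimous votes: {unanimous / len(votes_per_pair):.1%}")
```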
Model performance
Given the annotator alignment, we found that both models performed better in their own right, so we did an additional analysis to see whether there were differences across categories. In short: FLUX-dev works better for anime, and SD3.5-XL works better for art and cinematic scenarios.
- Ties: Photography, Animation
- FLUX-dev is better at: 3D Model, Anime, Manga
- SD3.5-XL is better at: Movies, Digital Art, Fantasy Art, Illustration, Neon Punk, Painting, Pixel Art
Fine-tuning the model
To validate the quality of the dataset, we did a LoRA fine-tune of the black-forest-labs/FLUX.1-dev model based on the diffusers example on GitHub, without spending too much time or resources. During this process, we included the chosen samples as expected completions for the FLUX-dev model and excluded the rejected ones. Interestingly, the fine-tuned model performs much better in the art and cinematic scenarios where it was initially lacking. You can test the fine-tuned adapter here.
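As a sketch, loading such a FLUX-dev LoRA adapter for inference with diffusers looks like this. The adapter repo id below is a placeholder, not the actual fine-tune linked above, and the sampler settings are illustrative.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
# Hypothetical adapter id; use the actual fine-tune from the collection instead.
pipe.load_lora_weights("data-is-better-together/open-image-preferences-v1-flux-dev-lora")

image = pipe(
    "a boat on the canals of Venice, painted in gouache with soft brushstrokes",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("venice.png")
```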
The original post includes a table comparing generations from the original and the fine-tuned model for the following prompts (the images are omitted here):

- a boat on the canals of Venice, painted in gouache with soft, flowing brushstrokes and vibrant, translucent colors, rich in texture and dynamic perspective, capturing the serene reflections on the water’s surface under a misty atmosphere
- a bright orange poppy flower on a black background, surrounded by an ornate golden frame, rendered in an anime style with bold outlines, exaggerated details, and dramatic chiaroscuro lighting
- a grainy shot of a robot cooking in the kitchen, with soft shadows and a nostalgic film texture
The community
In short, within two weeks we annotated 10,000 preference pairs with an annotator overlap of 2/3, gathering over 30,000 responses from more than 250 community members. The leaderboard shows that some community members even annotated more than 5,000 pairs. We want to thank everyone who participated in this sprint. In particular, the top three annotators will each receive a one-month Hugging Face Pro membership. Make sure to follow them on the Hub: aashish1904, prithivMLmods, Malalatiana.
What’s next?
After yet another successful community sprint, we will continue organizing them on the Hugging Face Hub. Make sure to follow the Data Is Better Together organization to stay informed. We also encourage community members to take initiative themselves, and we are happy to provide guidance and to re-share efforts on socials and within the organization on the Hub. You can contribute in several ways:
- Join another sprint, propose your own sprint, or request a high-quality dataset.
- Fine-tune a model on the preference dataset. One idea is a full SFT fine-tune of SDXL or FLUX-schnell; another is DPO/ORPO fine-tuning.
- Evaluate the performance improvement of the LoRA adapter compared to the original SD3.5-XL and FLUX-dev models.