Over the past few months, we have been working on Data Is Better Together (DIBT), a collaboration between Hugging Face and Argilla. With the support of the open source ML community, our goal has been to enable collectively creating datasets that have an impact on the open source ecosystem.
We have now decided to move forward with the same goal. This post is organized into two sections, community efforts and cookbook efforts, to give an overview of what has been achieved and of the tasks everyone can contribute to.
Community efforts
The first step in this initiative focused on the prompt ranking project. Our goal was to create a dataset of 10K synthetic and human-generated prompts ranked by quality. The community response was immediate!
Within a few days, over 385 people had participated. We have released the DIBT/10k_prompts_ranked dataset, which can be used for prompt-ranking tasks or for synthetic data generation. The dataset has already been used to build new models such as SPIN.
Looking at the global support from the community, we realized that English-centric data alone is not enough and that open LLMs lack language-specific benchmarks. So we created the Multilingual Prompt Evaluation Project (MPEP) with the aim of developing leaderboards for multiple languages. To that end, a subset of 500 high-quality prompts from DIBT/10k_prompts_ranked was selected to be translated into different languages.
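To make the selection step above concrete, here is a minimal sketch of picking a high-quality subset of ranked prompts, similar in spirit to choosing the 500 MPEP prompts. The record fields (`prompt`, `avg_rating`, `num_responses`) and thresholds are illustrative assumptions, not the authoritative dataset schema or selection procedure:

```python
# Sketch: keep prompts rated highly by several annotators, best-rated first.
# Field names and thresholds are assumptions for illustration only.

def select_top_prompts(records, min_rating=4.0, min_responses=2, limit=500):
    """Filter ranked prompt records and return up to `limit` of the best."""
    qualified = [
        r for r in records
        if r["avg_rating"] >= min_rating and r["num_responses"] >= min_responses
    ]
    # Highest average rating first.
    qualified.sort(key=lambda r: r["avg_rating"], reverse=True)
    return qualified[:limit]

ranked = [
    {"prompt": "Explain quicksort.", "avg_rating": 4.7, "num_responses": 3},
    {"prompt": "Write a haiku.", "avg_rating": 3.2, "num_responses": 5},
    {"prompt": "Summarize this article.", "avg_rating": 4.1, "num_responses": 1},
]
subset = select_top_prompts(ranked)
# Only the first record passes both the rating and response-count thresholds.
```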
Over 18 language leads have created spaces for translation. Dutch, Russian, and Spanish translations are complete, and more efforts are working toward a full translation of the prompts.
Creating a dataset builder community on Discord
We will continue to support our community’s efforts in the future, focusing on building datasets through tools and documentation.
Cookbook efforts
As part of DIBT, we also created guides and tools to help the community build valuable datasets themselves.
Domain-specific datasets: bootstrap the creation of more domain-specific datasets for training models by bringing together engineers and domain experts.
DPO/ORPO datasets: nurture a community of people building more DPO-style datasets for different languages, domains, and tasks.
KTO datasets: allow communities to create their own KTO datasets.
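As a rough illustration of the two preference-data formats named above (the helper and field names here are ours, not from any specific cookbook guide): DPO-style records pair a preferred and a rejected response to the same prompt, while KTO-style records attach a simple binary desirability label to a single completion.

```python
# Illustrative sketches of the two record shapes; field names are assumptions.

# DPO-style: each example pairs a preferred ("chosen") and a "rejected" response.
def make_dpo_record(prompt, chosen, rejected):
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# KTO-style: each example is one completion with a binary desirability label.
def make_kto_record(prompt, completion, desirable):
    return {"prompt": prompt, "completion": completion, "label": desirable}

dpo_example = make_dpo_record(
    "What is the capital of France?",
    "The capital of France is Paris.",
    "France's capital is Berlin.",
)
kto_example = make_kto_record(
    "What is the capital of France?",
    "The capital of France is Paris.",
    True,
)
```

The practical difference is that KTO data is cheaper to collect, since annotators only give a thumbs-up or thumbs-down on one completion instead of comparing two.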
What did we learn?
The community is eager to participate in these efforts and excited to work collectively on datasets. There are existing inequalities that must be overcome to ensure inclusive and comprehensive benchmarks. Currently, the open source community lacks good datasets for specific languages, domains, and tasks. Many of the tools the community needs to collaborate effectively on building valuable datasets are still missing.
How can I get involved?
Follow the directions in the README of the project you are interested in to share datasets and results with the community, and contribute to the cookbook efforts by providing new guides and tools for everyone. Your contributions are invaluable in helping us build robust and inclusive resources for everyone.
If you want to participate, join us in the #data-is-better-together channel of the Hugging Face Discord.
We look forward to building better datasets with you!