Recently, Argilla and Hugging Face launched Data is Better Together, an experiment to collectively build a preference dataset of prompt rankings. Within a few days, we had:
- Over 11,000 prompt ratings
- 350 community contributors labeling data
Check out the progress dashboard for the latest statistics!
This resulted in the release of 10k_prompts_ranked, a dataset consisting of 10,000 prompts with user ratings for prompt quality. We want to enable many more projects like this!
In this post, we explain why we think it is essential for the community to collaborate on building datasets, and we share an invitation to join the first cohort of communities building better datasets together.
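To make the idea of a prompt ranking dataset concrete, here is a minimal sketch of how ratings from several contributors could be combined into one ranking. It is not the pipeline used for 10k_prompts_ranked: the record layout, the 1-5 rating scale, the `rank_prompts` helper, and the choice of a plain average are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotation records: (prompt, annotator_id, rating).
# The field layout and the 1-5 scale are assumptions for illustration only.
annotations = [
    ("Explain overfitting to a beginner.", "user_a", 5),
    ("Explain overfitting to a beginner.", "user_b", 4),
    ("hi", "user_a", 1),
    ("hi", "user_c", 2),
    ("Write a haiku about tokenizers.", "user_b", 3),
]


def rank_prompts(records):
    """Average the ratings for each prompt and sort from best to worst."""
    ratings_by_prompt = defaultdict(list)
    for prompt, _annotator, rating in records:
        ratings_by_prompt[prompt].append(rating)

    ranked = [
        {"prompt": prompt, "avg_rating": mean(ratings), "num_ratings": len(ratings)}
        for prompt, ratings in ratings_by_prompt.items()
    ]
    # Highest average rating first.
    ranked.sort(key=lambda row: row["avg_rating"], reverse=True)
    return ranked


ranked = rank_prompts(annotations)
for row in ranked:
    print(f'{row["avg_rating"]:.1f} ({row["num_ratings"]} ratings): {row["prompt"]}')
```

A plain average is the simplest possible choice. A real project would probably also track how much annotators agree with each other and require a minimum number of ratings per prompt before trusting the score.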
Data remains essential for better models
Data continues to be essential for better models. Published research, open-source experiments, and ongoing evidence from the open-source community all indicate that better data can lead to better models.
[Images: "Question." and "Frequent answers."]
Why build datasets together?
Data is vital for machine learning, but many languages, domains, and tasks still lack high-quality datasets for training, evaluation, and benchmarking. The community already shares thousands of models, datasets, and demos every day via the Hugging Face Hub, and this collaboration has allowed the open-access AI community to create incredible things. Enabling the community to build datasets collectively will unlock unique opportunities to build the next generation of datasets, and with them the next generation of models.
By enabling communities to collectively build and improve datasets, people can:
- Contribute to the development of open-source ML, with no ML or programming skills required.
- Create chat datasets for a particular language.
- Develop benchmark datasets for a specific domain.
- Create preference datasets from a diverse range of participants.
- Build datasets for a particular task.
- Build entirely new types of datasets together as a community.
Importantly, we believe that building datasets together will allow the community to build better datasets, and will allow people who don't know how to code to contribute to AI development.
Make it easier for people to contribute
One of the challenges in many previous efforts to build AI datasets collectively was setting up an efficient annotation task. Argilla is an open-source tool that helps you create datasets for LLMs and for smaller, specialized task-specific models. Hugging Face Spaces is a platform for building and hosting machine learning demos and applications. Recently, Argilla added support for authentication via Hugging Face accounts for Argilla instances hosted on Spaces. This means it now takes only a few seconds for a user to start contributing to an annotation task.
Now that we have stress-tested this new workflow while creating the 10k_prompts_ranked dataset, we want to support the community in launching new collective dataset efforts.
Join the first cohort of communities who want to build better datasets together!
We're very excited about the possibilities unlocked by this new, simple flow for hosting annotation tasks. To support the community in building better datasets, Hugging Face and Argilla invite interested people and communities to join the first cohort of community dataset builders.
People participating in this cohort will:
- Be supported in creating an Argilla Space with Hugging Face authentication. Hugging Face will grant participants free persistent storage and improved CPU Spaces.
- Have their comms and promotion of the initiative amplified by Argilla and Hugging Face.
- Be invited to join a cohort community channel.
Our goal is to support the community in building better datasets together. We are open to many ideas and want to help as much as we can.
What kinds of projects are we looking for?
We are open to supporting many types of projects, especially those from existing open-source communities. We are particularly interested in projects that focus on building datasets for languages, domains, and tasks that are currently underrepresented in the open-source community. Our only limitation at the moment is that we are primarily focused on text-based datasets. If you have a very cool idea for a multimodal dataset, we would love to hear from you, but we may not be able to support you in this first cohort.
Tasks can be fully open or open only to members of a particular Hugging Face Hub organization.
If you would like to be part of the first cohort, please join us in the #data-is-better-together channel on the Hugging Face Discord and let us know what you would like to build together!
We look forward to building better datasets together with you!