Versa AI hub
Data Is Better Together: A Look Back and Forward

By versatileai | April 22, 2025 | 3 Mins Read
Over the past few months, we have been working on the Data Is Better Together (DIBT) initiative. With this collaboration between Hugging Face and Argilla, and with the support of the open source ML community, our goal has been to empower the open source community to create impactful datasets collectively.

Now we have decided to move forward with the same goal. To give an overview of what has been achieved and of the tasks anyone can contribute to, we have organized this post into two sections: community initiatives and cookbook initiatives.

Community initiatives

The first step in this initiative was the prompt ranking project. Our goal was to create a dataset of 10K prompts, both synthetic and human-generated, ranked by quality. The community response was immediate!

Within a few days, over 385 people had joined. We have released the DIBT/10k_prompts_ranked dataset, intended for prompt ranking tasks or synthetic data generation. The dataset has been used to build new models, such as SPIN.

Seeing the global support from the community, we realized that English-centric data alone is not enough and that there are not enough language-specific benchmarks for open LLMs. So we created the Multilingual Prompt Evaluation Project (MPEP), with the aim of developing a leaderboard for multiple languages. To that end, a subset of 500 high-quality prompts from DIBT/10k_prompts_ranked was selected to be translated into different languages.

More than 18 language leaders have created Spaces for the translations. Translations into Dutch, Russian, and Spanish are complete, and many more efforts are working towards full translations of the prompts. The project has also brought together a community of dataset builders on Discord.
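As a rough illustration of how a high-quality subset like this could be picked, here is a minimal sketch. It is not the project's actual selection code: the record layout (`prompt`, `ratings`), the `min_responses` threshold, and the `select_top_prompts` helper are all assumptions made for the example, and the sample data is invented.

```python
from statistics import mean


def select_top_prompts(records, k=500, min_responses=2):
    """Return the k prompts with the highest mean rating.

    `records` is a list of dicts with a `prompt` string and a list of
    numeric `ratings` (hypothetical field names for this sketch).
    """
    scored = []
    for rec in records:
        ratings = rec["ratings"]
        if len(ratings) < min_responses:
            continue  # too few annotations to trust the average
        scored.append({"prompt": rec["prompt"], "avg_rating": mean(ratings)})
    # Highest average first; break ties alphabetically so the result is stable.
    scored.sort(key=lambda r: (-r["avg_rating"], r["prompt"]))
    return scored[:k]


# Invented sample records, only to show the call.
sample = [
    {"prompt": "Explain recursion to a beginner.", "ratings": [5, 4, 5]},
    {"prompt": "hi", "ratings": [1, 2]},
    {"prompt": "Summarize the causes of World War I.", "ratings": [4, 4]},
    {"prompt": "Write a haiku about rain.", "ratings": [5]},
]
top = select_top_prompts(sample, k=2)
```

Requiring a minimum number of ratings per prompt is one simple way to avoid promoting a prompt on the strength of a single annotator; a real pipeline might also weigh annotator agreement.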

We will continue to support our community’s efforts in the future, focusing on building datasets through tools and documentation.

Cookbook initiatives

As part of DIBT, we also created guides and tools to help the community build valuable datasets themselves.

  • Domain-specific datasets: bootstrap the creation of more domain-specific datasets for training models, bringing together engineers and domain experts.
  • DPO/ORPO datasets: help foster a community of people building more DPO-style datasets for different languages, domains, and tasks.
  • KTO datasets: help the community create their own KTO datasets.
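To make the difference between the two dataset styles concrete, the sketch below converts DPO-style preference pairs into KTO-style examples. DPO-style data pairs a preferred and a rejected response for the same prompt, while KTO-style data labels single responses as desirable or not. The field names (`prompt`, `chosen`, `rejected`, `completion`, `label`) follow a common convention but should be read as assumptions here, and `dpo_to_kto` is a hypothetical helper, not part of the cookbook.

```python
def dpo_to_kto(pairs):
    """Split each preference pair into two unpaired, binary-labelled examples."""
    examples = []
    for pair in pairs:
        # The preferred response becomes a desirable example...
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True}
        )
        # ...and the rejected response becomes an undesirable one.
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False}
        )
    return examples


# Invented preference pair, only to show the shape of the data.
dpo_pairs = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France is a country in Europe.",
    }
]
kto_examples = dpo_to_kto(dpo_pairs)
```

The conversion only goes cleanly in this direction: KTO-style feedback (for instance a thumbs up or down on a single answer) carries no pairing, which is part of why it can be easier for a community to collect.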

What did we learn?

  • The community is eager to participate in these efforts and is excited to work on datasets collectively.
  • There are existing inequalities that must be overcome to ensure inclusive and comprehensive benchmarks.
  • Datasets for specific languages, domains, and tasks are currently underrepresented in the open source community.
  • Many of the tools the community needs to collaborate effectively on building valuable datasets already exist.

How can I get involved?

Follow the instructions in the README of the project you are interested in, share your datasets and results with the community, and contribute to the cookbook efforts by providing new guides and tools for everyone. Your contributions are invaluable in helping us build robust and inclusive resources for everyone.

If you want to take part, join us in the #data-is-better-together channel on the Hugging Face Discord.

We look forward to building better datasets with you!

© 2025 Versa AI Hub. All Rights Reserved.
