I’m done trusting black box leaderboards for the community.

By versatileai | February 7, 2026 | 4 min read

TL;DR: Benchmark datasets on the Hugging Face Hub can now host leaderboards. Models store their own eval results. Everything is linked together. The community can submit results via pull requests, and a verified badge shows that results can be reproduced.

Evals are breaking down

Let’s be realistic about the state of evals in 2026. MMLU is saturated at over 91%. GSM8K has passed 94%. HumanEval has effectively been solved. Yet, based on usage reports, models that ace these benchmarks still cannot reliably browse the web, write production code, or complete multi-step tasks without hallucinating. There is a clear gap between benchmark scores and real-world performance.

There is also a gap within the reported scores themselves. Model cards, papers, and evaluation platforms often report different numbers for the same model and benchmark. As a result, the community lacks a single source of truth.

What we’re shipping

Decentralized and transparent evaluation reporting.

We are taking evaluation on the Hugging Face Hub in a new direction: decentralizing reporting so the whole community can openly report benchmark scores. We are starting with a shortlist of four benchmarks and will expand to the most relevant benchmarks over time.

For benchmarks: dataset repositories can now be registered as benchmarks (MMLU-Pro, GPQA, and HLE are already live). Results reported from across the Hub are automatically aggregated, and a leaderboard is displayed on the dataset card. Each benchmark defines an eval specification in an eval.yaml file based on the Inspect AI format, so anyone can reproduce it; reported results must match that task definition.
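As a rough sketch of the idea (the field names below are illustrative, not the official schema), an eval.yaml pinning down the task, split, and metric might look something like this:

```yaml
# Hypothetical eval.yaml in a benchmark dataset repo -- illustrative fields only.
# It fixes how the benchmark must be run so that reported scores are comparable.
benchmark: example-org/mmlu-pro        # dataset repo acting as the benchmark (assumed id)
task: inspect_evals/mmlu_pro           # Inspect AI task defining prompting and scoring (assumed)
split: test
metrics:
  - accuracy
few_shot: 0
```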


For models: eval scores live in .eval_results/*.yaml files in the model repository. They appear on the model card and feed into the benchmark dataset. Both the model author's own results and open pull requests with results are aggregated; model authors can close a score PR to hide its results.
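For illustration only, a result file such as .eval_results/mmlu-pro.yaml (a hypothetical file name, with illustrative fields) could look like:

```yaml
# Hypothetical .eval_results/mmlu-pro.yaml in a model repo -- illustrative fields only.
benchmark: example-org/mmlu-pro          # benchmark dataset this score belongs to (assumed id)
model: example-org/example-model-7b      # model being evaluated (assumed id)
metric: accuracy
value: 0.684
date: 2026-02-01
source: https://example.com/eval-report  # paper, eval log, or platform backing the number
```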

For the community: any user can submit evaluation results for any model via a pull request. The results are displayed as “community” results without waiting for the model author to merge. Submissions can link to sources such as papers, model cards, third-party evaluation platforms, and Inspect evaluation logs, and scores can be discussed directly on the PRs. Since the Hub is Git-based, there is a full history of when evals were added and changed.
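As a minimal sketch of that workflow, assuming the huggingface_hub client and purely illustrative repo ids, file names, and YAML fields, a community result could be proposed as a pull request like this:

```python
# Hypothetical sketch: open a pull request adding a community eval result
# to someone else's model repo. Repo id, file name, and YAML fields are illustrative.
from huggingface_hub import HfApi, CommitOperationAdd

result_yaml = """\
benchmark: example-org/mmlu-pro
metric: accuracy
value: 0.684
source: https://example.com/eval-report
"""

api = HfApi()  # assumes you are logged in (e.g. via `huggingface-cli login`)
api.create_commit(
    repo_id="example-org/example-model-7b",             # model you evaluated (assumed id)
    operations=[
        CommitOperationAdd(
            path_in_repo=".eval_results/mmlu-pro.yaml",  # illustrative file name
            path_or_fileobj=result_yaml.encode(),
        )
    ],
    commit_message="Add community MMLU-Pro result",
    create_pr=True,  # open a PR instead of pushing directly
)
```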


For more information on evaluation results, please see our documentation.


Why this matters

Decentralizing evaluation surfaces scores that already exist across the community in sources such as model cards and papers. Publishing them in one place lets the community aggregate, track, and make sense of scores across the field. All scores are also exposed through the Hub API, making it easy to build curated leaderboards, dashboards, and more on top of them.
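As a minimal sketch of that kind of aggregation, assuming illustrative repo ids and that scores can be read straight from the .eval_results/ files described above, a tiny curated leaderboard could be built like this:

```python
# Hypothetical sketch: build a tiny leaderboard by reading .eval_results/*.yaml
# files out of a few model repos. Repo ids are illustrative; a dedicated Hub API
# for aggregated scores may be more convenient than raw file reads.
import yaml
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
models = ["example-org/example-model-7b", "example-org/example-model-13b"]  # assumed ids

rows = []
for repo_id in models:
    for path in api.list_repo_files(repo_id):
        if path.startswith(".eval_results/") and path.endswith(".yaml"):
            with open(hf_hub_download(repo_id=repo_id, filename=path)) as f:
                result = yaml.safe_load(f)
            rows.append((repo_id, result.get("benchmark"), result.get("value")))

# Sort by score, highest first -- a minimal curated leaderboard.
for repo_id, benchmark, value in sorted(rows, key=lambda r: r[2] or 0, reverse=True):
    print(f"{repo_id}  {benchmark}  {value}")
```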

Community-reported results do not replace benchmarks; leaderboards and closed evals with published results remain important. But we believe open evaluation results, grounded in reproducible eval specifications, are an important contribution to the field.

This does not solve benchmark saturation or close the gap between benchmarks and reality, and it will not stop anyone from training on the test set. But it does make gaming visible by exposing what was evaluated, when, by whom, and how.

Above all, we want the Hub to be an active place to build and share reproducible benchmarks, especially new tasks and domains that still challenge SOTA models.

Let’s get started

Add evaluation results: publish an evaluation you ran as a YAML file under .eval_results/ in any model repository (a minimal upload sketch follows these steps).

Check the scores on the benchmark datasets.

Register a new benchmark: add an eval.yaml to your dataset repository and contact us to be added to the shortlist.
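For the first step, a minimal upload sketch (the repo id and file name are illustrative, and the local YAML is assumed to follow whatever schema the docs specify) might look like:

```python
# Hypothetical sketch: publish your own eval result to a model repo you control.
from huggingface_hub import HfApi

HfApi().upload_file(
    path_or_fileobj="mmlu-pro.yaml",             # local YAML with your scores
    path_in_repo=".eval_results/mmlu-pro.yaml",  # illustrative file name
    repo_id="your-org/your-model",               # model repo you control (assumed id)
    commit_message="Add MMLU-Pro eval results",
)
```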

This feature is in beta. We are building in the open, and feedback is welcome.
