Versa AI hub

I’m done trusting black box leaderboards for the community.

By versatileai · February 7, 2026 · 4 min read

TL;DR: Benchmark datasets on the Hugging Face Hub can now host leaderboards. Models store their own eval scores, and everything links together. The community can submit results through pull requests, and a verified badge marks results that have been reproduced.

Evals are breaking down

Let’s be realistic about the state of evals in 2026. MMLU is saturated above 91%. GSM8K is above 94%. HumanEval has been conquered. Yet models that ace these benchmarks still cannot reliably browse the web, write production code, or complete multi-step tasks without hallucinating, according to usage reports. There is a clear gap between benchmark scores and real-world performance.

There are also gaps within the reported benchmark scores themselves. Different sources report different numbers: scores in model cards, papers, and evaluation platforms are often inconsistent, so the community lacks a single source of truth.

What we’re shipping

Decentralized and transparent evaluation reporting.

We are taking evaluation on the Hugging Face Hub in a new direction by decentralizing reporting and letting the entire community report benchmark scores openly. We are starting with a shortlist of four benchmarks and will expand to the most relevant benchmarks over time.

For benchmarks: dataset repositories can now be registered as benchmarks (MMLU-Pro, GPQA, and HLE are already live). Results reported from across the Hub are automatically aggregated, and a leaderboard is displayed on the dataset card. Each benchmark defines an eval specification in an eval.yaml based on the Inspect AI format, so anyone can reproduce it; reported results must match that task definition.
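To make this concrete, a minimal eval.yaml might look something like the sketch below. The field names here are illustrative assumptions, not the actual schema; consult the documentation for the real specification.

```yaml
# Hypothetical eval.yaml for a benchmark dataset repository.
# All field names are illustrative assumptions, not the real spec.
task: my_benchmark          # Inspect AI task that reported results must match
dataset: your-org/your-benchmark
solver: multiple_choice     # how the task is run
metrics:
  - name: accuracy
```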

[Image: leaderboard on a benchmark dataset card]

For models: eval scores live in .eval_results/*.yaml files in the model repository. They appear on the model card and feed into the benchmark dataset. Both the model author’s own results and open pull requests with results are aggregated, and model authors can close a score PR to hide those results.
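As a sketch, a file such as .eval_results/mmlu_pro.yaml might carry fields along these lines (again, the names are assumptions rather than the real schema):

```yaml
# Hypothetical .eval_results/mmlu_pro.yaml in a model repository.
# Field names are illustrative assumptions, not the real schema.
benchmark: your-org/your-benchmark   # benchmark dataset this score reports against
metric: accuracy
value: 0.71
source: https://example.com/eval-log # e.g. an Inspect eval log, paper, or platform
```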

For the community: any user can submit evaluation results for any model via a PR. The results are displayed as “community” results without waiting for the model author to merge. Submitters can link to sources such as papers, model cards, third-party evaluation platforms, and Inspect eval logs, and scores can be discussed directly on the PR. Since the Hub is Git-based, there is a full history of when evals were added or changed. The source looks like this:

[Image: community eval result and its source on a model card]

For more information on evaluation results, please see our documentation.

[Image: eval scores displayed on a Hub model card]

Why this matters

Decentralizing evaluation surfaces scores that already exist across the community in model cards and papers. Publishing them lets the community aggregate, track, and understand scores across domains. All scores are also exposed through the Hub API, making it easy to build curated leaderboards, dashboards, and more.
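The shape of such an aggregation might look like the sketch below. The record format is an illustrative assumption standing in for parsed .eval_results/*.yaml files; the real payload would come from the Hub API.

```python
from collections import defaultdict

# Illustrative eval-result records, shaped like parsed .eval_results/*.yaml
# files. Field names are assumptions, not the actual Hub schema.
results = [
    {"model": "org/model-a", "benchmark": "MMLU-Pro", "score": 0.71, "status": "verified"},
    {"model": "org/model-a", "benchmark": "GPQA", "score": 0.48, "status": "community"},
    {"model": "org/model-b", "benchmark": "MMLU-Pro", "score": 0.69, "status": "community"},
]

def leaderboard(records, benchmark):
    """Build a simple leaderboard: best reported score per model, descending."""
    best = defaultdict(float)
    for r in records:
        if r["benchmark"] == benchmark:
            best[r["model"]] = max(best[r["model"]], r["score"])
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

print(leaderboard(results, "MMLU-Pro"))
# → [('org/model-a', 0.71), ('org/model-b', 0.69)]
```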

Community-reported results do not replace benchmarks themselves, so leaderboards and closed evals with published results still matter. But we believe open evaluation results grounded in reproducible eval specifications are an important contribution to the field.

This does not solve benchmark saturation or close the gap between benchmarks and reality, and it will not stop anyone from training on the test set. But it makes gaming visible by recording what was evaluated, when, by whom, and how.

Above all, we want the Hub to be an active place to build and share reproducible benchmarks, with a particular focus on new tasks and domains that still challenge SOTA models.

Let’s get started

Add evaluation results: publish an evaluation you ran as a YAML file in .eval_results/ in any model repository.

Check scores on benchmark dataset cards.

Register a new benchmark: add an eval.yaml to your dataset repository and contact us to get shortlisted.

This feature is in beta. We are building in the open, and feedback is welcome.
