TL;DR: Benchmark datasets on Hugging Face can now host leaderboards. Models store their own eval scores. Everything links together. The community can submit results via pull requests. A verified badge marks results that can be reproduced.
Evals are breaking down
Let’s be honest about the state of evals in 2026. MMLU is saturated above 91%. GSM8K is above 94%. HumanEval has been effectively solved. Yet, according to usage reports, models that ace these benchmarks still cannot reliably browse the web, write production code, or complete multi-step tasks without hallucinating. There is a clear gap between benchmark scores and real-world performance.
There is a second gap within the reported scores themselves: different sources report different numbers. Across model cards, papers, and evaluation platforms, reported scores are inconsistent, leaving the community without a single source of truth.
What we’re shipping
Decentralized and transparent evaluation reporting.
We are taking evaluation on the Hugging Face Hub in a new direction: reporting is decentralized, and the entire community can openly report benchmark scores. We are starting with a shortlist of four benchmarks and will expand to the most relevant benchmarks over time.
For benchmarks: dataset repositories can now be registered as benchmarks (MMLU-Pro, GPQA, and HLE are already live). Results reported across the Hub are automatically aggregated, and a leaderboard is displayed on the dataset card. Each benchmark defines an eval specification in an eval.yaml based on the Inspect AI format, so anyone can reproduce it; reported results must match that task definition.
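To make this concrete, here is a hedged sketch of what a benchmark’s eval.yaml might contain. Every field name below is an assumption for illustration, not the official schema (see the documentation for the real specification); the point is simply that the file pins down the dataset, solver, scorer, and metrics so a run can be reproduced with Inspect AI.

```yaml
# eval.yaml — illustrative sketch only; field names are hypothetical,
# not the official schema. The spec fixes how the benchmark must be run
# so that reported results are reproducible.
name: gpqa-diamond
framework: inspect-ai        # results should come from an Inspect AI run
dataset:
  repo: org/benchmark-dataset  # the benchmark's own dataset repository
  split: test
task:
  solver: multiple_choice      # how samples are presented to the model
  scorer: choice               # how answers are graded
metrics:
  - accuracy
  - stderr
```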

For models: eval scores live in .eval_results/*.yaml files in the model repository. They appear on the model card and feed into the benchmark dataset’s leaderboard. Both results merged by the model authors and open pull requests with results are aggregated. Model authors can close a score PR to hide its results.
For the community: any user can submit evaluation results for any model via a pull request. These results are displayed as “community” results without waiting for the model author to merge them. Reports can link to sources such as papers, model cards, third-party evaluation platforms, and Inspect eval logs. Scores can be discussed directly on the PR. And because the Hub is Git-based, there is a full history of when evals were added and changed. An illustrative sketch of such a result file is shown below.
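This is a minimal sketch of what a reported result under .eval_results/ could contain; the field names are assumptions for illustration, not the official format (the documentation linked below has the real schema):

```yaml
# .eval_results/gpqa.yaml — illustrative sketch; field names are
# hypothetical, not the official schema.
benchmark: org/gpqa-benchmark            # the registered benchmark dataset repo
metric: accuracy
value: 0.61
source: https://example.com/eval-report  # paper, eval platform, or Inspect log
date: 2026-01-15
```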

For more information on evaluation results, please see our documentation.
Why this matters
Decentralizing evaluation surfaces scores that already exist across the community in model cards, papers, and other sources. Publishing them in one place lets the community aggregate, track, and understand scores across the field. All scores are also exposed through the Hub API, making it easy to aggregate them and build curated leaderboards, dashboards, and more.
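As a rough sketch of the kind of aggregation this enables, here is one way to pull reported result files from a model repository with huggingface_hub and print them. The repo id and the YAML field names are assumptions for illustration, and the Hub API may well expose a more direct endpoint for this.

```python
# Sketch: collect reported eval results from a model repo's .eval_results/
# folder. The repo id and YAML field names ("benchmark", "metric", "value")
# are illustrative assumptions, not a fixed schema.
import yaml
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
repo_id = "org/some-model"  # hypothetical model repository

# List files in the model repo and keep only the reported eval results.
result_files = [
    f for f in api.list_repo_files(repo_id, repo_type="model")
    if f.startswith(".eval_results/") and f.endswith(".yaml")
]

for path in result_files:
    local_path = hf_hub_download(repo_id, path, repo_type="model")
    with open(local_path) as fh:
        result = yaml.safe_load(fh)
    print(f"{result['benchmark']}: {result['metric']} = {result['value']}")
```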
Community-reported scores do not replace benchmarks; leaderboards and closed evals with published results remain important. But we believe open evaluation results, grounded in reproducible eval specifications, are an important contribution to this space.
This does not solve benchmark saturation or close the gap between benchmarks and reality. Nor will it stop anyone from training on the test set. But it does make gaming visible, by recording what was evaluated, when, by whom, and how.
Above all, we want the Hub to be a lively place to build and share reproducible benchmarks, especially new tasks and domains that genuinely challenge SOTA models.
Let’s get started
Add evaluation results: publish an evaluation you have run as a YAML file under .eval_results/ in any model repository (see the sketch after this list).
Check the aggregated scores on a benchmark dataset’s leaderboard.
Register a new benchmark: add an eval.yaml to your dataset repository and contact us to get it shortlisted.
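For the first step, a hedged sketch of how a result could be submitted as a pull request with huggingface_hub. The .eval_results/ path convention comes from this post; the repo id, file name, and YAML contents are illustrative assumptions.

```python
# Sketch: submit an eval result to a model repo as a pull request.
# Repo id, file name, and YAML fields below are illustrative assumptions.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in (e.g. via `huggingface-cli login`)

result_yaml = """\
benchmark: org/gpqa-benchmark   # hypothetical benchmark dataset repo
metric: accuracy
value: 0.61
source: https://example.com/eval-report
"""

# create_pr=True opens a pull request instead of committing to main,
# so the model authors and the community can review the result.
api.upload_file(
    path_or_fileobj=result_yaml.encode(),
    path_in_repo=".eval_results/gpqa.yaml",
    repo_id="org/some-model",   # hypothetical model repository
    repo_type="model",
    create_pr=True,
    commit_message="Add community GPQA result",
)
```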
This feature is in beta; we are building it in the open. Feedback is welcome.

