Versa AI hub

Open LLM Leaderboard: DROP Deep Dive

By versatileai, August 27, 2025

Three new benchmarks were recently added to the Open LLM Leaderboard: Winogrande, GSM8K and DROP, using the original implementations reproduced in the EleutherAI Harness. A cursory look at the DROP scores revealed that something strange was going on. We went on a deep dive to understand what was happening.

Initial observations

DROP (Discrete Reasoning Over Paragraphs) is an evaluation in which models must extract relevant information from English-text paragraphs before executing discrete reasoning steps on them (for example, sorting or counting items to arrive at the correct answer; see the table below for examples). The metrics used are custom F1 and exact match scores.

Examples of reasoning and paragraphs from the original article.

Three weeks ago, we added it to the Open LLM Leaderboard and observed that the F1 scores of pretrained models followed an unexpected trend. When we plotted DROP scores against the leaderboard’s original average (of ARC, HellaSwag, TruthfulQA and MMLU), which is a reasonable proxy for overall model performance, we expected the DROP scores to correlate with it, with better models scoring higher. However, this was only true for a small number of models; all the others had very low DROP F1 scores, below 10.

Two trends can be observed in the DROP scores: some follow the average (along the diagonal), while others are stuck around 5 (vertical line on the right side of the graph).

Normalization interrogations

During our first deeper dive into this surprising behavior, we observed that the normalization step was possibly not working as intended: in some cases, it ignored correct numerical answers when they were directly followed by a whitespace character other than a space (a line return, for example). Let’s look at an example, with the generation being 10\n\nPassage: The 2011 census recorded a population of 1,001,360, and the gold answer being 10.

Normalization happens in several steps, for both the generation and the gold answer:

1. Split on separators (|, - or space). The beginning of the generation, 10\n\nPassage:, contains no such separator and is therefore considered a single entity after this step.
2. Punctuation removal. The first token then becomes 10\n\nPassage (the : is removed).
3. Homogenization of numbers. Every string that can be cast to float is considered a number; it is cast to float, then converted back to a string. 10\n\nPassage stays the same, as it cannot be cast to float, whereas the gold 10 becomes 10.0.
4. Other steps. Many other normalization steps follow (removing articles, removing other whitespace, etc.), and our original example becomes 10 passage 2011.0 census recorded population of 1001360.0.
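The normalization steps can be sketched in a few lines of Python. This is a simplified illustration written from the description in this article, not the official DROP or LM Eval Harness code; the function names `is_number` and `normalize` are hypothetical.

```python
import re
import string

# Articles removed during normalization (assumed to be a, an, the).
ARTICLES = re.compile(r"\b(a|an|the)\b")


def is_number(text: str) -> bool:
    """True when Python can cast the string to float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def normalize(answer: str) -> str:
    """Simplified DROP-style normalization (illustrative only)."""
    tokens = []
    # Step 1: split on the separators named in the text (|, - or space).
    for token in re.split(r"[|\- ]", answer):
        token = token.lower()
        # Step 2: strip punctuation from tokens that are not already numbers.
        if not is_number(token):
            token = "".join(ch for ch in token if ch not in string.punctuation)
        # Step 3: homogenize numbers by round-tripping through float.
        if is_number(token):
            token = str(float(token))
        # Step 4: drop articles, then collapse any remaining whitespace.
        token = " ".join(ARTICLES.sub(" ", token).split())
        if token:
            tokens.append(token)
    return " ".join(tokens)


generation = "10\n\nPassage: The 2011 census recorded a population of 1,001,360"
print(normalize(generation))  # 10 passage 2011.0 census recorded population of 1001360.0
print(normalize("10"))        # 10.0
```

Because the line returns glue 10 to Passage, the token fails the float cast in step 3 and is only split apart later by the whitespace cleanup, so it stays 10 while the gold answer becomes 10.0.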

However, the overall score is not computed on the string, but on the bag of words (BOW) extracted from it, here {'recorded', 'population', 'passage', 'census', '2011.0', '1001360.0', '10'}. This is compared with the BOW of the gold answer, normalized in the same way: {'10.0'}. As you can see, the two don’t intersect, even though the model predicted the correct output!
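To make the missing overlap concrete, here is a minimal sketch of an F1 score over two bags of words. It treats the bags as plain sets and leaves out the extra matching rules of the real DROP metric; `bag_f1` is a hypothetical helper, and the bags are the ones from the example above.

```python
def bag_f1(predicted: set, gold: set) -> float:
    """F1 between two bags of words, treated as plain sets in this sketch."""
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


predicted_bag = {"recorded", "population", "passage", "census", "2011.0", "1001360.0", "10"}
gold_bag = {"10.0"}

print(predicted_bag & gold_bag)         # set()
print(bag_f1(predicted_bag, gold_bag))  # 0.0

# Hypothetical counterfactual: had the leading number been normalized to 10.0,
# one token would overlap, but the six extra words would still keep precision low.
fixed_bag = (predicted_bag - {"10"}) | {"10.0"}
print(round(bag_f1(fixed_bag, gold_bag), 2))  # 0.25
```

The counterfactual also shows why extra generated words hurt: even with the number matched, precision is only 1 in 7.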

In summary, if a number is followed by any kind of whitespace other than a simple space, it will not pass through the number normalization, and so it will never match the gold answer if that is also a number! This first issue was likely to mess up the scores quite a bit, but clearly it was not the only factor causing DROP scores to be so low. We decided to investigate a bit more.

Diving into the results

Extending our investigation, our friends at Zeno joined us and undertook a much more thorough exploration of the results. They looked at five models that were representative of the problems we noticed in DROP scores: Falcon-180B and Mistral-7B were underperforming compared to what we expected, Yi-34B and Tigerbot-70B had very good DROP performance that correlated with their average scores, and XGLM-7.5B fell in the middle.

If you want to, you can try analyzing the results yourself in the Zeno project here!

The Zeno team found two even more concerning features:

1. Not a single model got a correct result on floating-point answers.
2. High-quality models that generate long answers actually have a lower F1 score.

At this point, we believed that both failure cases were actually caused by the same root factor: using . as a stop word token (to end the generation):

1. Floating-point answers are systematically interrupted before their generation is complete.
2. Higher-quality models, which try to match the few-shot prompt format, generate Answer\n\nPlausible prompt for the next question. and only stop on the first . of that plausible continuation, after the actual answer. They therefore generate too many words and get a bad F1 score.

We hypothesized that both of these problems could be fixed by using \n instead of . as the end-of-generation stop word.
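The effect of the stop word on both failure cases can be simulated with plain string handling. This is only a sketch: `truncate_at_stop` is a hypothetical helper that mimics cutting a generation at the first stop sequence, and the two example generations are made up for illustration.

```python
def truncate_at_stop(generation: str, stop: str) -> str:
    """Return the text produced before the first occurrence of `stop`."""
    index = generation.find(stop)
    return generation if index == -1 else generation[:index]


# Case 1: a floating-point answer followed by a plausible continuation.
float_generation = "12.25\n\nPassage: Another passage follows."
# Case 2: a correct short answer followed by a long plausible continuation.
long_generation = "10\n\nPassage: The 2011 census recorded a population of 1,001,360. Question"

print(repr(truncate_at_stop(float_generation, ".")))   # '12'
print(repr(truncate_at_stop(float_generation, "\n")))  # '12.25'
print(repr(truncate_at_stop(long_generation, ".")))    # '10\n\nPassage: The 2011 census recorded a population of 1,001,360'
print(repr(truncate_at_stop(long_generation, "\n")))   # '10'
```

With . as the stop word, the float is cut at its decimal point and the short answer drags a whole sentence along with it; with \n, both generations stop right after the answer.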

Changing the end-of-generation token

So we gave it a try! We investigated using \n as the end-of-generation token on the available results. We split each generated answer on the first \n it contained, if one was present, and recomputed the scores. Note that this is only an approximation of the correct result, as it won’t fix answers that were cut too early on . (for example, floating-point answers). But it also won’t give an unfair advantage to any model, as all of them were affected by this problem. It is the best we could do without rerunning the models (we wanted to keep the community posted as soon as possible).
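A rough sketch of this post-hoc re-splitting, assuming the stored generations are plain strings that were already cut at the first . character. The helpers `split_on_first_newline` and `same_number` are hypothetical, the stored examples are made up, and the comparison is a bare numeric match standing in for the real scoring.

```python
def split_on_first_newline(generation: str) -> str:
    """Keep only the text before the first line return, if there is one."""
    return generation.split("\n", 1)[0]


def same_number(prediction: str, gold: str) -> bool:
    """Bare numeric comparison, used here instead of the full metric."""
    try:
        return float(prediction) == float(gold)
    except ValueError:
        return False


# (stored generation, gold answer) pairs, already truncated at the first "."
stored = [
    ("10\n\nPassage: The 2011 census recorded a population of 1,001,360", "10"),
    ("12", "12.25"),
]

for generation, gold in stored:
    answer = split_on_first_newline(generation)
    print(repr(answer), same_number(answer, gold))
# '10' True
# '12' False
```

The first answer is recovered, while the float that was cut at its decimal point stays wrong, which is why the recomputed scores are only an approximation.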

The results were the following: scores obtained by splitting on \n correlate really well with the other scores, and therefore with overall performance.

In orange, the scores computed on the new strings correlate much better with the average performance.

So, what’s next?

A quick calculation shows that rerunning the full evaluation of all models would be quite costly (the full update took 8 years of GPU time, a lot of it taken by DROP), so we estimated how much it would cost to rerun only the failing examples.

In 10% of cases, the gold answer is a floating-point number (for example 12.25) and the model’s prediction starts with the correct beginning (12, in our example) but is cut off on a . character. These predictions would likely have been correct if generation had continued, so we would definitely need to rerun them! Our estimate does not count generated sentences that end with a number which was possibly interrupted (40% of the other generations), nor any prediction messed up by its normalization.

To get correct results, we would thus need to rerun more than 50% of the examples, a huge amount of GPU time! We need to be certain that the implementation we run is correct this time.
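As a back-of-the-envelope check, the two shares quoted above can be combined as follows. This assumes that "40% of the other generations" means 40% of the examples outside the 10% float cases; that reading is an assumption, and the variable names are illustrative.

```python
# Share of examples whose gold answer is a float cut at the decimal point.
float_cut_share = 0.10
# Share of the *other* examples ending in a possibly interrupted number (assumed reading).
possibly_interrupted_share = 0.40

# Lower bound on the examples to rerun, before counting normalization failures.
lower_bound = float_cut_share + possibly_interrupted_share * (1 - float_cut_share)
print(f"{lower_bound:.0%}")  # 46%
```

Under this reading the two categories alone cover roughly 46% of the examples; predictions broken by normalization come on top of that, which is consistent with the estimate of more than 50%.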

After discussing with the fantastic EleutherAI team (both on GitHub and internally), who guided us through the code and helped our investigation, it became very clear that the LM Eval Harness implementation follows the “official DROP” code very strictly: a new version of this benchmark’s evaluation therefore needs to be developed. So we made the decision to remove DROP from the Open LLM Leaderboard until a new version arises.

One takeaway from this investigation is the value of having the many eyes of the community collaboratively investigate a benchmark in order to detect errors that were previously missed. Here again, the power of open source, community, and developing in the open shines, in that it allows a transparent investigation of the root cause of an issue in a benchmark that has been out there for a couple of years.

We hope that interested members of the community will join forces with academics working on DROP evaluation to fix both its scoring and its normalization. We would love for it to become usable again, as the dataset itself is really quite interesting and cool. We encourage you to provide feedback on how we should evaluate DROP on this issue.

Thanks to the many community members who pointed out issues with DROP scores, and many thanks to the EleutherAI Harness and Zeno teams for their great help on this issue.
