Open LLM Leaderboard: DROP Deep Dive

Three new benchmarks were recently added to the Open LLM Leaderboard: Winogrande, GSM8K and DROP, using the original implementations reproduced in the EleutherAI Harness. A cursory look at the DROP scores revealed that something strange was going on. We did a deep dive to understand what was happening.

Initial observation

DROP (Discrete Reasoning Over Paragraphs) is an evaluation in which the model must extract relevant information from an English text paragraph before performing discrete reasoning steps on it (for example, sorting or counting items to reach the correct answer; see the table below for examples). The metrics used are a custom F1 score and an exact match score.

Examples of reasoning and paragraphs from the original article.

Three weeks ago, we added it to the Open LLM Leaderboard and observed that the F1 scores of pretrained models followed an unexpected trend. When we plotted DROP scores against the original leaderboard average (ARC, HellaSwag, TruthfulQA, MMLU), we expected the two to be correlated, with better models scoring higher on DROP. However, this was only true for a small number of models; all the others had very low DROP F1 scores, below 10.

Two trends can be observed in the DROP scores: some follow the average (diagonal line), while others are stuck around 5 (vertical line on the right side of the graph).

Normalization interrogation

During our first deeper dive into this surprising behavior, we observed that the normalization step was possibly not working as intended. In some cases, the normalization ignored a correct numerical answer when it was directly followed by a whitespace character other than a space (a line return, for example). Let's take an example: the generation is 10\n\nPassage: The 2011 census recorded a population of 1,001,360, and the gold answer is 10.

Normalization happens in several steps, for both the generation and the gold answer:

1. Split on separators (|, - or a space). The beginning of the generation, 10\n\nPassage:, contains no such separator and is therefore considered a single entity after this step.
2. Punctuation removal. The first token then becomes 10\n\nPassage (the : is removed).
3. Homogenization of numbers. Any string that can be cast to float is considered a number; it is cast to float and then converted back to a string. 10\n\nPassage stays the same, as it cannot be cast to float, whereas the gold 10 becomes 10.0.
4. Other steps. Many other normalization steps follow (removing articles, removing other whitespace, etc.), and our original example becomes 10 passage 2011.0 census recorded population of 1001360.0.
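The steps above can be sketched in a few lines of Python. This is a simplified re-implementation written for illustration, not the harness code itself: the function names (is_number, normalize_token, normalize), the article list and the punctuation set are all assumptions, and the sketch only splits on spaces and hyphens.

```python
import re
import string

# Assumed article list and punctuation set, for illustration only.
ARTICLES = re.compile(r"\b(a|an|the)\b")
PUNCTUATION = set(string.punctuation)


def is_number(text):
    """True if the whole string can be cast to float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def normalize_token(token):
    token = token.lower()
    # Punctuation removal, skipped when the token is already a valid number.
    if not is_number(token):
        token = "".join(ch for ch in token if ch not in PUNCTUATION)
    # Number homogenization: "10" -> "10.0".
    if is_number(token):
        token = str(float(token))
    # Other steps: drop articles, then collapse any remaining whitespace.
    token = ARTICLES.sub(" ", token)
    return " ".join(token.split())


def normalize(answer):
    # Split on spaces and hyphens only: a line return is NOT a separator,
    # so "10\n\nPassage:" reaches the number check as a single token.
    tokens = [normalize_token(t) for t in re.split(r" |-", answer)]
    return " ".join(t for t in tokens if t)


generation = "10\n\nPassage: The 2011 census recorded a population of 1,001,360"
print(normalize("10"))        # 10.0
print(normalize(generation))  # 10 passage 2011.0 census recorded population of 1001360.0
```

In this sketch the leading 10 is only separated from passage by the final whitespace cleanup, which runs after the number check, so it never becomes 10.0.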

However, the overall score is not computed on the string but on the bag of words (BOW) extracted from it, here {"recorded", "population", "passage", "census", "2011.0", "1001360.0", "10"}. This is compared with the bag of words of the gold answer, normalized in the same way: {"10.0"}. As you can see, the two don't intersect, even though the model predicted the correct output!

In summary, if a number is followed by any kind of whitespace other than a simple space, it will not go through the number normalization, and so it will never match the gold answer if that is also a number! This first issue was likely to hurt the scores quite a bit, but it was clearly not the only factor making DROP scores so low. We decided to investigate further.
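To make the consequence concrete, here is a minimal sketch of an F1 score computed on bags of words. It is a simplification under our own assumptions: the real DROP metric has additional logic (for instance for answers with several spans), and bag_f1 is a hypothetical helper name.

```python
def bag_f1(prediction, gold):
    """F1 between the sets of whitespace-separated tokens of two strings."""
    pred_bag, gold_bag = set(prediction.split()), set(gold.split())
    overlap = len(pred_bag & gold_bag)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_bag)
    recall = overlap / len(gold_bag)
    return 2 * precision * recall / (precision + recall)


# Normalized strings from the example discussed in the text.
normalized_prediction = "10 passage 2011.0 census recorded population of 1001360.0"
normalized_gold = "10.0"

print(bag_f1(normalized_prediction, normalized_gold))  # 0.0
print(bag_f1("10.0", normalized_gold))                 # 1.0
```

The token 10 and the token 10.0 are different strings, so the overlap is empty and the prediction scores zero.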

Diving into the results

Extending our investigation, our friends at Zeno joined us and undertook a much more thorough exploration of the results. We looked at five models that were representative of the problems we noticed with DROP scores: falcon-180B and mistral-7B performed poorly compared to what we expected, Yi-34B and tigerbot-70B had very good DROP performance in line with their average scores, and xglm-7.5B fell in the middle.

If you want to, you can try analyzing the results yourself in the Zeno project here!

The Zeno team found two even more concerning features:

1. Not a single model got a correct result on floating point answers.
2. High-quality models that generate long answers actually had lower F1 scores.

At this point, we believed that both failure cases were caused by the same root factor: using . as a stop word token (to end generation):

1. Floating point answers are systematically interrupted before their generation is complete.
2. Higher-quality models, which try to match the few-shot prompt format, generate Answer\n\nPlausible prompt for the next question. and only stop on the first . in that plausible prompt, after the actual answer. They therefore generate too many words and get a bad F1 score.

We hypothesized that both of these issues could be fixed by using \n instead of . as the stop word that ends generation.
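The effect of the stop sequence can be illustrated with a small simulation. This is a sketch under assumptions: truncate_at_stop is a hypothetical helper that cuts a text at the first occurrence of a stop sequence and drops the stop sequence itself, and the "full" generations below are made up to mirror the two failure cases.

```python
def truncate_at_stop(text, stop):
    """Return the text up to (and excluding) the first stop sequence."""
    return text.split(stop, 1)[0]


# Hypothetical unconstrained generations, for illustration only.
float_answer = "12.25\n\nPassage: Another paragraph."
long_answer = "10\n\nPassage: The 2011 census recorded a population of 1,001,360. Question"

# Stopping on "." cuts the float in half and keeps the extra passage text.
print(repr(truncate_at_stop(float_answer, ".")))   # '12'
print(repr(truncate_at_stop(long_answer, ".")))

# Stopping on "\n" keeps the whole answer and nothing else.
print(repr(truncate_at_stop(float_answer, "\n")))  # '12.25'
print(repr(truncate_at_stop(long_answer, "\n")))   # '10'
```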

Changing the end of generation token

So we tried it! We investigated using \n as the end of generation token on the available results. We split each generated answer on its first \n, if one was present, and recomputed the scores. Note that this is only an approximation of the correct results, as it does not fix answers that were cut too early on . (floating point answers, for example). But since all models were affected by this issue, it doesn't give any of them an unfair advantage. It is the best we could do without rerunning the models (we wanted to keep the community posted as soon as possible).

The results we got were the following: scores obtained by splitting on \n correlate very well with the other scores, and therefore with overall performance.

In orange, we can see that the scores computed on the new strings correlate much better with the average performance.
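As a rough sketch of this post-hoc fix, assuming the stored generations are plain strings that were already cut on the first period, the re-splitting amounts to the following. The helper name and the two stored generations are hypothetical.

```python
def resplit_on_newline(stored_generation):
    """Keep only what precedes the first line return, if there is one."""
    return stored_generation.split("\n", 1)[0]


# Hypothetical stored generations (already truncated on "." at inference time).
stored = [
    "10\n\nPassage: The 2011 census recorded a population of 1,001,360",  # gold: 10
    "12",                                                                  # gold: 12.25
]

rescored_inputs = [resplit_on_newline(g) for g in stored]
print(rescored_inputs)  # ['10', '12']
```

The first answer is recovered, but the second stays at 12: the digits after the period were never generated, which is why the rescoring is only an approximation.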

So, what’s next?

A quick calculation shows that rerunning the full evaluation of all models would be extremely expensive (the full update took 8 years of GPU time, much of it spent on DROP), so we estimated how much it would cost to rerun only the failing examples.

In 10% of cases, the gold answer is a floating point number (e.g. 12.25) and the model's prediction starts with the correct beginning (12 in this example) but is cut off on a . character. These predictions would likely have been correct if generation had continued. We definitely need to rerun them! Our estimate does not count generated sentences that end with a number that was possibly interrupted (40% of the other generations), nor any prediction ruined by the normalization.

So, to get correct results, we would need to rerun over 50% of the examples, a huge amount of GPU time. This time, we need to be certain that the implementation we run is correct.
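The arithmetic behind this estimate can be sketched as follows. It rests on one assumption of ours: that "40% of the other generations" means 40% of the 90% of examples that are not already counted in the first group.

```python
# Share of examples with a float gold answer whose prediction was cut on ".".
float_cut = 0.10

# Assumed reading: 40% of the REMAINING examples end with a number that
# may have been interrupted.
possibly_interrupted = 0.40 * (1 - float_cut)

# Lower bound, before counting predictions ruined by normalization.
lower_bound = float_cut + possibly_interrupted
print(f"{lower_bound:.0%}")  # 46%
```

Under that reading, the two groups already cover about 46% of the examples, and the predictions affected by normalization come on top of that.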

After discussing with the fantastic EleutherAI team (both on GitHub and internally), who guided us through the code and helped with the investigation, it became very clear that the LM Eval Harness implementation follows the "official DROP" code very strictly, so a new version of this benchmark's evaluation needs to be developed. We have therefore decided to remove DROP from the Open LLM Leaderboard until a new version arises.

One takeaway from this investigation is the value of having the many eyes of the community look into a benchmark collaboratively to detect previously overlooked errors. Here again, the power of open source, community, and developing in the open shines: it allows for transparent investigation of the root cause of an issue in a benchmark that has been out there for years.

We hope that interested community members will join forces with the academics working on DROP evaluation to fix both its scoring and its normalization. The dataset itself is very interesting and cool, so we would love for it to become usable again. We encourage you to provide feedback on how we should evaluate DROP on this issue.

We would like to thank the many community members who pointed out issues with DROP scores. We would also like to thank the EleutherAI Harness and Zeno teams for their great help on this issue.

versatileai
