Fixing the Open LLM Leaderboard's math evaluations with Math-Verify

By Nathan Habib, February 15, 2025

Three weeks ago, we showed how difficult it is to properly evaluate LLM performance on mathematical problems and introduced Math-Verify, a better solution for verifying model answers to math questions (read more in the introduction post)!

Today, we have used Math-Verify to thoroughly re-evaluate all 3,751 models submitted to the Open LLM Leaderboard so far, and we are sharing more unbiased and robust model comparisons. We're excited!

Why math evaluations on the Open LLM Leaderboard were broken

The Open LLM Leaderboard is the most used leaderboard on the Hugging Face Hub: it compares the performance of open large language models (LLMs) across a variety of tasks. One of these tasks, MATH-Hard, focuses specifically on mathematical problems: it evaluates how well LLMs solve high-school and college-level math. It uses the highest difficulty level (Level 5) of the Hendrycks MATH dataset, spread over seven topics (Prealgebra, Algebra, Intermediate Algebra, Precalculus, Counting and Probability, Number Theory, and Geometry), in a 5-shot setting: models are given 5 example prompts that show how they should answer.
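
For reference, here is a minimal sketch of how one could pull those Level 5 problems from the Hub; the dataset id (EleutherAI/hendrycks_math), the subject name, and the field names are assumptions for illustration, not the leaderboard's actual pipeline.

```python
from datasets import load_dataset

# Assumed Hub mirror of the Hendrycks MATH dataset; one config per topic.
algebra = load_dataset("EleutherAI/hendrycks_math", "algebra", split="test")

# MATH-Hard only keeps the hardest problems.
level5 = algebra.filter(lambda example: example["level"] == "Level 5")
print(len(level5), level5[0]["problem"])
```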

A typical question looks like this:

For all real numbers $r$ and $s$, define the mathematical operation $\#$ such that the following conditions apply: $r\ \#\ 0 = r$, $r\ \#\ s = s\ \#\ r$, and $(r + 1)\ \#\ s = (r\ \#\ s) + s + 1$. What is the value of $11\ \#\ 5$?

The answer is:

71
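
For intuition, the target value can be derived from the three defining properties (this derivation is ours, not part of the dataset's reference solution): unrolling $(r + 1)\ \#\ s = (r\ \#\ s) + s + 1$ down to $r = 0$ gives $r\ \#\ s = (0\ \#\ s) + r(s + 1)$, and since $0\ \#\ s = s\ \#\ 0 = s$, we get $11\ \#\ 5 = 5 + 11 \cdot 6 = 71$.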

On the leaderboard, the model must end its answer with a very specific string (following the Minerva-Math paper):

“The final answer is (answer). I hope it’s correct.”

The leaderboard then parses (answer) with SymPy, converts it into a symbolic representation (simplifying the value if necessary), and finally compares it to the gold target.
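
To make these three steps concrete, here is a minimal, hypothetical sketch of such an extract / parse / compare pipeline; the regex, the helper names, and the use of sympy.parsing.latex are our assumptions for illustration, not the leaderboard's actual implementation.

```python
import re
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs the antlr4 Python runtime installed

# Only answers wrapped in the expected closing sentence are picked up.
FINAL_ANSWER_RE = re.compile(r"The final answer is (.+?)\.?\s*I hope it", re.DOTALL)

def extract_answer(completion: str) -> str | None:
    """Return whatever sits inside the expected phrase, or None if the format is not followed."""
    match = FINAL_ANSWER_RE.search(completion)
    return match.group(1).strip().strip("$") if match else None

def is_correct(completion: str, gold_latex: str) -> bool:
    raw = extract_answer(completion)
    if raw is None:
        return False                         # wrong format -> automatically marked incorrect
    try:
        pred = parse_latex(raw)              # string -> SymPy expression
        target = parse_latex(gold_latex)
        return simplify(pred - target) == 0  # symbolic equality check
    except Exception:
        return False                         # parsing failure -> marked incorrect
```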

However, users reported a number of issues with each of these steps.

First, a recurring problem was that some models were unable to follow the answer format expected from the examples and instead used other sentences to present their answers. Because the format was not followed, the answer was marked as wrong even when it was actually correct! (This is a particular problem if what you are interested in is "how good the model is at math".)

Model answer (excerpt) | Issue | Should have extracted | Actually extracted
 | Failed extraction | 7*sqrt(2) + 14 | None
"Therefore, the sum of the infinite geometric series is $\frac{7}{9}$." | Failed extraction | 7/9 | None
"$p(n)$ and $p(n+1)$ share common factors above 1. $\boxed{41}$" | Failed extraction | 41 | None
"$\frac{1}{9}$" | Failed extraction | 1/9 | None
"$\boxed{5}$ cars" | Failed extraction | 5 | None
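
With the hypothetical extractor sketched above, it is easy to see how such answers get lost: a completion that skips the required closing sentence simply never matches, no matter how correct it is.

```python
# Correct value, wrong wrapper: the strict format regex never fires.
print(extract_answer("Therefore, the sum of the infinite geometric series is $\\frac{7}{9}$."))
# -> None

# The same value in the expected format is extracted fine.
print(extract_answer("The final answer is $\\frac{7}{9}$. I hope it is correct."))
# -> \frac{7}{9}
```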

The next step, converting (answer) into a symbolic expression, also presented some issues, this time linked to SymPy parsing.

Model answer (excerpt) | Issue | Should have parsed as | Actually parsed as
"... $2x + 4y + z - 19 = 0$. I hope it is correct." | Parametric equation | Eq(2*x + 4*y + z - 19, 0) | 0
"(23)" | Failed extraction | 23 | None
"$(-\infty, -14) \cup (-3, \infty)$" | Interval union | (Interval.open(-oo, -14), Interval.open(-3, oo)) | None
"$\begin{pmatrix}\frac{1}{50} & \frac{7}{50} \\ \frac{7}{50} & \frac{49}{50}\end{pmatrix}$" | Matrix | ((1/50, 7/50), (7/50, 49/50)) | None
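
For reference, these are the plain SymPy objects a parser would need to produce for the last two examples above; this snippet only illustrates the expected output, it is not taken from either parser.

```python
from sympy import Interval, Matrix, Rational, Union, oo

# The interval union (-oo, -14) U (-3, oo)
interval_answer = Union(Interval.open(-oo, -14), Interval.open(-3, oo))

# The 2x2 matrix [[1/50, 7/50], [7/50, 49/50]]
matrix_answer = Matrix([[Rational(1, 50), Rational(7, 50)],
                        [Rational(7, 50), Rational(49, 50)]])
```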

In the final step, many problems also occurred when comparing the extracted answers with the target formula.

Comparison | Issue | Correct result | Old evaluator result
1 (written as a variable assignment) | No support for variable assignment | True | False
Matrix.ones == Matrix.ones | No support for matrix equivalence | True | False
{1} ∪ {1,4} == {1,4} | No support for sets | True | False
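
A robust checker therefore needs type-aware comparison logic rather than a single simplify(pred - target) == 0 test. Here is a rough sketch of what such a dispatch can look like in SymPy terms; this is our illustration, not Math-Verify's internals.

```python
from sympy import Set, simplify
from sympy.matrices import MatrixBase

def equivalent(pred, target) -> bool:
    # Sets: SymPy evaluates unions, so FiniteSet(1) | FiniteSet(1, 4) compares equal to FiniteSet(1, 4).
    if isinstance(pred, Set) and isinstance(target, Set):
        return pred == target
    # Matrices: subtract and check for the zero matrix instead of comparing against 0.
    if isinstance(pred, MatrixBase) and isinstance(target, MatrixBase):
        return pred.shape == target.shape and bool((pred - target).is_zero_matrix)
    # Plain expressions: fall back to symbolic simplification.
    return simplify(pred - target) == 0
```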

All of these issues have been completely fixed with the new Math-Verify parser!
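
As a quick taste, here is roughly what the set-equivalence case above looks like with the math_verify package's parse / verify interface (the example follows the project's README; the package name and API are assumptions to check against the current release):

```python
from math_verify import parse, verify

# Gold answer first, model answer second -- the argument order matters for verify().
gold = parse("${1,3} \\cup {2,4}$")
answer = parse("${1,2,3,4}$")

print(verify(gold, answer))  # True: the union is recognised as the same set
```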

Which models are best at math? A complete reshuffle of the leaderboard thanks to fairer grading

Since all of these problems tend to accumulate, some models suffered deeply from them, and their performance was strongly underestimated. Fixing this was as easy as removing the previous evaluator and changing only 3 lines of code. (You can even try it for your own math evals!)
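
In a generic scoring loop, that kind of swap can be as small as replacing the old extract-and-compare call with Math-Verify's parse / verify pair; a hypothetical before/after sketch (your evaluation code will look different):

```python
from math_verify import parse, verify

def score(completion: str, gold: str) -> bool:
    # Before: the strict-format pipeline sketched earlier.
    # return is_correct(completion, gold)
    # After: let Math-Verify handle extraction, parsing and comparison.
    return verify(parse(gold), parse(completion))
```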

This meant re-evaluating all models submitted since June, which completely overhauled the top 20 of the leaderboard's math subset.

The impact of the change

On average, models now solve 61 more problems, which represents an overall increase of 4.66 points!

[Figure: score change per model]

The two subsets that showed the most significant improvements were both algebra-related (Algebra and Prealgebra), with gains of 8.27 and 6.93 points respectively. In extreme cases, some models demonstrated improvements of nearly 90 points on these subsets. We believe the largest improvements appeared here because these subsets frequently contain answers presented as sets (for questions with multiple solutions) and as matrices; Math-Verify handles both answer types much better, which explains these notable gains.

[Figure: score change by subset]

Changes by model family

We first discovered the math evaluation problems when examining Qwen models, which had unusually low scores compared to their self-reported performance. After the introduction of Math-Verify, these models' scores more than doubled, showing how severely their performance had previously been underestimated.

However, the Qwen models are not alone. Another major family affected is DeepSeek: after switching to Math-Verify, DeepSeek models almost tripled their scores! This is because their answers are usually wrapped in \boxed{} notation, which the old evaluator could not extract.
[Figure: score change by model family]

MATH-Hard leaderboard changes

As mentioned at the beginning, the top 20 has undergone a major change, with NVIDIA's AceMath models now dominating the MATH-Hard leaderboard. The other major beneficiaries of this change are Qwen derivatives, which are most often the models ranking just below AceMath. Below is a complete comparison of the old and new top-20 rankings.

[Figure: MATH-Hard leaderboard, old vs. new top 20]

Changes to the overall leaderboard

Finally, we looked at how the results across the overall leaderboard have evolved. The top four remain unchanged, but the rest of the ranking has changed significantly. With the rise of several Qwen derivatives on the math subset, the presence of Qwen models in the overall top 20 has grown even further.
[Figure: overall leaderboard ranking changes]

Many other models also jumped up the rankings, gaining more than 200 places! You can explore the detailed results on the Open LLM Leaderboard.

To sum up

The introduction of Math-Verify has significantly improved the accuracy and fairness of the evaluations on the Open LLM Leaderboard. It has reshaped the leaderboard, with many models showing significant score improvements.

We encourage all developers and researchers to adopt Math-Verify for their own math evaluations, as it produces more reliable results. Also, explore the updated rankings and see how your favorite models' performance has changed.
