Large language models (LLMs) have driven significant advances in AI applications, including code generation. However, assessing their true coding ability is not easy. Existing benchmarks such as LiveCodeBench and USACO have limitations: they often lack robust private test cases, do not support problems that require special judges, and run in inconsistent execution environments. These gaps make it difficult to fairly compare LLMs' performance to that of human programmers. Reliably assessing the reasoning capabilities of LLMs requires a standardized framework tailored to real-world programming challenges.
To address these challenges, the Qwen research team introduced CodeElo, a benchmark designed to assess LLMs' competition-level coding skills using human-comparable Elo ratings. CodeElo's problems come from CodeForces, a platform known for its rigorous programming competitions. CodeElo ensures accurate evaluation by submitting generated solutions directly to the CodeForces platform, which eliminates false positives and supports problems that require a special judge. Additionally, the benchmark's Elo rating system mirrors human performance rankings, allowing for meaningful comparisons between LLMs and human participants. CodeElo thus provides a new way to measure LLM performance in competitive coding.
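A "special judge" matters because many competitive-programming problems accept more than one correct output, so comparing a submission against a single reference answer produces false negatives. The sketch below is a hypothetical illustration of such a checker (the problem statement, `check` function, and test data are invented for this example, not taken from CodeElo): it validates any answer that satisfies the problem's constraints rather than one fixed string.

```python
# Hypothetical special-judge checker. Problem: given a target and a list of
# numbers, print two distinct 1-based indices i, j whose values sum to the
# target. Many index pairs may be valid, so a string diff would be wrong.

def check(input_data: str, contestant_output: str) -> bool:
    lines = input_data.split("\n")
    target = int(lines[0])
    nums = list(map(int, lines[1].split()))

    try:
        i, j = map(int, contestant_output.split())
    except ValueError:
        return False  # malformed output counts as a wrong answer

    # Accept ANY pair that satisfies the constraints, not one fixed answer.
    return (
        1 <= i <= len(nums)
        and 1 <= j <= len(nums)
        and i != j
        and nums[i - 1] + nums[j - 1] == target
    )

# Two different outputs, both correct for the same input:
print(check("10\n3 7 5 5", "1 2"))  # True (3 + 7 == 10)
print(check("10\n3 7 5 5", "3 4"))  # True (5 + 5 == 10)
```

Submitting directly to CodeForces means the platform's own checkers handle this logic, so the benchmark does not have to reimplement or approximate it.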
Technical details and benefits
CodeElo is built on three key elements: comprehensive problem selection, a robust evaluation methodology, and a standardized rating calculation. Problems are categorized by contest division, difficulty level, and algorithm tags, enabling thorough evaluation. Submissions are tested on the CodeForces platform itself, where the special judge mechanism ensures accurate verdicts. This approach removes the need to collect hidden test cases and provides reliable feedback. The Elo rating system rewards accuracy, accounts for problem difficulty, and penalizes errors. By encouraging high-quality solutions, CodeElo provides a nuanced and effective tool for evaluating coding models.
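The core idea behind an Elo rating is that it encodes a win probability against every other contestant. Below is a minimal sketch of how such a rating can be estimated, using the standard Elo expectation formula: given the ratings of a contest's human participants and the model's final rank, binary-search for the rating at which the model's expected rank equals its actual rank. The function names and toy numbers are assumptions for illustration; CodeElo's exact calculation follows the CodeForces rating system and may differ in detail.

```python
def expected_rank(r: float, opponent_ratings: list[float]) -> float:
    """Expected rank of a contestant rated r: 1 plus the expected number of
    opponents who finish ahead, under the standard Elo win-probability model."""
    return 1.0 + sum(1.0 / (1.0 + 10 ** ((r - r_i) / 400.0))
                     for r_i in opponent_ratings)

def rating_from_rank(actual_rank: int, opponent_ratings: list[float]) -> float:
    """Binary-search the rating whose expected rank matches the actual rank.
    Expected rank decreases monotonically as the rating increases."""
    lo, hi = 0.0, 4000.0
    for _ in range(60):  # 60 halvings give far more than enough precision
        mid = (lo + hi) / 2
        if expected_rank(mid, opponent_ratings) > actual_rank:
            lo = mid  # expected rank too high -> true rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2

# Toy contest: five human participants; the model placed 2nd.
humans = [1800.0, 1500.0, 1400.0, 1300.0, 1200.0]
print(round(rating_from_rank(2, humans)))  # a rating near the top of this field
```

Because the rating is derived from rank against real human fields, a model's score lands on the same scale as CodeForces users, which is what makes claims like "outperforms N% of human participants" meaningful.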
Results and insights
The researchers tested CodeElo on 30 open-source LLMs and 3 proprietary LLMs and gained valuable insights. OpenAI's o1-mini model performed best, achieving an Elo rating of 1578 and outperforming 90% of human participants. Among the open-source models, QwQ-32B-Preview showed the strongest performance with a rating of 1261. However, many models struggled even with simple problems, often ranking in the bottom 20% of human participants. The analysis shows that models perform well in categories such as math and implementation but have more difficulty with dynamic programming and tree algorithms. Additionally, models perform better when coding in C++, a common preference among competitive programmers. These results highlight areas where LLMs need improvement.

Conclusion
CodeElo is an important step toward assessing LLMs' coding abilities. By addressing the limitations of previous benchmarks, it provides a reliable and standardized framework for evaluating competition-level code generation. Insights from CodeElo not only reveal the strengths and weaknesses of current models but also inform future developments in AI-driven code generation. As AI continues to evolve, benchmarks like CodeElo will be essential to ensuring that LLMs can effectively address real-world programming challenges.
