Large language models (LLMs) have driven significant advances in AI applications, including code generation. However, assessing their true coding ability is not easy. Existing benchmarks such as LiveCodeBench and USACO have limitations: they often lack robust private test cases, do not support problems that require special judges, and run in inconsistent execution environments. These gaps make it difficult to fairly compare LLMs' performance to that of human programmers. Reliably assessing the reasoning capabilities of LLMs requires a standardized framework tailored to real-world programming challenges.
To address these challenges, the Qwen research team introduced CodeElo, a benchmark designed to assess LLMs' competition-level coding skills using human-comparable Elo ratings. CodeElo's problems come from CodeForces, a platform known for its rigorous programming competitions. CodeElo ensures accurate evaluation by submitting generated solutions directly to the CodeForces platform, which eliminates false positives and supports problems that require a special judge. Additionally, the benchmark's Elo rating system mirrors human performance rankings, allowing for meaningful comparisons between LLMs and human participants. CodeElo thus provides a new way to measure LLM performance in competitive coding.
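A "special judge" matters because many competitive-programming problems accept more than one correct output, so comparing a submission against a single reference answer produces false negatives. The sketch below is a hypothetical illustration of such a checker (the problem statement, `check` function, and test data are invented for this example, not taken from CodeElo): it validates any answer that satisfies the problem's constraints rather than one fixed string.

```python
# Hypothetical special-judge checker. Problem: given a target and a list of
# numbers, print two distinct 1-based indices i, j whose values sum to the
# target. Many index pairs may be valid, so a string diff would be wrong.

def check(input_data: str, contestant_output: str) -> bool:
    lines = input_data.split("\n")
    target = int(lines[0])
    nums = list(map(int, lines[1].split()))

    try:
        i, j = map(int, contestant_output.split())
    except ValueError:
        return False  # malformed output counts as a wrong answer

    # Accept ANY pair that satisfies the constraints, not one fixed answer.
    return (
        1 <= i <= len(nums)
        and 1 <= j <= len(nums)
        and i != j
        and nums[i - 1] + nums[j - 1] == target
    )

# Two different outputs, both correct for the same input:
print(check("10\n3 7 5 5", "1 2"))  # True (3 + 7 == 10)
print(check("10\n3 7 5 5", "3 4"))  # True (5 + 5 == 10)
```

Submitting directly to CodeForces means the platform's own checkers handle this logic, so the benchmark does not have to reimplement or approximate it.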
Technical details and benefits
CodeElo is built on three key elements: comprehensive problem selection, a robust evaluation methodology, and a standardized rating calculation. Problems are categorized by contest division, difficulty level, and algorithm tags, enabling thorough evaluation. Submissions are tested on the CodeForces platform itself, where the special judge mechanism ensures accurate verdicts. This approach removes the need to collect hidden test cases and provides reliable feedback. The Elo rating system rewards accuracy, accounts for problem difficulty, and penalizes errors. By encouraging high-quality solutions, CodeElo provides a nuanced and effective tool for evaluating coding models.
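The core idea behind an Elo rating is that it encodes a win probability against every other contestant. Below is a minimal sketch of how such a rating can be estimated, using the standard Elo expectation formula: given the ratings of a contest's human participants and the model's final rank, binary-search for the rating at which the model's expected rank equals its actual rank. The function names and toy numbers are assumptions for illustration; CodeElo's exact calculation follows the CodeForces rating system and may differ in detail.

```python
def expected_rank(r: float, opponent_ratings: list[float]) -> float:
    """Expected rank of a contestant rated r: 1 plus the expected number of
    opponents who finish ahead, under the standard Elo win-probability model."""
    return 1.0 + sum(1.0 / (1.0 + 10 ** ((r - r_i) / 400.0))
                     for r_i in opponent_ratings)

def rating_from_rank(actual_rank: int, opponent_ratings: list[float]) -> float:
    """Binary-search the rating whose expected rank matches the actual rank.
    Expected rank decreases monotonically as the rating increases."""
    lo, hi = 0.0, 4000.0
    for _ in range(60):  # 60 halvings give far more than enough precision
        mid = (lo + hi) / 2
        if expected_rank(mid, opponent_ratings) > actual_rank:
            lo = mid  # expected rank too high -> true rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2

# Toy contest: five human participants; the model placed 2nd.
humans = [1800.0, 1500.0, 1400.0, 1300.0, 1200.0]
print(round(rating_from_rank(2, humans)))  # a rating near the top of this field
```

Because the rating is derived from rank against real human fields, a model's score lands on the same scale as CodeForces users, which is what makes claims like "outperforms N% of human participants" meaningful.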
Results and insights
The researchers tested CodeElo on 30 open-source LLMs and 3 proprietary LLMs and gained valuable insights. OpenAI's o1-mini model performed best, achieving an Elo rating of 1578 and outperforming 90% of human participants. Among the open-source models, QwQ-32B-Preview showed the strongest performance with a rating of 1261. However, many models struggled even with simple problems, often ranking in the bottom 20% of human participants. The analysis shows that models perform well in categories such as math and implementation but have more difficulty with dynamic programming and tree algorithms. Additionally, models perform better when coding in C++, a common preference among competitive programmers. These results highlight areas where LLMs need improvement.

Conclusion
CodeElo is an important step toward assessing LLMs' coding abilities. By addressing the limitations of previous benchmarks, it provides a reliable and standardized framework for evaluating competition-level code generation. Insights from CodeElo not only reveal the strengths and weaknesses of current models but also inform future developments in AI-driven code generation. As AI continues to evolve, benchmarks like CodeElo will be essential to ensuring that LLMs can effectively address real-world programming challenges.
