Introducing the LiveCodeBench Leaderboard – Overall and Contaminated Assessment of Code LLM

By versatileai · June 11, 2025

We are excited to introduce the LiveCodeBench leaderboard, built on LiveCodeBench, a new benchmark developed by researchers at UC Berkeley, MIT, and Cornell for measuring the code generation capabilities of LLMs.

LiveCodeBench continuously collects new coding problems from various coding contest platforms and annotates each problem with its release date. These annotations are used to evaluate models on problem sets released in different time windows, enabling an "evaluation over time" strategy that helps detect and prevent contamination. Beyond the usual code generation task, LiveCodeBench also evaluates self-repair, test output prediction, and code execution, providing a more holistic view of the coding capabilities needed by next-generation AI programming agents.

LiveCodeBench Scenarios and Evaluations

LiveCodeBench problems are curated from competitive programming platforms (LeetCode, AtCoder, and Codeforces). These websites regularly host contests containing problems that test participants' coding and problem-solving skills. Each problem consists of a natural language problem statement and example input/output pairs, and the goal is to write a program that passes a set of hidden tests. Because thousands of participants attempt these contests, the problems are vetted for clarity and correctness.
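
To make the problem format concrete, a curated problem might be represented roughly like this; the field names are illustrative, not LiveCodeBench's actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ContestProblem:
    """One curated problem; field names are illustrative, not LiveCodeBench's schema."""
    platform: str                         # e.g. "LeetCode", "AtCoder", "Codeforces"
    statement: str                        # natural language problem description
    public_tests: list[tuple[str, str]]   # sample (input, output) pairs shown to the model
    hidden_tests: list[tuple[str, str]]   # held-out tests used to judge correctness
    release_date: date                    # contest date, used for evaluation over time
```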

LiveCodeBench uses the collected problems to build four coding scenarios:

  • Code generation. The model is given a problem statement containing a natural language description and sample tests (input/output pairs) and is tasked with generating a correct solution. Evaluation is based on the functional correctness of the generated code, determined using a set of test cases (a minimal sketch of this check follows the list).
  • Self-repair. As in the code generation scenario, the model is given a problem statement and generates a candidate program. If the program is wrong, the model is given error feedback (either an exception message or a failing test case) and must generate a fix. Evaluation uses the same functional-correctness check as above.
  • Code execution. The model is given a program snippet consisting of a function f along with a test input, and must predict the program's output on that input. Evaluation is execution-based: if the assertion assert f(input) == generated_output passes, the model's output is considered correct.
  • Test output prediction. The model is given a problem statement along with a test-case input, and is tasked with generating the expected output for that input. The output is produced from the problem statement alone, without an implementation of the function, and is evaluated with an exact-match checker.
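
As a concrete illustration of the functional-correctness check used in the code generation and self-repair scenarios, here is a minimal sketch of a judge that runs a candidate program against the hidden tests. It is a simplification (no sandboxing, plain stdout comparison), not LiveCodeBench's actual harness:

```python
import subprocess

def passes_hidden_tests(candidate_source: str,
                        hidden_tests: list[tuple[str, str]],
                        timeout: float = 5.0) -> bool:
    """Run the candidate program on each hidden test input and compare
    its stdout to the expected output. Returns True only if every test passes."""
    for test_input, expected in hidden_tests:
        try:
            result = subprocess.run(
                ["python", "-c", candidate_source],
                input=test_input, capture_output=True,
                text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # treat timeouts as failures
        if result.stdout.strip() != expected.strip():
            return False
    return True
```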

For each scenario, evaluation uses the Pass@1 metric: the ratio of correct answers to total attempts, capturing the probability of generating a correct answer in a single try.
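
In code, the metric reduces to a simple average, as in this minimal sketch (the function name and input layout are our own, not LiveCodeBench's API):

```python
def pass_at_1(trial_outcomes: list[list[bool]]) -> float:
    """Pass@1 averaged over problems and trials.

    trial_outcomes[i] holds the pass/fail outcome of each independent trial
    on problem i. The per-problem mean estimates the probability that a
    single generation is correct; the outer mean averages over problems."""
    per_problem = [sum(t) / len(t) for t in trial_outcomes if t]
    return sum(per_problem) / len(per_problem) if per_problem else 0.0
```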

Preventing benchmark contamination

Contamination is one of the major bottlenecks in current LLM evaluation. Even within LLM coding evaluations, there have been reports of contamination and overfitting on standard benchmarks such as HumanEval ([1] and [2]).

For this reason, LiveCodeBench annotates each problem with its release date. This way, for a new model with training cutoff date D, we can compute scores on problems released after D, measuring generalization to unseen problems.

LiveCodeBench formalizes this with a "scroll over time" feature that lets you select problems within a specific time window. You can try it out on the leaderboard above!
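
A minimal sketch of how the release-date annotations support this, reusing the illustrative ContestProblem record from above; scoring is restricted to problems released after the model's training cutoff:

```python
from datetime import date

def score_after_cutoff(problems: list, results: list[bool], cutoff: date) -> float:
    """Evaluate only on problems released after the model's training cutoff.

    problems[i].release_date is the contest-date annotation and results[i]
    the pass/fail outcome on problem i; restricting to release_date > cutoff
    measures generalization to problems the model cannot have seen in training."""
    window = [r for p, r in zip(problems, results) if p.release_date > cutoff]
    return sum(window) / len(window) if window else float("nan")
```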

Findings

We find that:

  • While model performance is correlated across scenarios, relative performance and ordering can differ between the four scenarios. For example, GPT-4-Turbo's margin grows on the self-repair task, highlighting its ability to make use of compiler feedback.
  • Claude-3-Opus overtakes GPT-4-Turbo in the test output prediction scenario, highlighting its stronger natural language reasoning capabilities.
  • Mistral-Large performs considerably well on natural language reasoning tasks such as test output prediction and code execution.

(Figure: model performance across the four scenarios.)

How to submit

You can follow these steps to evaluate your code model on LiveCodeBench:

Environment setup: Create a new environment using conda and install LiveCodeBench:

```
git clone https://github.com/livecodebench/livecodebench.git
cd livecodebench
pip install poetry
poetry install
```

Inference and evaluation: To evaluate a new Hugging Face model, run

```
python -m lcb_runner.runner.main --model {model_name} --scenario {scenario_name}
```

for the different scenarios. We have implemented an extensible framework for adding new model families; you can support a new model by modifying lcb_runner/lm_styles.py and lcb_runner/prompts as described in the GitHub README.
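
As a rough, hypothetical illustration of what registering a new model involves (the real entry format lives in lcb_runner/lm_styles.py; the names below are invented, so consult the GitHub README for the actual API):

```python
# Hypothetical sketch only: the real entry format is defined in
# lcb_runner/lm_styles.py; consult the repository README for the actual API.
from dataclasses import dataclass

@dataclass
class ModelEntry:
    model_name: str    # Hugging Face repo id used for loading
    display_name: str  # name shown on the leaderboard
    prompt_style: str  # which prompt template in lcb_runner/prompts to use

MY_MODELS = [
    ModelEntry("my-org/my-code-model", "MyCodeModel-7B", "chat"),
]
```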

Once the results are generated, you can fill out this form to submit them.

How to contribute

Finally, we are looking for LiveCodeBench collaborators and suggestions. The dataset and code are available online, so please reach out by email with any questions or ideas.
