Introducing the LiveCodeBench Leaderboard – Overall and Contaminated Assessment of Code LLM

By versatileai · June 11, 2025

We are excited to introduce the LiveCodeBench leaderboard, built on LiveCodeBench, a new benchmark developed by researchers at UC Berkeley, MIT, and Cornell for measuring the code generation capabilities of LLMs.

LiveCodeBench continuously collects new coding problems from various coding contest platforms and annotates each problem with its release date. These annotations are used to evaluate models on problem sets released in different time windows, enabling an "evaluation over time" strategy that helps detect and prevent contamination. Beyond the usual code generation task, LiveCodeBench also evaluates self-repair, test output prediction, and code execution, providing a more holistic view of the coding capabilities needed by next-generation AI programming agents.

LiveCodeBench Scenarios and Evaluations

LiveCodeBench problems are curated from competitive programming platforms (LeetCode, AtCoder, and Codeforces). These websites regularly host contests containing problems that test participants' coding and problem-solving skills. Each problem consists of a natural language problem statement and example input/output pairs, and the goal is to write a program that passes a set of hidden tests. Because thousands of participants attempt these contests, the problems are vetted for clarity and correctness.
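
To make the problem format concrete, a curated problem might be represented roughly like this; the field names are illustrative, not LiveCodeBench's actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ContestProblem:
    """One curated problem; field names are illustrative, not LiveCodeBench's schema."""
    platform: str                         # e.g. "LeetCode", "AtCoder", "Codeforces"
    statement: str                        # natural language problem description
    public_tests: list[tuple[str, str]]   # sample (input, output) pairs shown to the model
    hidden_tests: list[tuple[str, str]]   # held-out tests used to judge correctness
    release_date: date                    # contest date, used for evaluation over time
```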

LiveCodeBench uses the collected problems to build four coding scenarios:

  • Code generation. The model is given a problem statement containing a natural language description and sample tests (input/output pairs) and is tasked with generating a correct solution. Evaluation is based on the functional correctness of the generated code, determined using a set of test cases (a minimal sketch of this check follows the list).
  • Self-repair. As in the code generation scenario, the model is given a problem statement and generates a candidate program. If the program is wrong, the model is given error feedback (either an exception message or a failing test case) and must generate a fix. Evaluation uses the same functional-correctness check as above.
  • Code execution. The model is given a program snippet consisting of a function f along with a test input, and must predict the program's output on that input. Evaluation is execution-based: if the assertion assert f(input) == generated_output passes, the model's output is considered correct.
  • Test output prediction. The model is given a problem statement along with a test-case input, and is tasked with generating the expected output for that input. The output is produced from the problem statement alone, without an implementation of the function, and is evaluated with an exact-match checker.
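
As a concrete illustration of the functional-correctness check used in the code generation and self-repair scenarios, here is a minimal sketch of a judge that runs a candidate program against the hidden tests. It is a simplification (no sandboxing, plain stdout comparison), not LiveCodeBench's actual harness:

```python
import subprocess

def passes_hidden_tests(candidate_source: str,
                        hidden_tests: list[tuple[str, str]],
                        timeout: float = 5.0) -> bool:
    """Run the candidate program on each hidden test input and compare
    its stdout to the expected output. Returns True only if every test passes."""
    for test_input, expected in hidden_tests:
        try:
            result = subprocess.run(
                ["python", "-c", candidate_source],
                input=test_input, capture_output=True,
                text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # treat timeouts as failures
        if result.stdout.strip() != expected.strip():
            return False
    return True
```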

For each scenario, evaluation uses the Pass@1 metric: the ratio of correct answers to total attempts, capturing the probability of generating a correct answer in a single try.
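
In code, the metric reduces to a simple average, as in this minimal sketch (the function name and input layout are our own, not LiveCodeBench's API):

```python
def pass_at_1(trial_outcomes: list[list[bool]]) -> float:
    """Pass@1 averaged over problems and trials.

    trial_outcomes[i] holds the pass/fail outcome of each independent trial
    on problem i. The per-problem mean estimates the probability that a
    single generation is correct; the outer mean averages over problems."""
    per_problem = [sum(t) / len(t) for t in trial_outcomes if t]
    return sum(per_problem) / len(per_problem) if per_problem else 0.0
```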

Preventing benchmark contamination

Contamination is one of the major bottlenecks in current LLM evaluation. Even within LLM coding evaluations, there have been reports of contamination and overfitting on standard benchmarks such as HumanEval ([1] and [2]).

For this reason, LiveCodeBench annotates each problem with its release date. This way, for a new model with training cutoff date D, we can compute scores on problems released after D, measuring generalization to unseen problems.

LiveCodeBench formalizes this with a "scroll over time" feature that lets you select problems within a specific time window. You can try it out on the leaderboard above!
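
A minimal sketch of how the release-date annotations support this, reusing the illustrative ContestProblem record from above; scoring is restricted to problems released after the model's training cutoff:

```python
from datetime import date

def score_after_cutoff(problems: list, results: list[bool], cutoff: date) -> float:
    """Evaluate only on problems released after the model's training cutoff.

    problems[i].release_date is the contest-date annotation and results[i]
    the pass/fail outcome on problem i; restricting to release_date > cutoff
    measures generalization to problems the model cannot have seen in training."""
    window = [r for p, r in zip(problems, results) if p.release_date > cutoff]
    return sum(window) / len(window) if window else float("nan")
```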

Findings

We find that:

  • While model performance is correlated across scenarios, relative performance and ordering can differ between the four scenarios. For example, GPT-4-Turbo's margin grows on the self-repair task, highlighting its ability to make use of compiler feedback.
  • Claude-3-Opus overtakes GPT-4-Turbo in the test output prediction scenario, highlighting its stronger natural language reasoning capabilities.
  • Mistral-Large performs considerably well on natural language reasoning tasks such as test output prediction and code execution.

(Figure: model performance across the four scenarios.)

How to submit

You can follow these steps to evaluate your code model on LiveCodeBench:

Environment setup: Create a new environment using conda and install LiveCodeBench:

```
git clone https://github.com/livecodebench/livecodebench.git
cd livecodebench
pip install poetry
poetry install
```

Inference and evaluation: To evaluate a new Hugging Face model, run

```
python -m lcb_runner.runner.main --model {model_name} --scenario {scenario_name}
```

for the different scenarios. We have implemented an extensible framework for adding new model families; you can support a new model by modifying lcb_runner/lm_styles.py and lcb_runner/prompts as described in the GitHub README.
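
As a rough, hypothetical illustration of what registering a new model involves (the real entry format lives in lcb_runner/lm_styles.py; the names below are invented, so consult the GitHub README for the actual API):

```python
# Hypothetical sketch only: the real entry format is defined in
# lcb_runner/lm_styles.py; consult the repository README for the actual API.
from dataclasses import dataclass

@dataclass
class ModelEntry:
    model_name: str    # Hugging Face repo id used for loading
    display_name: str  # name shown on the leaderboard
    prompt_style: str  # which prompt template in lcb_runner/prompts to use

MY_MODELS = [
    ModelEntry("my-org/my-code-model", "MyCodeModel-7B", "chat"),
]
```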

Once the results are generated, you can fill out this form to submit them.

How to contribute

Finally, we are looking for LiveCodeBench collaborators and suggestions. The dataset and code are available online, so please reach out by email with any questions or ideas.
