LLM-as-a-Judge has emerged as a popular way to score natural language output from LLM applications, but how do you know which model will make the best judge?
We’re excited to launch Judge Arena, a platform that makes it easy for anyone to compare models side-by-side as judges. Simply run the judges through your test samples and vote on which judge you agree with the most. The results are compiled into a leaderboard displaying the best judges.
Crowdsourced, randomized battles have proven effective for benchmarking LLMs: LMSYS's Chatbot Arena has gathered over 2 million votes and has become a leading resource for identifying the best language models. Because LLM evaluations aim to capture human preferences, direct human feedback is also key to determining which AI judges are most helpful.
How it works
1. Choose a sample for evaluation: let the system randomly generate a 👩 user input / 🤖 AI response pair, or enter your own custom sample.
2. Two LLM judges will:
   - score the response
   - critique the response, explaining the reasoning behind the score
   (A minimal sketch of this scoring step appears at the end of this section.)
3. Review the evaluations from both judges and vote for the one that best matches your own judgment.
(We recommend checking the scores before comparing the critiques.)
After each vote, you can:
- Regenerate judges: get new evaluations of the same sample
- 🎲 Start a new round: randomly generate a new sample to evaluate, or enter a new custom sample
To avoid potential bias and abuse, model names will only be published after votes have been submitted.
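To make the scoring and critique step above concrete, here is a minimal sketch of what a single LLM-as-a-Judge call can look like. The prompt, model choice, and score parsing are simplified assumptions for illustration, not Judge Arena's actual implementation; the sketch assumes the `openai` Python client and an `OPENAI_API_KEY` in the environment.

```python
# Illustrative LLM-as-a-Judge call: one judge scores and critiques a response.
# NOTE: the prompt, default model, and parsing are assumptions for illustration.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Evaluate the AI response to the user input.
Rate the response on a 1-5 scale for overall helpfulness and accuracy.
First write a short critique, then end with a line of the form: Score: <1-5>

User input:
{user_input}

AI response:
{ai_response}
"""

def judge(user_input: str, ai_response: str, judge_model: str = "gpt-4o"):
    """Ask one judge model for a critique and a 1-5 score."""
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_input=user_input, ai_response=ai_response)}],
    )
    text = completion.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", text)  # pull the numeric score out of the critique
    score = int(match.group(1)) if match else None
    return score, text

score, critique = judge(
    "What is the capital of Australia?",
    "The capital of Australia is Sydney.",
)
print(score, critique)
```

In the Arena, two such judges evaluate the same sample side by side, and you vote for the judgment you agree with more.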
Selected Models
Judge Arena focuses on the LLM-as-a-Judge approach, so it only includes generative models (classifier models that output only a score are excluded). Our selection criteria for AI judges are as follows:
- The model must be able to effectively score and critique the outputs of other models.
- The model must be promptable to evaluate against different criteria and in different scoring formats.
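As an illustration of what "different criteria and scoring formats" can mean in practice, here is a hedged sketch of a parameterized judge prompt; the template text, criteria names, and formats are hypothetical examples, not Judge Arena's actual prompts.

```python
# Hypothetical judge prompt template: the same generative judge can be reused
# with different evaluation criteria and different scoring formats.
JUDGE_TEMPLATE = """You are an impartial judge.
Evaluate the AI response against this criterion: {criterion}
Return your verdict in this format: {scoring_format}

User input:
{user_input}

AI response:
{ai_response}
"""

# Different criteria the judge could be asked to apply ...
criteria = ["factual accuracy", "helpfulness", "conciseness"]
# ... and different scoring formats (Likert scale, binary pass/fail, etc.)
scoring_formats = ["an integer score from 1 to 5", "a single word: PASS or FAIL"]

prompt = JUDGE_TEMPLATE.format(
    criterion=criteria[0],
    scoring_format=scoring_formats[1],
    user_input="Summarize the article in one sentence.",
    ai_response="The article argues that remote work improves productivity.",
)
print(prompt)
```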
We have selected 18 cutting-edge LLMs for our leaderboard. Many are open-source models with public weights, and we also include proprietary API models to enable a direct comparison between open and closed approaches.
- OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo)
- Anthropic (Claude 3.5 Sonnet / Haiku, Claude 3 Opus / Sonnet / Haiku)
- Meta (Llama 3.1 Instruct Turbo 405B / 70B / 8B)
- Alibaba (Qwen 2.5 Instruct Turbo 7B / 72B, Qwen 2 Instruct 72B)
- Google (Gemma 2 9B / 27B)
- Mistral (Instruct v0.3 7B, Instruct v0.1 7B)
This list covers the models most commonly used in AI evaluation pipelines today, and we look forward to adding more if the community finds the leaderboard useful.
The Leaderboard
Votes collected from Judge Arena will be tallied and displayed on a dedicated public leaderboard. We calculate an Elo score for each model and update the leaderboard every hour.
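For reference, here is a minimal sketch of a standard Elo update applied to a single pairwise vote. The K-factor and starting rating are common defaults assumed for illustration; the leaderboard's exact parameters and implementation may differ.

```python
# Minimal Elo update for one pairwise vote between two judges.
# K and INITIAL are assumed defaults for illustration only.
K = 32          # update step size
INITIAL = 1000  # starting rating for every judge

def expected(rating_a: float, rating_b: float) -> float:
    """Expected probability that judge A is preferred over judge B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Update both ratings after one vote (a_won=True means A got the vote)."""
    e_a = expected(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + K * (score_a - e_a)
    new_b = rating_b + K * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: two judges start equal; judge A wins the vote.
a, b = INITIAL, INITIAL
a, b = update(a, b, a_won=True)
print(round(a), round(b))  # 1016 984
```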
Early Insights
These are just very early results, but here’s what we’ve observed so far:
- A mix of top performers across proprietary and open source: GPT-4 Turbo holds a narrow lead, but the Llama and Qwen models are highly competitive and outperform most of the proprietary models.
- Smaller models punch above their weight: Qwen 2.5 7B and Llama 3.1 8B are performing remarkably well and competing with much larger models. As we collect more data, we hope to better understand the relationship between model size and judging ability.
- Preliminary empirical support for recent research: the LLM-as-a-Judge literature suggests that Llama models are well suited as base judges, showing strong out-of-the-box performance on evaluation benchmarks. Several approaches, including Lynx, Auto-J, and SFR-LLaMA-3.1-Judge, chose to start from Llama models before post-training them for evaluation. Our preliminary results are consistent with this trend: Llama 3.1 70B and 405B rank 2nd and 3rd, respectively.
We look forward to sharing further analysis of the results on our blog as the leaderboard takes shape over the coming weeks.
How to contribute
We hope Judge Arena is a helpful resource for the community. By casting your votes, you can help developers decide which models to use in their evaluation pipelines. We are committed to sharing 20% of the anonymized voting data over the next few months, and we hope developers, researchers, and users will leverage these findings to build more aligned evaluators.
We would love to hear your thoughts. For general feature requests, or to submit or suggest new models to add to the Arena, open a discussion in the community tab or talk to us on Discord. If you have any questions or suggestions, feel free to send us a message on X/Twitter.
Atla is currently funding this out of its own pocket. We are looking for API credits (with no strings attached) to support this community effort – if you’re interested in collaborating, please contact us at support@atla-ai.com 🤗
Credits
Thank you to everyone who helped test this arena, and a shout-out to the LMSYS team for the inspiration. Special thanks to Clémentine Fourrier and the Hugging Face team for making this possible.