Judge Arena: Benchmarking LLMs as Judges

November 24, 2024

LLM-as-a-Judge has emerged as a popular way to score natural language output from LLM applications, but how do you know which model will make the best judge?

We're excited to launch Judge Arena, a platform that makes it easy for anyone to compare models side-by-side as judges. Simply run the judges through your test samples and vote on which judge you agree with the most. The results are compiled into a leaderboard displaying the best judges.
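To make this concrete, here is a minimal sketch of what a single LLM-as-a-Judge call can look like. It assumes the openai Python client; the prompt template and the 1-5 rubric are our own illustrative example, not Judge Arena's actual prompts.

```python
# Minimal LLM-as-a-Judge sketch (illustrative only; not Judge Arena's actual prompt).
# Assumes the openai Python client (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Score the AI response to the user input
on a scale of 1-5 for helpfulness and accuracy, then explain your reasoning.

User input:
{user_input}

AI response:
{ai_response}

Reply in the format:
Score: <1-5>
Critique: <one short paragraph>"""


def judge(user_input: str, ai_response: str, model: str = "gpt-4o") -> str:
    """Ask a judge model to score and critique a single sample."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(user_input=user_input, ai_response=ai_response),
        }],
    )
    return completion.choices[0].message.content


print(judge("What is 2 + 2?", "2 + 2 equals 5."))
```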


Crowdsourced, randomized battles have proven effective for benchmarking LLMs. LMSys' Chatbot Arena has garnered over 2 million votes and is widely regarded as a field test for identifying the best language models. Because LLM evaluations aim to capture human preferences, direct human feedback is also key to determining which AI judges will be most helpful.

How it works

  • Select a sample for evaluation: let the system randomly generate a 👩 user input / 🤖 AI response pair, or enter your own custom sample.
  • Two LLM judges will each score the response and explain the reasoning behind their score.
  • Review both judges' evaluations and vote for the one that best matches your own judgment.

(We recommend checking the scores first before comparing critiques)

Every time you vote, you can:

  • Regenerate judges: get a new evaluation of the same sample.
  • 🎲 Start a new round: randomly generate a new sample to evaluate, or enter a new custom sample.

To avoid potential bias and abuse, model names are revealed only after a vote has been submitted.
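Putting those steps together, one anonymized battle round can be sketched roughly as below. The get_evaluation and get_user_vote helpers are hypothetical stand-ins for the model calls and the voting UI; this illustrates the flow rather than the site's actual implementation.

```python
# Sketch of a Judge Arena-style battle round (hypothetical helpers, not the site's code).
import random

def battle_round(sample, judges, get_evaluation, get_user_vote):
    """Run one anonymized pairwise battle and return (winner, loser) model names."""
    judge_a, judge_b = random.sample(judges, 2)      # pick two judges at random
    eval_a = get_evaluation(judge_a, sample)         # each judge scores and critiques
    eval_b = get_evaluation(judge_b, sample)
    # Model names stay hidden: the voter only sees "Judge A" and "Judge B".
    choice = get_user_vote({"Judge A": eval_a, "Judge B": eval_b})
    if choice == "Judge A":
        return judge_a, judge_b                      # names are revealed only after the vote
    return judge_b, judge_a
```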

Selected models

Judge Arena focuses on the LLM-as-a-Judge approach, so it includes only generative models (classifier models that output only scores are excluded). Our selection criteria for AI judges are as follows:

Models must be able to effectively score and critique the output of other models, and they must be able to evaluate against different criteria and in different scoring formats.
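As an illustration of the second criterion, the same judge model can be prompted with different criteria and scoring formats simply by swapping the rubric. The rubric strings below are assumptions made for the example, not Judge Arena's templates.

```python
# Illustrative rubrics: one Likert-style criterion and one binary criterion.
RUBRICS = {
    "likert_helpfulness": "Rate the response's helpfulness on a 1-5 scale.",
    "binary_factuality": "Answer PASS if the response is factually correct, otherwise FAIL.",
}

def build_judge_prompt(criterion: str, user_input: str, ai_response: str) -> str:
    """Compose a judge prompt for the requested criterion / scoring format."""
    return (
        f"You are an impartial judge. {RUBRICS[criterion]}\n"
        "Explain your reasoning before giving the final verdict.\n\n"
        f"User input:\n{user_input}\n\n"
        f"AI response:\n{ai_response}"
    )
```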

We have selected 18 cutting-edge LLMs for our leaderboard. Many are open-source models with public weights, but we also include proprietary API models to allow direct comparison of open and closed approaches:

  • OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo)
  • Anthropic (Claude 3.5 Sonnet / Haiku, Claude 3 Opus / Sonnet / Haiku)
  • Meta (Llama 3.1 Instruct Turbo 405B / 70B / 8B)
  • Alibaba (Qwen 2.5 Instruct Turbo 7B / 72B, Qwen 2 Instruct 72B)
  • Google (Gemma 2 9B / 27B)
  • Mistral (Instruct v0.3 7B, Instruct v0.1 7B)

The current list represents the models most commonly used in AI evaluation pipelines. If the community finds the leaderboard useful, we look forward to adding more models over time.

Leaderboard

Votes collected from Judge Arena are tallied and displayed on a dedicated public leaderboard. We calculate an Elo score for each model and update the leaderboard every hour.
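For reference, the vote-by-vote Elo update can be sketched as follows. The K-factor and the 1000-point starting rating are common defaults and are assumptions here, not necessarily the values Judge Arena uses.

```python
# Minimal Elo update over pairwise votes (standard formula; constants are assumptions).
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Update both judges' ratings in place after a single vote."""
    r_w = ratings.get(winner, 1000.0)
    r_l = ratings.get(loser, 1000.0)
    e_w = expected_score(r_w, r_l)          # winner's expected score before the vote
    ratings[winner] = r_w + k * (1 - e_w)   # winner gains what it "underperformed" by
    ratings[loser] = r_l - k * (1 - e_w)    # loser loses the same amount

# Example: tally a small batch of (winner, loser) votes collected in the last hour.
ratings: dict = {}
for win, lose in [("gpt-4-turbo", "gemma-2-9b"), ("llama-3.1-70b", "gpt-3.5-turbo")]:
    update_elo(ratings, win, lose)
print(ratings)
```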

Early insights

These are just very early results, but here's what we've observed so far:

  • Proprietary and open-source models both perform at the top: GPT-4 Turbo holds a narrow lead, but the Llama and Qwen models are very competitive and outperform most of the proprietary models.
  • Smaller models perform well: Qwen 2.5 7B and Llama 3.1 8B are doing very well, competing with much larger models. As we collect more data, we hope to better understand the relationship between model size and judging ability.
  • Preliminary empirical support for recent research: the LLM-as-a-Judge literature suggests that Llama models are well suited as base judges, showing strong out-of-the-box performance on evaluation benchmarks. Several approaches, including Lynx, Auto-J, and SFR-LLaMA-3.1-Judge, chose to start from Llama models before post-training for evaluation. Our preliminary results are consistent with this trend: Llama 3.1 70B and 405B currently rank 2nd and 3rd, respectively.

We look forward to sharing further analysis of the results on our blog as the leaderboards take shape in the coming weeks.

How to contribute

We hope Judge Arena is a helpful resource for the community. By contributing to this leaderboard, you can help developers decide which models to use in their evaluation pipelines. We hope that developers, researchers, and users will leverage our findings to build better evaluators, and we are committed to sharing 20% of the anonymized voting data over the coming months.

We would love to hear your thoughts. For general feature requests or to submit/suggest new models to add to Arena, open a discussion in the community tab or talk to us on Discord. If you have any questions or suggestions, feel free to send us a message on X/Twitter.

Atla is currently funding this out of its own pocket. We're looking for API credits (unconditional) to support this community effort – if you're interested in collaborating, please contact us at support@atla-ai.com 🤗

Credits

Thank you to everyone who helped test this arena, and a shout-out to the LMSYS team for the inspiration. Special thanks to Clémentine Fourrier and the Hugging Face team for making this possible.
