FACTS Grounding: A new benchmark for assessing the factuality of large language models


Responsibility and safety

Published December 17, 2024

FACTS team

Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations.

Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can “hallucinate” false information, particularly when given complex inputs. In turn, this can erode trust in LLMs and limit their applications in the real world.

Today, we are introducing FACTS Grounding, a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.

We hope our benchmark will spur industry-wide progress on factuality and grounding. To track progress, we are also launching the FACTS leaderboard on Kaggle. We have already tested leading LLMs using FACTS Grounding and populated the initial leaderboard with their grounding scores, and we will maintain and update the leaderboard as the field advances.

Current leaderboard ranking

FACTS Grounding dataset

To accurately evaluate the factuality and grounding of any given LLM, the FACTS Grounding dataset comprises 1,719 examples, each carefully crafted to require long-form responses grounded in the context document provided. Each example comprises a document, a system instruction requiring the LLM to exclusively reference the provided document, and an accompanying user request.

FACTS Grounding dataset example
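
To make this structure concrete, here is a minimal sketch of what a single example record might look like. The field names and contents below are illustrative assumptions, not the official dataset schema.

```python
# A minimal sketch of one FACTS Grounding example, assuming JSON-style
# records. Field names and contents are illustrative, not the official schema.
example = {
    # The source document the response must be grounded in
    # (up to ~32,000 tokens in the real dataset).
    "context_document": (
        "ACME Corp reported Q3 revenue of $12.4M, up 12% year over year, "
        "driven primarily by subscription renewals in its enterprise segment..."
    ),
    # System instruction restricting the model to the provided document.
    "system_instruction": (
        "Answer the user's request using only information from the "
        "provided document. Do not rely on outside knowledge."
    ),
    # The accompanying user request (summarization, Q&A, rewriting, etc.).
    "user_request": "What drove ACME Corp's revenue growth in Q3?",
}
```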

All examples are split into a “public” set (860 examples) and a “private” held-out set (859 examples). We are releasing the public set today so that anyone can use it to evaluate an LLM. Of course, we know that issues of benchmark contamination and leaderboard hacking are important to protect against, so, following standard industry practice, we are keeping the private evaluation set held out. The FACTS leaderboard scores are the average performance across both the public and private sets.
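
As a tiny illustration of that scoring rule, assuming an unweighted mean of the two split scores (a reasonable reading, since the splits are nearly equal in size):

```python
# Illustrative assumption: a model's published FACTS score is the
# unweighted mean of its public-set and private-set performance.
def facts_leaderboard_score(public_score: float, private_score: float) -> float:
    return (public_score + private_score) / 2

print(facts_leaderboard_score(0.84, 0.80))  # -> 0.82
```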

To ensure a diversity of inputs, the FACTS Grounding examples include documents of varying lengths, up to a maximum of 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. The user requests are similarly wide-ranging, including requests for summarization, Q&A generation, and rewriting tasks. We did not include examples that could require creativity, mathematics, or complex reasoning, capabilities which might require the model to apply more advanced reasoning in addition to grounding.

Prompt distribution

Collective judgment by leading LLMs

To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that is both a comprehensive answer to the user request and fully attributable to that document.

FACTS Grounding evaluates model responses automatically using three frontier LLM judges: Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We chose a combination of different judges to mitigate any potential bias of a judge giving higher scores to responses produced by a member of its own model family. The automated judge models were comprehensively evaluated against a held-out test set to find the best-performing judge prompt templates and to verify agreement with human raters.

Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility and disqualified if they do not sufficiently address the user’s request. Second, responses are judged factually accurate if they are free of hallucinations and fully grounded in the document provided.

The eligibility and grounding accuracy of a given LLM response are evaluated independently by multiple AI judge models, and the results are aggregated to determine whether the LLM has dealt with the example successfully. The final score for the overall grounding task is the average of all judge models’ scores across all examples. Find more details of our FACTS Grounding evaluation methodology in our paper.
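
A minimal sketch of this two-phase, multi-judge aggregation is below, assuming simple boolean judge verdicts. The `is_eligible` and `is_grounded` functions are placeholder stubs standing in for calls to the actual judge models with their prompt templates, which the paper describes in full.

```python
from statistics import mean

# The three frontier judge models named above.
JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

def is_eligible(judge: str, example: dict, response: str) -> bool:
    """Phase 1: does the response sufficiently address the user request?
    Placeholder stub; a real implementation would prompt the judge model."""
    return bool(response.strip())

def is_grounded(judge: str, example: dict, response: str) -> bool:
    """Phase 2: is the response free of hallucinations and fully
    attributable to the provided document? Placeholder stub."""
    return True

def example_score(example: dict, response: str) -> float:
    # Each judge evaluates independently; a response counts for a judge
    # only if it is both eligible and grounded. Aggregate across judges.
    verdicts = [
        is_eligible(j, example, response) and is_grounded(j, example, response)
        for j in JUDGES
    ]
    return mean(float(v) for v in verdicts)

def grounding_score(examples: list[dict], responses: list[str]) -> float:
    # Final task score: the average of judge scores across all examples.
    return mean(example_score(e, r) for e, r in zip(examples, responses))
```

Note that under this scheme an ineligible response is counted as a failure rather than excluded, which matches the disqualification rule described in the next paragraph.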

A factually correct response that fails to sufficiently address the user’s request fails the benchmark example. Here are three instances of model responses that the automated LLM judges disqualified.

FACTS Grounding continues to evolve

We are mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and we aim to grow and iterate FACTS Grounding as the field progresses, continually raising the bar.

We encourage the AI community to engage with FACTS Grounding, evaluate their models on the open set of examples, or submit their models for evaluation. We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems.

Acknowledgements

FACTS Grounding was led by Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, and Nate Keating.

We are also very grateful for contributions from Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, and Sasha Goldstein.

We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.
