Responsibility and safety
Published December 17, 2024
Author: FACTS team
Our comprehensive benchmark and online leaderboard provide a much-needed measure of how accurately LLMs ground their responses in the provided source material and avoid hallucinations.
Large language models (LLMs) are transforming how we access information, but their grip on factual accuracy remains imperfect. They can "hallucinate" false information, particularly when given complex inputs. This can erode trust in LLMs and limit their applications in the real world.
Today, we're introducing FACTS Grounding, a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.
We hope our benchmark will spur industry-wide progress on factuality and grounding. To track progress, we're also launching the FACTS leaderboard on Kaggle. We've already tested leading LLMs using FACTS Grounding and populated the initial leaderboard with their grounding scores, and we'll maintain and update it as the field advances.
Current leaderboard ranking
FACTS Grounding dataset
To accurately evaluate the factuality and grounding of any given LLM, the FACTS Grounding dataset comprises 1,719 examples, each carefully crafted to require a long-form response grounded in the context document provided. Each example comprises a document, a system instruction requiring the LLM to exclusively reference the provided document, and an accompanying user request.
FACTS Grounding dataset example
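As a rough illustration of this three-part structure, an example can be thought of as a record like the following. The field names and placeholder values here are our own, not the official dataset schema:

```python
# Hypothetical sketch of a FACTS Grounding example record.
# Field names are illustrative assumptions, not the official schema.
from dataclasses import dataclass


@dataclass
class FactsExample:
    system_instruction: str  # tells the model to answer only from the document
    context_document: str    # up to ~32,000 tokens of source material
    user_request: str        # the task: summarization, Q&A generation, rewriting, etc.


example = FactsExample(
    system_instruction=(
        "Answer the user's request using only information from the "
        "provided document. Do not draw on outside knowledge."
    ),
    context_document="<long source document, e.g. a financial report>",
    user_request="Summarize the key risk factors discussed in the document.",
)
```

The system instruction is what makes the task a grounding task: the model is scored on whether its response is attributable to the document alone, not on general world knowledge.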
The examples are divided into a "public" set (860 examples) and a "private" held-out set (859 examples). We are releasing the public set today so that anyone can use it to evaluate an LLM. Of course, we know that benchmark contamination and leaderboard hacking are important concerns, so, following standard industry practice, we keep the private set held out for evaluation. FACTS leaderboard scores are the average performance across both the public and private sets.
To ensure a diversity of inputs, FACTS Grounding examples include documents of varying lengths, up to a maximum of 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, healthcare, and law. User requests are similarly wide-ranging, spanning summarization, Q&A generation, and rewriting tasks. We did not include examples that could require creativity, mathematics, or complex reasoning, capabilities that would require the model to apply more advanced reasoning in addition to grounding.
Prompt distribution
Collective judgement by leading LLMs
To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that both comprehensively answers the user request and is fully attributable to that document.
FACTS Grounding evaluates model responses automatically using three frontier LLM judges: Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We chose a combination of different judges to mitigate the potential bias of a judge giving higher scores to responses produced by a member of its own model family. The automated judge models were comprehensively evaluated against a held-out test set to find the best-performing judge prompt templates and to verify agreement with human raters.
Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility and disqualified if they don't sufficiently address the user's request. Second, responses are judged as factually accurate if they are fully grounded in the information contained in the provided document, with no hallucinations.
The eligibility and grounding accuracy of a given LLM response are evaluated independently by the multiple AI judge models, and the results are then aggregated to determine whether the LLM has addressed the example successfully. The final score for the overall grounding task is the average of all judge models' scores across all examples. Find more details of our FACTS Grounding evaluation methodology in our paper.
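The aggregation described above can be sketched as follows. This is a simplified reconstruction under our own assumptions (function names and the exact aggregation rule are ours), not the benchmark's actual implementation:

```python
# Simplified sketch of FACTS Grounding score aggregation.
# Names and aggregation details are illustrative assumptions,
# not the official evaluation code.

def example_score(judgements):
    """judgements: list of (eligible, grounded) booleans, one per judge model.

    A response earns credit from a judge only if it both adequately
    addresses the request (eligible) and is fully grounded in the document.
    Returns the fraction of judges that pass the response.
    """
    passes = [eligible and grounded for eligible, grounded in judgements]
    return sum(passes) / len(passes)


def benchmark_score(all_judgements):
    """Average the per-example judge scores across every example in the set."""
    scores = [example_score(j) for j in all_judgements]
    return sum(scores) / len(scores)


# Example: two examples, each scored by three judges
# (e.g. Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet).
results = [
    [(True, True), (True, True), (True, False)],  # 2 of 3 judges pass
    [(True, True), (False, True), (True, True)],  # 2 of 3 judges pass
]
overall = benchmark_score(results)  # average across judges and examples
```

Treating eligibility as a hard filter before the grounding check means a perfectly grounded but unhelpful response scores zero, which matches the disqualification behaviour the post describes.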
A factually correct response that fails to adequately address the user's request fails the benchmark example. Here are three examples of model responses that the automated LLM judges disqualified:
FACTS Grounding continues to evolve
Mindful that benchmarks can be quickly overtaken by progress, this launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and we are committed to growing and iterating FACTS Grounding as the field advances, continually raising the bar.
We encourage the AI community to engage with FACTS Grounding, evaluate their models on the open set of examples, or submit their models for evaluation. We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems.
Acknowledgements
FACTS Grounding was led by Aron Jacoby, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovets, Kate Olszewska, Lukas Haas, Michelle Liu, and Nate Keating.
We also greatly appreciate contributions from Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zhang Ziao, and Sasha Goldstein.
We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.