Large language models (LLMs) are increasingly becoming a primary source of information across a variety of use cases, so it’s important that their responses are factually accurate.
To continue to improve performance against this industry-wide challenge, we need to better understand the types of use cases in which models struggle to provide accurate responses, and better measure factual performance in those areas.
FACTS Benchmark Suite
Today, we’re teaming up with Kaggle to introduce the FACTS Benchmark Suite. It extends our previous work developing the FACTS Grounding Benchmark and adds three additional factuality benchmarks:
A parametric benchmark that measures a model’s ability to accurately access its internal knowledge in factoid question use cases.
A search benchmark that tests a model’s ability to use search as a tool to retrieve and correctly synthesize information.
A multimodal benchmark that tests a model’s ability to answer prompts related to input images in a factually correct manner.
We’re also updating the original FACTS Grounding Benchmark with Grounding Benchmark – v2, an enhanced benchmark for testing a model’s ability to provide answers grounded in the context of a given prompt.
Each benchmark was carefully curated, and together they comprise 3,513 examples, which we are publishing today. As with previous releases, we follow standard industry practice and keep a held-out portion of the evaluation data as private sets. The FACTS Benchmark Suite score (or FACTS score) is calculated as the average accuracy over both the public and private sets across the four benchmarks. Kaggle oversees the management of the FACTS Benchmark Suite, which includes owning the private held-out sets, testing key LLMs on the benchmarks, and hosting the results on public leaderboards. For more information on the FACTS evaluation methodology, please see our technical report.
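To make the scoring rule concrete, here is a minimal Python sketch of one way to read it. It assumes an unweighted mean: the public and private accuracies are averaged within each benchmark, and those four values are then averaged. The exact weighting is defined in the technical report, so treat this as an illustration only. The function names and the accuracy values below are hypothetical and are not real results.

```python
from statistics import mean


def benchmark_accuracy(public_acc: float, private_acc: float) -> float:
    """Average the public-set and private-set accuracy of one benchmark.

    Assumption: both sets are weighted equally.
    """
    return mean([public_acc, private_acc])


def facts_score(per_benchmark: dict[str, tuple[float, float]]) -> float:
    """Average the per-benchmark accuracies into a single score.

    `per_benchmark` maps a benchmark name to (public_acc, private_acc).
    Assumption: all benchmarks are weighted equally.
    """
    if not per_benchmark:
        raise ValueError("at least one benchmark is required")
    return mean(
        benchmark_accuracy(public, private)
        for public, private in per_benchmark.values()
    )


# Illustrative, made-up accuracies (not measured results).
example_scores = {
    "grounding": (0.70, 0.60),
    "parametric": (0.50, 0.50),
    "search": (0.80, 0.60),
    "multimodal": (0.40, 0.40),
}

print(f"FACTS score (illustrative): {facts_score(example_scores):.4f}")
```

Under this reading, a weak result on any one benchmark pulls the overall score down by the same amount, regardless of how many examples that benchmark contains.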
Benchmark overview
Parametric benchmark
The FACTS parametric benchmark evaluates a model’s ability to accurately answer fact-based questions without the aid of external tools such as web search. All benchmark questions are “trivia-style” questions driven by user interest that can be answered using Wikipedia (a standard source for LLM pre-training). The resulting benchmark consists of a public set of 1,052 items and a private set of 1,052 items.

