FACTS Benchmark Suite: A new way to systematically assess the factuality of LLMs

Large language models (LLMs) are increasingly becoming a primary source of information across a variety of use cases, so it’s important that their responses are factually accurate.

To continue to improve performance against this industry-wide challenge, we need to better understand the types of use cases in which models struggle to provide accurate responses, and better measure factual performance in those areas.

FACTS Benchmark Suite

Today, we’re teaming up with Kaggle to introduce the FACTS Benchmark Suite. It extends our previous work on the FACTS Grounding Benchmark and adds three new factuality benchmarks:

A parametric benchmark that measures a model’s ability to accurately access its internal knowledge in factoid-question use cases.

A search benchmark that tests a model’s ability to use search as a tool to retrieve and correctly synthesize information.

A multimodal benchmark that tests a model’s ability to answer prompts about input images in a factually correct manner.

We’re also updating the original FACTS Grounding Benchmark with Grounding Benchmark – v2, an enhanced benchmark for testing a model’s ability to provide answers grounded in the context supplied with a given prompt.

Each benchmark was carefully curated, resulting in a total of 3,513 examples, which are published today. As with previous releases, we follow standard industry practice and keep a held-out evaluation set private. The FACTS Benchmark Suite score (or FACTS score) is calculated as the average accuracy of both public and private sets across the four benchmarks. Kaggle oversees the management of the FACTS Benchmark Suite. This includes owning the private held-out sets, testing leading LLMs on the benchmarks, and hosting the results on a public leaderboard. For more information on the FACTS evaluation methodology, please see our technical report.
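As a rough illustration of the scoring described above, the sketch below averages accuracies into a single score. It is not the official implementation: the exact weighting is defined in the technical report, and this example assumes that each benchmark's accuracy is the unweighted mean of its public and private accuracies and that the four benchmarks are weighted equally. The function name `facts_score` and all numbers are hypothetical.

```python
from statistics import mean


def facts_score(accuracies):
    """Average accuracy across benchmarks (illustrative assumption only).

    accuracies maps a benchmark name to a dict with "public" and "private"
    accuracies in [0, 1]. Assumption: public and private sets are weighted
    equally within a benchmark, and benchmarks are weighted equally.
    """
    per_benchmark = {
        name: mean([split["public"], split["private"]])
        for name, split in accuracies.items()
    }
    overall = mean(list(per_benchmark.values()))
    return overall, per_benchmark


# Made-up accuracies for a hypothetical model; not real leaderboard results.
example = {
    "grounding": {"public": 0.5, "private": 0.75},
    "parametric": {"public": 1.0, "private": 0.0},
    "search": {"public": 0.25, "private": 0.25},
    "multimodal": {"public": 0.5, "private": 0.5},
}

overall, per_benchmark = facts_score(example)
print(f"FACTS score (sketch): {overall:.3f}")
```

If the sets differ in size, a variant weighted by example count would give a different number, so the equal-weight choice here should be read as one possible interpretation.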

Benchmark overview

Parametric benchmark

The FACTS parametric benchmark evaluates a model’s ability to accurately answer fact-based questions without the aid of external tools such as web search. All benchmark questions are “trivia-style” questions driven by user interest and can be answered using Wikipedia (a standard source for LLM pre-training). The resulting benchmark consists of a public set of 1,052 items and a private set of 1,052 items.

versatileai
