Flaws in AI benchmarks put company budgets at risk

By versatileai | November 4, 2025

A new academic review suggests that AI benchmarks are flawed and could lead to companies making high-stakes decisions on “misleading” data.

Business leaders are committing eight to nine-figure budgets to generative AI programs. These procurement and development decisions often rely on public leaderboards and benchmarks to compare model features.

The large-scale study, “Measuring What Matters: Construct Validity in Large Language Model Benchmarks,” analyzed 445 individual LLM benchmarks from major AI conferences. The team of 29 expert reviewers found that “almost every paper had weaknesses in at least one area,” undermining the performance claims those benchmarks support.

For CTOs and chief data officers, this is at the heart of their AI governance and investment strategy. If benchmarks that claim to measure “safety” or “robustness” don’t actually capture those qualities, organizations may deploy models that expose them to serious financial and reputational risks.

“Construct Validity” Issues

The researchers focused on a core scientific principle known as construct validity. Simply put, this is how well a test measures the abstract concept it purports to measure.

For example, “intelligence” cannot be measured directly, but tests are created that serve as measurable surrogates. The paper states that if a benchmark has low construct validity, “high scores may be irrelevant or even misleading.”

This problem is widespread in AI evaluation. The study found that key concepts were often “poorly defined or operationalized.” This can lead to “poorly supported scientific claims, misguided research, and policy influences that are not based on solid evidence.”

When vendors compete for enterprise contracts by emphasizing top benchmark scores, leaders are effectively trusting those scores to be reliable indicators of real-world business performance. This new research suggests that trust may be misplaced.

Where enterprise AI benchmarks are failing

The review identified flaws throughout the system, from how benchmarks were designed to how results were reported.

Vague or disputed definitions: You can’t measure what you can’t define. The study found that even where a definition of the target phenomenon was provided, 47.8% of those definitions were contested, covering concepts with “many possible definitions or no clear definition at all.”

The paper takes ‘non-maleficence’, a key goal of enterprise safety alignment, as an example of a phenomenon that often lacks a clear, agreed-upon definition. If two vendors score differently on a “harmlessness” benchmark, the gap may simply reflect two different arbitrary definitions of the term rather than a true difference in the safety of their models.

Lack of statistical rigor: Perhaps most worrying for data-driven organizations, the review found that only 16% of the 445 benchmarks used uncertainty estimates or statistical tests to compare model results.

Without statistical analysis, it is impossible to know whether Model A’s 2% lead over Model B is a true difference in capability or simply statistical noise. Corporate decisions are being made based on numbers that would not pass basic scientific or business intelligence review.
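To illustrate the kind of uncertainty estimate the review found missing, here is a minimal sketch of a paired bootstrap confidence interval for the accuracy gap between two models. The per-item scores are hypothetical, and the bootstrap is only one of several valid approaches:

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap 95% CI for the accuracy gap between two models.

    scores_a / scores_b are per-item 0/1 correctness lists over the
    same benchmark items (hypothetical data, standard library only).
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]

# Model A leads by 2 points on 200 items, but the models disagree on many items:
a = [1] * 24 + [0] * 20 + [1] * 106 + [0] * 50   # 130/200 correct (65%)
b = [0] * 24 + [1] * 20 + [1] * 106 + [0] * 50   # 126/200 correct (63%)
low, high = bootstrap_diff_ci(a, b)
print(f"95% CI for the accuracy gap: [{low:+.3f}, {high:+.3f}]")
```

On this synthetic data the interval straddles zero, so the 2-point lead is indistinguishable from noise; reporting an interval rather than a point score is precisely the check only 16% of the reviewed benchmarks performed.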

Data contamination and memorization: Many benchmarks, especially those for reasoning (such as the widely used GSM8K), are compromised when their questions and answers appear in a model’s pre-training data.

When this happens, the model is not reasoning its way to the answer; it is simply recalling it. A high score may indicate good memorization rather than the advanced reasoning ability companies actually need for complex tasks. The paper warns that this “undermines the validity of the results” and recommends building contamination checks directly into benchmarks.
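A crude version of such a contamination check can be sketched with word n-gram overlap between a benchmark item and a training corpus. The 8-word window and 50% threshold are illustrative choices, not the paper’s method; production systems typically use suffix arrays or Bloom filters at scale:

```python
def ngram_set(text, n=8):
    """All lowercase word n-grams in a text (n=8 is an assumed window size)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, training_corpus, n=8, threshold=0.5):
    """Flag a benchmark item whose n-grams substantially overlap a corpus.

    Returns True when at least `threshold` of the item's n-grams also
    appear in the corpus -- a sign the item may have been memorized.
    """
    item_grams = ngram_set(benchmark_item, n)
    if not item_grams:
        return False
    corpus_grams = ngram_set(training_corpus, n)
    overlap = len(item_grams & corpus_grams) / len(item_grams)
    return overlap >= threshold
```

An item found verbatim in the pre-training corpus is flagged, and a model’s score on flagged items can then be reported separately from its score on clean ones.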

Unrepresentative datasets: The study found that 27% of benchmarks used “convenience sampling,” such as reusing data from existing benchmarks or human exams. This data often fails to represent the real-world phenomenon of interest.

For example, the authors point out that reused “calculator-free” exam questions use numbers chosen to keep mental arithmetic easy. A model may score well on such a test, but that score “does not predict performance at larger numbers, where LLMs struggle.” This creates significant blind spots that hide known model weaknesses.
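One way to avoid that blind spot is to generate evaluation items stratified by operand size rather than reusing exam numbers. The sketch below is hypothetical; the item format and strata are assumptions, not anything the paper prescribes:

```python
import random

def make_arithmetic_items(digits, n_items=50, seed=0):
    """Generate multiplication items whose operands have a fixed digit count.

    Reused "calculator-free" exam questions cluster in the small-digit
    stratum; sampling each magnitude separately exposes where scores drop.
    """
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    items = []
    for _ in range(n_items):
        x, y = rng.randint(lo, hi), rng.randint(lo, hi)
        items.append({"prompt": f"What is {x} * {y}?", "answer": x * y})
    return items

def accuracy_by_stratum(model_answer, strata=(2, 4, 8)):
    """Score a model callable on each digit stratum separately.

    `model_answer` takes a prompt string and returns an integer answer.
    """
    report = {}
    for d in strata:
        items = make_arithmetic_items(d)
        correct = sum(model_answer(it["prompt"]) == it["answer"] for it in items)
        report[f"{d}-digit"] = correct / len(items)
    return report
```

A single headline accuracy would average these strata together; the per-stratum report shows whether performance collapses exactly where the reused exam questions never looked.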

From public indicators to internal verification

For business leaders, this study is a strong warning that public AI benchmarks are not a substitute for internal and domain-specific assessments. A high score on a public leaderboard does not guarantee suitability for any particular business purpose.

Isabella Grandi, Director of Data Strategy and Governance at NTT Data UK&I, commented: “A single benchmark may not be the right way to capture the complexity of AI systems, and expecting to do so risks reducing progress to a numbers game rather than a measure of real-world responsibility. What is most important is consistent assessment against clear principles that ensure technology not only progresses but also serves people.”

“Good methodology, as defined in ISO/IEC 42001:2023, reflects this balance through five core principles: accountability, fairness, transparency, security, and redress. Accountability establishes ownership of and responsibility for the AI systems an organization deploys, keeping them ethical and accountable. Security and privacy are non-negotiable, preventing abuse and strengthening public trust. Remedies and appeals provide an important mechanism for oversight, allowing people to challenge and correct outcomes when necessary.

“True progress in AI will depend on collaboration that brings together the vision of governments, the curiosity of academia, and the practical drivers of industry. When partnerships are underpinned by open dialogue and common standards are established, they build the transparency needed to instill trust in AI systems for people. Responsible innovation will always depend on collaboration that strengthens oversight while maintaining ambition.”

The paper’s eight recommendations provide a practical checklist for companies looking to build internal AI benchmarks and assessments in line with such a principles-based approach.

Define the phenomenon: Before testing a model, an organization must first create a “precise and operational definition of the phenomenon being measured.” What does a “useful” response mean for customer service? What does “accurate” mean for financial reporting?

Build representative datasets: The most valuable benchmarks are built from your own data. The paper advises developers to “build representative datasets appropriate to the task,” meaning task items that reflect the real-life scenarios, formats, and challenges faced by employees and customers.

Conduct error analysis: Go beyond the final score. The report recommends teams “conduct qualitative and quantitative analysis of common failure modes.” Understanding why a model fails is more useful than the score itself: failures confined to low-priority, obscure topics may be acceptable, but failures in the most common, high-value use cases make a headline score irrelevant.

Justify relevance: Finally, teams must “justify the relevance of the benchmark to real-world applications of the phenomenon.” Every evaluation should be accompanied by a clear rationale explaining why this particular test is a valid proxy for business value.
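The error-analysis step can be made concrete with a small tally of failures by category and business priority. The result schema below is a hypothetical internal evaluation log, not anything the paper specifies:

```python
from collections import Counter

def error_analysis(results):
    """Summarize failures by category and business priority.

    `results` is a list of dicts with keys: correct (bool),
    category (str), priority ("high" or "low") -- an assumed
    schema for an internal evaluation log.
    """
    failures = [r for r in results if not r["correct"]]
    by_category = Counter(r["category"] for r in failures)
    high_priority = sum(1 for r in failures if r["priority"] == "high")
    total = len(results)
    return {
        "overall_accuracy": (total - len(failures)) / total,
        "failures_by_category": dict(by_category),
        "high_priority_failures": high_priority,
    }
```

Two models with identical overall accuracy can differ sharply in high-priority failures, which is exactly the distinction a single leaderboard number erases.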

The race to adopt generative AI is forcing organizations to move faster than their governance frameworks can keep up. This report shows that the very tools used to measure progress are often flawed. The only sure path forward is to stop relying on popular AI benchmarks and start “measuring what matters” to your company.

SEE ALSO: OpenAI spreads $600 billion bet on cloud AI across AWS, Oracle, Microsoft

