If AI reasoning doesn’t work: Microsoft Research shows that more tokens can mean more problems

By versatileai · April 15, 2025 · 7 min read

Large language models (LLMs) are increasingly capable of complex reasoning through “inference-time scaling,” a set of techniques that allocate more computational resources during inference to generate answers. However, a new study from Microsoft Research reveals that the effectiveness of these scaling methods is not universal: performance gains vary widely across models, tasks, and problem complexity.

The core finding is that throwing more compute at a problem during inference does not guarantee better or more efficient results. The findings can help enterprises better understand cost volatility and model reliability as they work to integrate advanced AI reasoning into their applications.

Putting the scaling methods to the test

The Microsoft Research team conducted an extensive empirical analysis across nine state-of-the-art foundation models. This included both “conventional” models such as GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro, and Llama 3.1 405B, as well as models specifically fine-tuned for enhanced reasoning through inference-time scaling: OpenAI’s o1 and o3-mini, Anthropic’s Claude 3.7 Sonnet, Google’s Gemini 2 Flash Thinking, and DeepSeek R1.

They evaluated these models using three distinct inference-time scaling approaches (a minimal sketch of the latter two follows the list):

  • Standard chain-of-thought (CoT): The baseline method, in which the model is asked to answer step by step.
  • Parallel scaling: The model generates multiple independent answers for the same question and uses an aggregator (such as majority voting or picking the best-scoring answer) to arrive at a final result.
  • Sequential scaling: The model generates answers iteratively, using feedback from a critic (potentially the model itself) to refine its answer in subsequent attempts.
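
To make the two scaled approaches concrete, here is a minimal Python sketch of parallel and sequential scaling. The `generate` and `critique` callables are hypothetical stand-ins for whatever LLM API is in use, and majority voting stands in for the aggregator; this illustrates the general technique, not the paper’s implementation.

```python
from collections import Counter
from typing import Callable

def parallel_scaling(generate: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n independent answers, then aggregate by majority vote."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def sequential_scaling(generate: Callable[[str], str],
                       critique: Callable[[str, str], str],
                       prompt: str, rounds: int = 3) -> str:
    """Generate an answer, then revise it with critic feedback each round."""
    answer = generate(prompt)
    for _ in range(rounds):
        feedback = critique(prompt, answer)
        answer = generate(
            f"{prompt}\nPrevious answer: {answer}\n"
            f"Critic feedback: {feedback}\nPlease revise your answer."
        )
    return answer
```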

These approaches were tested on eight challenging benchmark datasets covering a wide range of tasks that benefit from step-by-step problem solving: math and STEM reasoning (AIME, Omni-MATH, GPQA), calendar planning (BA-Calendar), NP-hard problems (3SAT, TSP), and spatial reasoning (spatial maps).

Some benchmarks include problems at varying difficulty levels, allowing a more nuanced understanding of how scaling behaves as problems get harder.

“The availability of difficulty tags for Omni-MATH, TSP, 3SAT, and BA-Calendar enables us to analyze how accuracy and token usage scale with difficulty in inference-time scaling,” the researchers write.

The researchers evaluated the Pareto frontier of LLM reasoning by analyzing both accuracy and computational cost (i.e., the number of tokens generated). This helps identify how efficiently a model achieves its results.
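
As an illustration of that analysis, a Pareto frontier over (token cost, accuracy) points can be computed as in the generic sketch below; this is not the paper’s code, and the example numbers are invented.

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep only (tokens, accuracy) configurations that are not dominated,
    i.e., no other configuration is cheaper-or-equal and more accurate."""
    frontier = []
    # Sort by ascending token cost; on ties, higher accuracy first.
    for tokens, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        # A point survives only if it beats the best accuracy seen so far.
        if not frontier or accuracy > frontier[-1][1]:
            frontier.append((tokens, accuracy))
    return frontier

# Invented example: (mean tokens per question, accuracy) for four configurations.
configs = [(900, 0.42), (3_200, 0.61), (4_100, 0.58), (12_500, 0.63)]
print(pareto_frontier(configs))  # [(900, 0.42), (3200, 0.61), (12500, 0.63)]
```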

Figure: Inference-time scaling Pareto frontier (credit: arXiv)

They also introduced a “conventional gap” measure, which compares the best possible performance of conventional models (using an ideal “best-of-N” selection) against the average performance of reasoning models, estimating the potential gains achievable through better training or verification techniques.
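
Read literally, that gap measure can be computed as in the short sketch below. This is one interpretation of the description above, not the paper’s code, and it assumes per-question scores collected across N repeated runs for each model.

```python
import statistics

def conventional_gap(conventional_runs: list[list[float]],
                     reasoning_runs: list[list[float]]) -> float:
    """Each inner list holds one question's scores across N repeated runs.
    An ideal verifier is assumed to pick the conventional model's best answer
    (best-of-N), while the reasoning model is credited with its average run."""
    conventional_best_of_n = statistics.mean(max(s) for s in conventional_runs)
    reasoning_average = statistics.mean(statistics.mean(s) for s in reasoning_runs)
    return conventional_best_of_n - reasoning_average
```

Under this reading, a large positive gap would suggest meaningful headroom for better verification or training to close the distance at lower cost.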

More compute isn’t always the answer

The study yielded several important insights that challenge common assumptions about inference-time scaling:

Benefits vary significantly: While models fine-tuned for reasoning generally outperform conventional models on these tasks, the degree of improvement varies greatly depending on the specific domain and task, and gains often diminish as problem complexity increases. For example, performance gains seen on math problems did not always translate equally to scientific reasoning or planning tasks.

Token inefficiency is rife: The researchers observed high variability in token consumption, even between models achieving similar accuracy. For example, on the AIME 2025 math benchmark, DeepSeek-R1 used more than five times the tokens of Claude 3.7 Sonnet for roughly comparable average accuracy.

More tokens do not lead to higher accuracy: Contrary to the intuitive idea that longer reasoning chains mean better reasoning, the study found this is not always true. “Surprisingly, we also observe that longer generations relative to the same model can sometimes be an indicator of models struggling, rather than improved reflection,” the paper states. “Similarly, when comparing different reasoning models, higher token usage is not always associated with better accuracy. These findings motivate the need for more purposeful and cost-effective scaling approaches.”

Cost nondeterminism: Perhaps most concerning for enterprise users, repeated queries to the same model for the same problem can result in highly variable token usage. This means the cost of running a query can fluctuate significantly even when the model consistently provides the correct answer.

Figure: Variance in model response length across repeated runs (credit: arXiv)

The potential of verification mechanisms: Scaling performance consistently improved across all models and benchmarks when simulated with a “perfect verifier” (using the best-of-N results).

Conventional models sometimes match reasoning models: By significantly increasing inference calls (up to 50x more in some experiments), conventional models like GPT-4o could sometimes approach the performance levels of dedicated reasoning models, particularly on less complex tasks. However, these gains diminished rapidly in highly complex settings, indicating that brute-force scaling has its limits.

Figure: On some tasks, the accuracy of GPT-4o continues to improve with parallel and sequential scaling (credit: arXiv)

Implications for the enterprise

These findings carry significant weight for developers and enterprise adopters of LLMs. The “cost nondeterminism” issue is particularly stark, as it makes budgeting difficult. As the researchers point out, “Ideally, developers and users would prefer models for which the standard deviation on token usage per instance is low for cost predictability.”

“The profiling we have done in Microsoft Research can be useful for developers as a tool to pick which models are less erratic for the same prompt or for different prompts,” said Besmira Nushi, senior principal research manager at Microsoft Research. “Ideally, one would want to pick a model that has low standard deviation for correct inputs.”

Figure: A model whose distribution peaks to the left consistently generates the same number of tokens for a given task (credit: arXiv)
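
In practice, a rough version of that profiling is straightforward: replay the same prompt several times and measure the spread in generated tokens. A minimal sketch follows, in which `count_tokens` is a hypothetical wrapper that calls your model once and returns the completion’s token count.

```python
import statistics
from typing import Callable

def profile_token_variability(count_tokens: Callable[[str], int],
                              prompt: str, runs: int = 20) -> dict[str, float]:
    """Repeat the same prompt and summarize the spread in generated tokens.
    A lower standard deviation means more predictable per-query cost."""
    counts = [count_tokens(prompt) for _ in range(runs)]
    return {
        "mean_tokens": statistics.mean(counts),
        "stdev_tokens": statistics.stdev(counts),
        "min_tokens": min(counts),
        "max_tokens": max(counts),
    }
```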

The study also provides useful insight into the correlation between a model’s accuracy and its response length. For example, the researchers show that math generations beyond roughly 11,000 tokens have a very slim chance of being correct, so those generations should either be stopped at that point or restarted with some sequential feedback. However, Nushi points out that models allowing these post-hoc mitigations also have a cleaner separation between correct and incorrect samples.
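
One way to act on that observation is a simple budget guard: cut a generation off once it passes the token threshold and restart it with sequential feedback. The sketch below is illustrative only; `stream_tokens` and `revise_prompt` are hypothetical hooks into a serving stack, and the ~11,000-token threshold from the study is task-specific.

```python
from typing import Callable, Iterable, Optional

def generate_with_budget(stream_tokens: Callable[[str], Iterable[str]],
                         revise_prompt: Callable[[str, str], str],
                         prompt: str, budget: int = 11_000,
                         max_restarts: int = 2) -> Optional[str]:
    """Abort generations that exceed `budget` tokens (overly long generations
    were rarely correct in the study); restart with feedback instead."""
    for _ in range(max_restarts + 1):
        tokens: list[str] = []
        for tok in stream_tokens(prompt):
            tokens.append(tok)
            if len(tokens) > budget:
                break  # over budget: treat this attempt as a likely failure
        else:
            return "".join(tokens)  # finished within budget
        prompt = revise_prompt(prompt, "".join(tokens))  # add feedback, retry
    return None  # every attempt exceeded the budget
```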

“Ultimately, it is also the responsibility of model builders to think about reducing accuracy and cost nondeterminism, and we expect a lot of this to happen as the methods mature,” Nushi said. “Alongside cost nondeterminism, accuracy nondeterminism also applies.”

Another important finding is the consistent performance boost from perfect verifiers, which highlights a critical area for future work: building robust and broadly applicable verification mechanisms.

“The availability of stronger verifiers can have different types of impact,” Nushi said. “If used efficiently, these can also shorten the reasoning traces.”

Strong verifiers can also become a central part of enterprise agentic AI solutions. Many enterprise stakeholders already have such verifiers in place, such as SAT solvers and logistics validity checkers, which may need to be repurposed for agentic solutions.

“The question for the future is how these existing techniques can be combined with AI-driven interfaces, and what is the language that connects the two,” Nushi said. “The necessity of connecting the two comes from the fact that users will not always formulate their queries in a formal way; they will want to use a natural language interface and expect solutions in a similar format, or in a final action (e.g., proposing a meeting invite).”
