Large language models (LLMs) are increasingly capable of complex reasoning through “inference-time scaling,” a set of techniques that allocate more computational resources during inference to generate answers. However, a new study from Microsoft Research reveals that the effectiveness of these scaling methods isn’t universal: performance gains vary widely across models, tasks, and problem complexity.
The core finding is that simply throwing more compute at a problem during inference doesn’t guarantee better or more efficient results. These findings can help enterprises better understand cost volatility and model reliability as they look to integrate advanced AI reasoning into their applications.
Putting inference-time scaling to the test
The Microsoft Research team conducted an extensive empirical analysis of nine state-of-the-art foundation models. These include both “conventional” models such as GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro, and Llama 3.1 405B, as well as models specifically fine-tuned for enhanced reasoning through inference-time scaling: OpenAI’s o1 and o3-mini, Anthropic’s Claude 3.7 Sonnet, Google’s Gemini 2 Flash Thinking, and DeepSeek R1.
They evaluated these models using three distinct inference-time scaling approaches:
Standard Chain-of-Thought (CoT): The baseline approach, in which the model is prompted to answer step by step.
Parallel scaling: The model generates multiple independent answers to the same question and uses an aggregator (such as majority voting or picking the best-scoring answer) to arrive at a final result (sketched in code below).
Sequential scaling: The model generates an answer, then uses feedback from a critic (potentially the model itself) to refine the answer in subsequent attempts.
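To make the parallel-scaling idea concrete, here is a minimal Python sketch of sampling several independent answers and aggregating them by majority vote. The `generate_answer` callable is a hypothetical stand-in for a model call plus answer extraction; it is not code from the paper.

```python
from collections import Counter
from typing import Callable

def parallel_scale(generate_answer: Callable[[str], str],
                   question: str,
                   n_samples: int = 5) -> str:
    """Parallel scaling: sample several independent answers and
    aggregate them with a simple majority vote."""
    answers = [generate_answer(question) for _ in range(n_samples)]
    # Majority vote over the final answers; ties resolve to the
    # answer that appeared first.
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```

A best-of-N variant would replace the majority vote with a scoring or verification step, which is where the verifier findings discussed later come in.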
These approaches were tested on eight challenging benchmark datasets covering a wide range of tasks: math and STEM reasoning (AIME, Omni-MATH, GPQA), calendar planning (BA-Calendar), NP-hard problems (3SAT, TSP), and spatial reasoning (Spatial Maps).
Some benchmarks include problems at varying difficulty levels, providing a more nuanced understanding of how scaling behaves as problems become harder.
“The availability of difficulty tags for Omni-MATH, TSP, 3SAT, and BA-Calendar enables analyzing how accuracy and token usage scale with difficulty in inference-time scaling,” the researchers write.
The researchers also evaluated the Pareto frontier of LLM inference, analyzing both accuracy and computational cost (i.e., the number of tokens generated). This helps identify how efficiently models achieve their results.
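To illustrate the kind of analysis involved, here is a small Python sketch that extracts a cost-accuracy Pareto frontier from a set of (model, average tokens, accuracy) measurements. The data and function names are illustrative assumptions, not the study’s actual tooling.

```python
from typing import List, Tuple

def pareto_frontier(points: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Given (model_name, avg_tokens, accuracy) triples, keep only the
    configurations that are not dominated, i.e., no other point offers
    equal-or-lower token cost with equal-or-higher accuracy."""
    frontier = []
    # Sort by cost ascending, then accuracy descending, so each kept
    # point must strictly improve accuracy over everything cheaper.
    for name, tokens, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, tokens, acc))
    return frontier

# Example with made-up numbers purely for illustration:
measurements = [
    ("model-a", 1_200, 0.62),
    ("model-b", 6_500, 0.71),
    ("model-c", 5_000, 0.68),
]
print(pareto_frontier(measurements))
```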

They also introduced a “conventional-to-reasoning gap” measure that compares the best possible performance of conventional models (using an ideal “best-of-N” selection) with the average performance of reasoning models, estimating the potential gains achievable through better training or verification techniques.
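As a rough sketch of how such a gap could be computed, assuming per-question lists of correct/incorrect flags for each model family (the structure below is an assumption for illustration, not the paper’s code):

```python
from statistics import mean

def best_of_n_accuracy(samples_per_question):
    """Ideal best-of-N: a question counts as solved if any of the N
    sampled answers from the conventional model is correct."""
    return mean(any(flags) for flags in samples_per_question)

def average_accuracy(samples_per_question):
    """Average accuracy of the reasoning model across its samples."""
    return mean(mean(flags) for flags in samples_per_question)

def conventional_to_reasoning_gap(conventional_samples, reasoning_samples):
    # A positive gap suggests headroom that better verification or
    # training could unlock for conventional models.
    return best_of_n_accuracy(conventional_samples) - average_accuracy(reasoning_samples)
```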
More compute isn’t always the answer
The study yielded several important insights that challenge common assumptions about inference-time scaling.
Benefits vary significantly: Models tuned for reasoning generally outperform conventional models on these tasks, but the degree of improvement varies widely by domain and task, and gains often diminish as problem complexity increases. For example, performance improvements seen on math problems did not always translate equally to scientific reasoning or planning tasks.
Token inefficiency is rife: The researchers observed high variability in token consumption, even among models achieving similar accuracy. For example, on the AIME 2025 math benchmark, DeepSeek R1 used over five times more tokens than Claude 3.7 Sonnet for roughly comparable average accuracy.
More tokens do not lead to higher accuracy: Contrary to the intuition that longer reasoning chains mean better reasoning, the study found this is not always true. “Surprisingly, we also observe that longer generations relative to the same model can sometimes be an indicator of models struggling, rather than improved reflection,” the paper states. “Similarly, when comparing different reasoning models, higher token usage is not always associated with better accuracy. These findings motivate the need for more purposeful and cost-effective scaling approaches.”
Cost nondeterminism: Perhaps most concerning for enterprise users, repeated queries to the same model for the same problem can result in highly variable token usage. This means the cost of running a query can fluctuate significantly, even when the model consistently provides the correct answer.

The promise of verification mechanisms: Scaling performance consistently improved across all models and benchmarks when simulated with a “perfect verifier” (using the best-of-N results).
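Here is a minimal sketch of what best-of-N selection with a perfect verifier could look like, assuming a hypothetical `is_correct` oracle. In the study the perfect verifier is simulated; in a real system it could be a domain checker such as a SAT solver or constraint validator.

```python
from typing import Callable, Optional

def best_of_n_with_verifier(generate_answer: Callable[[str], str],
                            is_correct: Callable[[str, str], bool],
                            question: str,
                            n_samples: int = 8) -> Optional[str]:
    """Sample up to n_samples answers; return the first one the
    verifier accepts. Returning early also shortens reasoning traces
    and saves tokens when an early sample is already correct."""
    for _ in range(n_samples):
        answer = generate_answer(question)
        if is_correct(question, answer):
            return answer
    return None  # no verified answer within the sampling budget
```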
Conventional models can sometimes match reasoning models: By significantly increasing inference calls (up to 50x more in some experiments), conventional models such as GPT-4o can approach the performance levels of dedicated reasoning models, particularly on less complex tasks. However, these gains diminish rapidly in highly complex settings, indicating that brute-force scaling has its limits.

Implications for the enterprise
These findings carry considerable weight for developers and enterprise adopters of LLMs. The “cost nondeterminism” issue is particularly stark, as it makes budgeting difficult. As the researchers point out, “Ideally, developers and users would prefer models with low standard deviation on token usage per instance for cost predictability.”
“The profiling we’ve done can be useful for developers as a tool to pick which models are less volatile for the same or different prompts,” said Besmira Nushi of Microsoft Research. “Ideally, one would want to pick a model that has low standard deviation for correct inputs.”
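A small sketch of the kind of profiling this implies: run the same prompt repeatedly and compare models by the standard deviation of tokens used on correct runs. The run-record format below is an illustrative assumption.

```python
from statistics import mean, stdev

def token_volatility(runs):
    """runs: list of dicts like {"tokens": int, "correct": bool} from
    repeated calls on the same prompt. Returns (mean, stdev) of token
    usage over the correct runs only."""
    tokens = [r["tokens"] for r in runs if r["correct"]]
    if len(tokens) < 2:
        return (tokens[0] if tokens else 0, 0.0)
    return (mean(tokens), stdev(tokens))

# Lower standard deviation means more predictable cost per query.
```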

The study also provides useful insight into the correlation between model accuracy and response length. For example, the data shows that math generations longer than roughly 11,000 tokens are very unlikely to be correct, so such generations should either be stopped at that point or restarted with some sequential feedback. However, Nushi points out that models that allow such post-hoc mitigations also tend to have a cleaner separation between correct and incorrect samples.
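In practice this suggests a simple budget policy: cap each generation near that token threshold and, if no answer has emerged, restart with critic feedback rather than letting the trace keep growing. A minimal sketch under those assumptions, with hypothetical `generate` and `critique` callables:

```python
def answer_with_budget(generate, critique, question,
                       max_tokens: int = 11_000, max_rounds: int = 3):
    """Cap each generation at the token budget; if no final answer was
    extracted, retry with critic feedback (sequential scaling) instead
    of letting the reasoning trace keep growing."""
    feedback = None
    for _ in range(max_rounds):
        # generate returns the raw trace and the extracted answer (or None)
        trace, answer = generate(question, feedback=feedback,
                                 max_tokens=max_tokens)
        if answer is not None:
            return answer
        feedback = critique(question, trace)  # guide the next attempt
    return None
```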

“Ultimately, it is also the responsibility of model builders to think about reducing accuracy and cost nondeterminism, and we expect much of this to happen as the methods mature,” Nushi said. “Alongside cost nondeterminism, accuracy nondeterminism also applies.”
Another important finding is the consistent performance boost from perfect verifiers, which highlights a critical area for future work: building robust and broadly applicable verification mechanisms.
“The availability of stronger verifiers can have different types of impact,” Nushi said. “If used efficiently, these can shorten reasoning traces.”
Strong verifiers can also become a central part of enterprise agentic AI solutions. Many enterprise stakeholders already have such verifiers in place, such as SAT solvers and logistics validity checkers, which may need to be repurposed for more agentic solutions.
“The questions for the future are how such existing techniques can be combined with AI-driven interfaces, and what language connects the two,” Nushi said. “The necessity of connecting the two comes from the fact that users will not always formulate their queries in a formal way; they will want to use a natural language interface and expect the solution in a similar format or as a final action (for example, proposing a meeting invite).”