A key recent advancement is the exploration of reinforcement learning (RL) techniques to improve LLMs beyond traditional supervised fine-tuning. With RL, the model learns optimal responses through reward signals, enhancing its reasoning and decision-making capabilities. RL introduces a feedback-driven training loop that aligns more closely with human-like learning processes, particularly in tasks involving step-by-step problem solving or mathematical reasoning. This intersection of LLMs and RL is becoming a prominent area of academic research and industry innovation.
A central challenge in improving LLMs for complex reasoning tasks is ensuring that these models develop better thinking skills rather than simply longer outputs. In reinforcement learning-based training of LLMs, a pattern has emerged in which models begin to produce excessively long responses without necessarily improving answer quality. This raises concerns about optimization biases in RL methods that may favor verbosity over correctness. Another complication arises from the base models themselves: some already show signs of reasoning ability, making it difficult to isolate the actual effect of RL tuning. It is therefore essential to understand how training strategies and model foundations affect final performance.
Previously, post-training of LLMs with reinforcement learning often relied on algorithms such as Proximal Policy Optimization (PPO), commonly used in various open-source implementations. These implementations frequently included a response-length normalization step, which inadvertently introduced biases that favor longer or shorter outputs depending on the correctness of the response. In particular, Group Relative Policy Optimization (GRPO) was introduced as a variant that optimizes policy updates at the group level. Although effective, GRPO has been criticized for embedding subtle optimization biases that affect the length and quality of model responses. These existing methods, while innovative, show limitations that obscure the actual gains from reinforcement learning.
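To illustrate the length bias described above, here is a minimal toy sketch, not the authors' code (the function and tensor names are hypothetical), contrasting a per-response length-normalized policy-gradient loss with a plain token sum:

```python
import torch

def policy_gradient_loss(logprob_ratios, advantages, response_mask, length_normalize=True):
    """Toy PPO/GRPO-style policy-gradient loss for a batch of sampled responses.

    logprob_ratios: (batch, seq_len) tensor of pi_theta / pi_old ratios per token
    advantages:     (batch,) tensor with one scalar advantage per response
    response_mask:  (batch, seq_len) tensor, 1 for response tokens, 0 for padding
    """
    # Broadcast each response-level advantage over its tokens.
    per_token = logprob_ratios * advantages[:, None] * response_mask

    if length_normalize:
        # Dividing by each response's own length dilutes the per-token signal of
        # long responses: a long *incorrect* response (negative advantage) is
        # penalized less per token, so length inflation goes under-punished.
        per_response = per_token.sum(dim=1) / response_mask.sum(dim=1)
    else:
        # Plain token sum: every token contributes on the same scale,
        # regardless of how long its response is.
        per_response = per_token.sum(dim=1)

    # Negative sign because optimizers minimize; this maximizes expected reward.
    return -per_response.mean()
```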
Researchers from Sea AI Lab, the National University of Singapore, and Singapore Management University have introduced a new approach called Dr. GRPO (Group Relative Policy Optimization Done Right). This method removes the problematic normalization terms from the GRPO formulation. Specifically, it eliminates the response-length and standard-deviation scaling factors that caused imbalances in model updates. The revised algorithm computes gradients more equitably across different responses and question types. The researchers applied this method to train the open-source base model Qwen2.5-Math-7B and demonstrated its effectiveness on multiple benchmarks. The training process took 27 hours of compute on 8× A100 GPUs, a relatively modest setup given the results achieved.
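The standard-deviation term operates at the question level: GRPO divides each group's mean-centered rewards by their standard deviation before they enter the loss. The sketch below is a rough illustration of dropping that scaling, not the authors' released implementation (the function name and flag are ours):

```python
import torch

def group_relative_advantages(rewards, scale_by_std=False, eps=1e-6):
    """Advantages for one group of sampled answers to the same question.

    rewards: (group_size,) tensor of scalar rewards (e.g. 1.0 correct, 0.0 wrong).
    scale_by_std=True mimics the original GRPO normalization;
    scale_by_std=False drops it, in the spirit of Dr. GRPO.
    """
    centered = rewards - rewards.mean()
    if scale_by_std:
        # GRPO: dividing by the group's reward std inflates advantages for
        # questions whose sampled answers all score similarly (very easy or
        # very hard ones), skewing which questions dominate the update.
        return centered / (rewards.std() + eps)
    # Dr. GRPO: mean-centered rewards only, so every question contributes
    # to the gradient on a comparable scale.
    return centered
```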
The researchers tested the method on prominent mathematical reasoning benchmarks, including AIME 2024, AMC, MATH500, Minerva Math, and OlympiadBench. The model trained with Dr. GRPO achieved 43.3% accuracy on AIME 2024, significantly outperforming SimpleRL-Zero-7B (36.0%), PRIME-Zero-7B (27.6%), and OpenReasoner-Zero-7B (16.7%). It also showed strong performance across the other tasks: 40.9% on MATH500, 45.8% on Minerva Math, and 62.7% on OlympiadBench. These results support the effectiveness of the bias-free RL method. Importantly, the model not only performed better but also used tokens more efficiently: incorrect responses became shorter and more focused, a notable shift from previous training methods that encouraged overly long answers regardless of accuracy.
Beyond the training algorithm, the team also examined the nature of the base models used in R1-Zero-style RL settings. They found that several models, such as Qwen2.5, displayed advanced capabilities before any RL training, likely because they were pretrained on concatenated question-answer data. For example, the Qwen2.5-Math-7B model achieved an average accuracy of 38.2% without any RL tuning, surpassing many models trained using traditional methods. This pre-existing reasoning ability complicates claims about the benefits of RL, since improvements may partly stem from earlier pretraining strategies rather than new learning through reinforcement. Another model inspected, DeepSeek-V3-Base, showed instances of spontaneous "Aha moments" and self-reflection before RL, further suggesting that some reasoning skills may already be embedded in base models.

Performance dynamics were carefully tracked during training. With Dr. GRPO, the model avoided the tendency to inflate response length. The evaluation revealed that Dr. GRPO stabilized output length while increasing the reward signal, suggesting a direct link between training and improved accuracy rather than growing redundancy. In contrast, standard GRPO gradually lengthened incorrect responses, giving a false appearance of improvement. This observation is consistent with the finding that many open-source PPO implementations unintentionally introduce response-length bias.
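A minimal sketch of that kind of monitoring is shown below; it assumes a hypothetical `rollout()` helper and any tokenizer with an `encode()` method, and simply logs mean reward and mean response length per step so that length inflation is visible alongside reward growth.

```python
def track_dynamics(rollout, tokenizer, num_steps):
    """Log mean reward and mean response length across RL training steps."""
    history = []
    for step in range(num_steps):
        # `rollout` is a hypothetical helper that runs one RL step and returns
        # the sampled response strings and their scalar rewards.
        responses, rewards = rollout(step)
        mean_len = sum(len(tokenizer.encode(r)) for r in responses) / len(responses)
        mean_reward = sum(rewards) / len(rewards)
        history.append({"step": step, "reward": mean_reward, "length": mean_len})
        # Healthy run: reward rises while length stays roughly flat.
        # Reward "gains" that only come with ballooning length hint at length bias.
        print(f"step={step} mean_reward={mean_reward:.3f} mean_length={mean_len:.1f}")
    return history
```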

The researchers also investigated how prompt templates and question sets influence model behavior. The Qwen2.5-Math-1.5B base model performed best without any prompt template, scoring 61.6% on Minerva Math and 45.8% on MATH500. Surprisingly, applying templates often degraded performance before RL recovered it, highlighting how a mismatch between a model's pretraining format and its inference-time prompt can obscure its true reasoning capability. The results also challenged the assumption that broader data coverage always leads to better reasoning: models trained on small, simple question sets such as GSM-8K often matched or outperformed those trained on larger datasets.
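As a concrete illustration of evaluating a base model with and without a prompt template, here is a small Hugging Face-style sketch; the model ID, template string, and decoding settings are assumptions for illustration, not the paper's exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-Math-1.5B"  # assumed Hugging Face ID for the base model

# Placeholder chat-style template; the exact template used in the paper may differ.
TEMPLATE = "User: {question}\nPlease reason step by step.\nAssistant:"

def answer(question: str, use_template: bool) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    prompt = TEMPLATE.format(question=question) if use_template else question
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Strip the prompt tokens and decode only the generated continuation.
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

# Comparing answer(q, use_template=False) vs. answer(q, use_template=True)
# over a benchmark makes the template's effect on the base model visible.
```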
Some important points from the study include:
- The DeepSeek-V3-Base and Qwen2.5 models show reasoning ability even before RL, indicating a strong pretraining effect.
- Dr. GRPO eliminates biases in GRPO by removing the length and reward-standard-deviation normalization terms, improving token efficiency.
- The Qwen2.5-Math-7B model trained with Dr. GRPO scored 43.3% on AIME 2024, 62.7% on OlympiadBench, 45.8% on Minerva Math, and 40.9% on MATH500, achieving a higher mean score than GRPO across all benchmarks.
- The Qwen2.5 models perform better without prompt templates, suggesting they may have been pretrained on Q&A-formatted data.
- Small question sets such as GSM-8K can perform as well as or better than much larger ones, contrary to expectations.
- Open-source PPO implementations often include unintended response-length biases that Dr. GRPO successfully removes.
In conclusion, this study offers important insights into how RL affects the behavior of large language models. The researchers found that pretraining plays a major role in determining baseline ability. They also demonstrated that optimization biases in common RL algorithms can mislead both training and evaluation. The introduction of Dr. GRPO corrects these issues, leading to more interpretable and efficient model training. With just 27 hours of training, their model achieved state-of-the-art results on major mathematical reasoning benchmarks. These findings should reshape how the community evaluates RL-enhanced LLMs, focusing on method transparency and the characteristics of the base model rather than performance metrics alone.
Please see the paper and the GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 85k+ ML SubReddit.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform distinguished by its in-depth coverage of machine learning and deep learning news, presented in a way that is technically sound yet accessible to a broad audience. The platform receives over 2 million monthly views, reflecting its popularity among readers.