Researchers have introduced RAGEN, an AI framework designed to counter the instability of LLM agents when they handle complex situations.
Training these AI agents presents significant hurdles, particularly when decisions span multiple steps and involve unpredictable feedback from the environment. Reinforcement learning (RL) has shown promise in static tasks such as solving mathematical problems and generating code, but its application to dynamic, multi-turn agent training remains far less explored.
To address this gap, a collaborative team from institutions including Northwestern University, Stanford University, Microsoft, and New York University has proposed StarPO (State-Thinking-Actions-Reward Policy Optimization).
StarPO offers a generalized approach to training agents at the trajectory level (i.e., optimizing the entire sequence of interactions, not just individual actions).
Accompanying StarPO is RAGEN, a modular system built to implement the framework. RAGEN enables the training and evaluation of LLM agents, with a particular focus on their reasoning capabilities under RL. It provides the infrastructure needed for rollouts, reward assignment, and optimization within multi-turn, stochastic (randomly determined) environments.
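To make the trajectory-level idea concrete, here is a minimal, hypothetical Python sketch of the kind of multi-turn rollout loop such a system manages; the `ToyEnv`, `ToyAgent`, and `collect_rollout` names are illustrative stand-ins, not RAGEN's actual API.

```python
import random
from dataclasses import dataclass, field

# Illustrative sketch only: ToyEnv and ToyAgent are hypothetical stand-ins,
# not the actual RAGEN components.

@dataclass
class Trajectory:
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    total_reward: float = 0.0  # reward attributed to the whole interaction

class ToyEnv:
    """A tiny stochastic multi-turn environment."""
    def reset(self):
        self.turn = 0
        return "start"
    def step(self, action):
        self.turn += 1
        reward = random.random()          # stochastic feedback from the environment
        done = self.turn >= 3
        return f"obs_{self.turn}", reward, done

class ToyAgent:
    """Stands in for an LLM policy that emits one action per turn."""
    def act(self, observation):
        return f"action_given_{observation}"

def collect_rollout(env, agent, max_turns=5):
    """Roll the agent through the environment, recording the full trajectory."""
    traj, obs = Trajectory(), env.reset()
    for _ in range(max_turns):
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        traj.observations.append(obs)
        traj.actions.append(action)
        traj.total_reward += reward       # optimization later sees the whole sequence
        if done:
            break
    return traj

rollout = collect_rollout(ToyEnv(), ToyAgent())
print(len(rollout.actions), round(rollout.total_reward, 2))
```

The point of the sketch is simply that the unit of optimization is the entire trajectory, rather than a single prompt-response pair.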
Minimalist environments, maximal insight
The researchers tested LLMs using RAGEN across three deliberately minimalistic, controllable symbolic gaming environments, chosen to isolate the core learning challenges from confounding factors such as extensive pre-existing knowledge or task-specific engineering.
Bandit: A single-turn, stochastic task that tests risk-sensitive symbolic reasoning. The agent chooses between options (such as a “Phoenix” or “Dragon” arm) with different, initially unknown reward profiles.
Sokoban: A multi-turn, deterministic puzzle that requires foresight and planning, because actions (pushing boxes) are irreversible.
Frozen Lake: A multi-turn, stochastic grid-navigation task in which movement actions can randomly fail, demanding planning under uncertainty.
These environments allow for clear analysis of how agents learn decision policies purely through interaction.
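As a rough illustration, a minimal bandit environment of this kind might look like the following Python sketch; the arm names match the example above, but the reward distributions are invented purely for illustration.

```python
import random

# Hypothetical sketch of a single-turn, stochastic bandit environment.
# The arm names follow the article's example; the reward profiles are invented.

class TwoArmedBandit:
    """Each arm has a different, initially unknown reward distribution."""
    def __init__(self):
        self.arms = {
            "Phoenix": lambda: random.gauss(mu=1.0, sigma=0.1),  # illustrative: low risk, lower mean
            "Dragon":  lambda: random.gauss(mu=2.0, sigma=1.5),  # illustrative: high risk, higher mean
        }

    def step(self, arm_name):
        """A single turn: the agent picks an arm and observes a sampled reward."""
        return self.arms[arm_name]()

env = TwoArmedBandit()
print(env.step("Phoenix"), env.step("Dragon"))
```

Because the reward profiles are unknown to the agent, it can only learn a risk-sensitive choice policy by interacting and observing rewards.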
Important findings: Stability, rollouts, and reasoning
The study yielded three significant findings about training self-evolving LLM agents.
The “Echo Trap” and the need for stability
A recurring problem observed during multi-turn RL training was dubbed the “Echo Trap”: agents improve at first, but then suffer a performance collapse as they overfit to locally rewarded reasoning patterns.
This state was characterized by collapsing reward variance, falling entropy (a measure of randomness/exploration), and sudden gradient spikes (a sign of training instability). Early warning signs included drops in reward standard deviation and output entropy.
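A simple way to picture these early-warning signals is a monitor over recent training batches. The following Python sketch is a hypothetical illustration; the window size and thresholds are arbitrary assumptions, not values from the paper.

```python
import statistics

# Hypothetical monitor: flag the "Echo Trap" warning pattern when recent
# batches show both shrinking reward spread and low output entropy.
# Window and threshold values are illustrative assumptions.

def collapse_warning(reward_history, entropy_history, window=10,
                     min_reward_std=0.05, min_entropy=0.5):
    """Return True if recent batches show the early-warning signs described above."""
    if len(reward_history) < window or len(entropy_history) < window:
        return False
    reward_std = statistics.pstdev(reward_history[-window:])   # shrinking reward variance
    avg_entropy = sum(entropy_history[-window:]) / window      # falling output entropy
    return reward_std < min_reward_std and avg_entropy < min_entropy

# Example: nearly identical rewards plus low entropy trigger the warning.
print(collapse_warning([1.0] * 10, [0.1] * 10))   # True
print(collapse_warning([0.2, 0.9, 0.4, 1.0, 0.1, 0.8, 0.3, 0.7, 0.5, 0.6],
                       [2.0] * 10))               # False
```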
To combat this, the team developed StarPO-S, a stabilized version of the framework. StarPO-S incorporates:
Variance-based trajectory filtering: Focusing training on task instances where the agent's behavior shows greater uncertainty (higher reward variance) and discarding low-variance, uninformative rollouts. This improved both stability and efficiency.
Critic incorporation: Methods such as PPO (Proximal Policy Optimization), which use a “critic” to estimate value, generally showed better stability than critic-free methods such as GRPO (Group Relative Policy Optimization) in most tests.
Decoupled clipping and KL removal: Adapting techniques from other research (DAPO), including asymmetric clipping (which allows more aggressive learning from positive rewards) and removing KL divergence penalties (which encourages exploration), further boosted stability and performance. A rough sketch of the filtering and clipping ideas appears below.
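The sketch below illustrates, in Python, the general shape of two of these ideas: keeping only high-variance task instances and applying a PPO-style clipped objective with a wider upper bound. The function names, grouping scheme, and clip values are assumptions for illustration, not the authors' implementation.

```python
import statistics

# Hypothetical sketch of variance-based filtering and asymmetric clipping.
# Names and constants are illustrative, not taken from the StarPO-S code.

def filter_by_variance(task_groups, keep_fraction=0.5):
    """Keep the task instances whose rollouts have the highest reward variance.

    task_groups maps a task prompt to the list of rewards from its rollouts.
    Low-variance groups carry little learning signal and are discarded.
    """
    ranked = sorted(task_groups.items(),
                    key=lambda kv: statistics.pvariance(kv[1]),
                    reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return dict(ranked[:keep])

def asymmetric_clip(ratio, advantage, clip_low=0.2, clip_high=0.28):
    """PPO-style clipped surrogate with a wider upper clip bound.

    Letting the probability ratio move further upward allows the policy to
    learn more aggressively from positively rewarded behavior.
    """
    clipped = min(max(ratio, 1 - clip_low), 1 + clip_high)
    return min(ratio * advantage, clipped * advantage)

groups = {"taskA": [0.0, 1.0, 0.0, 1.0], "taskB": [0.5, 0.5, 0.5, 0.5]}
print(filter_by_variance(groups))                 # keeps only the high-variance task
print(asymmetric_clip(ratio=1.4, advantage=1.0))  # capped at the upper clip bound
```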
StarPO-S consistently delayed the collapse and improved final-task performance compared with vanilla StarPO.
Rollout quality is crucial
The properties of “rollouts” (simulated interaction trajectories used for training) have a significant impact on learning. The key factors identified were:
Task diversity: Training with a diverse set of initial states (prompts) helps generalization. The sweet spot appeared to be moderate diversity, enough to allow contrasting outcomes in similar scenarios.
Interaction granularity: Allowing multiple actions per turn (around 5-6 proved optimal) enables better planning within fixed turn limits, without introducing the noise that comes with excessively long action sequences.
Rollout frequency: Using fresh, up-to-date rollouts that reflect the agent's current policy is essential. More frequent sampling (approaching an “online” setting) leads to faster convergence and better generalization by reducing the mismatch between the policy and its training data.
Maintaining freshness, alongside an appropriate action budget and task variety, is key to stable training. A minimal sketch of the freshness trade-off follows.
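The sketch below is a hypothetical illustration of the rollout-frequency knob: regenerating rollouts with the current policy at every update (near-online) versus reusing a stale buffer. `sample_trajectory` and `rollout_every` are invented names, not RAGEN parameters.

```python
import random

# Hypothetical sketch of the rollout-frequency trade-off.
# sample_trajectory stands in for generating one fresh rollout with the
# agent's *current* policy; rollout_every is an illustrative knob.

def training_loop(sample_trajectory, updates=6, rollouts_per_batch=8, rollout_every=1):
    buffer = []
    for step in range(updates):
        if step % rollout_every == 0:
            # Refresh the buffer with rollouts from the current policy; a
            # smaller rollout_every keeps training data closer to on-policy.
            buffer = [sample_trajectory() for _ in range(rollouts_per_batch)]
        # A real system would run a policy update on `buffer` here.
        avg_reward = sum(buffer) / len(buffer)
        print(f"update {step} | avg trajectory reward {avg_reward:.2f}")

# Toy stand-in: each "trajectory" is reduced to a random total reward.
training_loop(lambda: random.random(), rollout_every=1)   # near-online, fresh data
training_loop(lambda: random.random(), rollout_every=6)   # stale buffer reused
```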
Reasoning requires careful reward design
Simply prompting the model to “think” does not guarantee that meaningful reasoning emerges, especially in multi-turn tasks. The study found:
Reasoning traces helped generalization in the simpler, single-turn Bandit task, even when symbolic cues conflicted with rewards.
In multi-turn tasks like Sokoban, however, the benefits of reasoning were limited, and the length of “thinking” segments consistently shrank during training. Agents often regressed to direct action selection, or produced “hallucinated reasoning” when rewards tracked only task success, revealing a “mismatch between thoughts and environment states.”
This suggests that standard trajectory-level rewards (often sparse and outcome-based) are insufficient.
“Without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL.”
The researchers suggest that future work should explore rewards that explicitly evaluate the quality of intermediate reasoning steps, perhaps through format-based penalties or rewards for explanation quality, rather than relying solely on final outcomes.
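As a purely illustrative example of what such a shaped reward might look like, the Python sketch below combines an outcome reward with a format-based penalty. The tag convention, weights, and length check are assumptions, not the authors' design.

```python
import re

# Hypothetical reasoning-aware reward: task outcome plus a format penalty.
# The <think> tag convention, weights, and length check are invented here.

def shaped_reward(response: str, task_success: bool,
                  outcome_weight: float = 1.0, format_penalty: float = 0.5) -> float:
    reward = outcome_weight if task_success else 0.0
    # Penalize responses that skip the expected <think>...</think> segment,
    # or whose "thinking" is trivially short.
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if think is None or len(think.group(1).split()) < 5:
        reward -= format_penalty
    return reward

print(shaped_reward("<think>push the box left, then up to the goal</think><answer>left</answer>", True))
print(shaped_reward("<answer>left</answer>", True))   # same outcome, penalized format
```

The design question the paper leaves open is how to score the content of the reasoning itself, not merely its presence or format.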
RAGEN and StarPO: A step toward self-evolving AI
The RAGEN system and the StarPO framework represent a step toward training LLM agents that can reason and adapt through interaction in complex, unpredictable environments.
The study highlights the distinctive stability challenges posed by multi-turn RL and offers concrete strategies, such as StarPO-S's filtering and stabilization techniques, to mitigate them. It also underscores the critical role of rollout-generation strategies and the need for more sophisticated reward mechanisms that foster genuine reasoning rather than superficial strategies or hallucinations.
While acknowledging limitations, including the need to test on larger models and to optimize for domains without easily verifiable rewards, the work opens a “scalable and principled path” for building AI agents in areas that demand complex interaction and verifiable outcomes, such as theorem proving, software engineering, and scientific discovery.