RAGEN: AI Framework Addresses Instability in LLM Agents
Friday, Apr 25, 2025

Researchers have developed RAGEN, an AI framework to address the instability of LLM agents when managing intricate scenarios.
Training these AI agents involves substantial challenges, especially when decisions require multiple steps and face unpredictable environmental feedback. Although reinforcement learning (RL) has been successful in static tasks such as solving math problems or generating code, its use in dynamic, multi-turn agent training has not been widely explored.
To bridge this gap, a team from institutions including Northwestern University, Stanford University, Microsoft, and New York University introduced StarPO (State-Thinking-Actions-Reward Policy Optimisation).
StarPO offers a comprehensive method for training agents at the trajectory level, optimizing the complete sequence of interactions rather than individual steps.
Alongside this is RAGEN, a modular system created to implement StarPO. This system allows for the training and evaluation of LLM agents, emphasizing their reasoning abilities under RL. RAGEN provides the essential infrastructure for rollouts, reward assignment, and optimization within multi-turn, stochastic (randomly determined) environments.
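To make the distinction concrete, here is a minimal sketch of what trajectory-level training can look like in code. The environment interface, the agent.act method, and the REINFORCE-style update are illustrative assumptions for this article, not RAGEN's or StarPO's actual API.

```python
import torch


def rollout(env, agent, max_turns=10):
    """Collect one multi-turn trajectory: the agent reasons, acts, and
    receives (possibly stochastic) feedback until the episode ends."""
    obs, done, trajectory = env.reset(), False, []
    for _ in range(max_turns):
        thought, action, log_prob = agent.act(obs)    # reasoning text, action, log-prob
        next_obs, reward, done = env.step(action)     # environment feedback
        trajectory.append((obs, thought, action, log_prob, reward))
        obs = next_obs
        if done:
            break
    return trajectory


def trajectory_level_update(trajectories, optimizer, baseline=0.0):
    """Optimise whole interaction sequences: every turn in a trajectory is
    credited with that trajectory's total return, not a per-step score."""
    loss = torch.tensor(0.0)
    for traj in trajectories:
        total_return = sum(reward for *_, reward in traj)
        advantage = total_return - baseline
        for *_, log_prob, _reward in traj:
            loss = loss - log_prob * advantage        # REINFORCE-style credit across turns
    loss = loss / max(len(trajectories), 1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the sketch is the credit assignment: every turn in a trajectory is reinforced by that trajectory's overall outcome, rather than being scored in isolation.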
To separate core learning challenges from confounding factors like vast pre-existing knowledge or task-specific engineering, the researchers evaluated LLMs using RAGEN in three controlled, minimalistic symbolic gaming environments.
These environments enable a clear evaluation of how agents develop decision-making policies solely through interaction.
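The article does not detail the specific tasks, but an environment in this spirit can be as small as a toy bandit with symbolic actions and random payoffs. The example below is purely illustrative and is not one of the study's benchmarks.

```python
import random


class TwoArmedBandit:
    """A toy stochastic, symbolic environment: the agent sees only symbols,
    and feedback is random, so any decision-making policy must be learned
    purely through interaction."""

    def reset(self):
        self.done = False
        return "choose: <left> or <right>"

    def step(self, action):
        if self.done:
            raise RuntimeError("episode finished; call reset()")
        self.done = True
        # Stochastic payoffs: <left> pays more often, but not always.
        if action == "<left>":
            reward = 1.0 if random.random() < 0.7 else 0.0
        elif action == "<right>":
            reward = 1.0 if random.random() < 0.3 else 0.0
        else:
            reward = -0.1  # malformed action
        return "episode over", reward, self.done
```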
The study highlighted three important findings regarding the training of self-evolving LLM agents:
The ‘Echo Trap’ and the need for stability
A frequent issue found during multi-turn RL training was dubbed the “Echo Trap”: agents initially improved but later declined, overfitting to locally rewarded reasoning patterns. The collapse was marked by falling reward variance, reduced output randomness (entropy), and sudden gradient spikes, with drops in reward standard deviation and output entropy serving as early warning signs.
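In practice, those warning signs are straightforward to monitor. The sketch below shows one way to flag them from a batch of rollouts; the function name and thresholds are illustrative, not taken from the paper.

```python
import math
from collections import Counter


def collapse_warning(batch_rewards, sampled_tokens,
                     std_floor=0.05, entropy_floor=0.5):
    """Flag early 'Echo Trap' symptoms: reward standard deviation and output
    entropy both dropping for a training batch. Thresholds are illustrative."""
    mean = sum(batch_rewards) / len(batch_rewards)
    reward_std = math.sqrt(
        sum((r - mean) ** 2 for r in batch_rewards) / len(batch_rewards)
    )

    counts = Counter(sampled_tokens)                  # tokens sampled across the batch
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())

    return reward_std < std_floor or entropy < entropy_floor
```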
To address this, the team created StarPO-S, a stabilized version of the framework. StarPO-S filters out low-variance (uninformative) rollouts so that training concentrates on task instances where the agent is still uncertain, alongside adjustments to the policy-update clipping and KL penalty.
StarPO-S consistently delayed collapse and improved final task performance compared with the standard StarPO.
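Trajectory filtering of this kind can be sketched in a few lines: rank task instances by the variance of their sampled returns and keep the most uncertain ones. The data layout (grouped_rollouts, a .reward field per step) and the keep fraction below are assumptions for illustration, not RAGEN's implementation.

```python
import statistics


def filter_rollouts(grouped_rollouts, keep_fraction=0.5):
    """Keep the task instances whose sampled trajectories show the highest
    reward variance, i.e. where the agent is still uncertain and the
    gradient signal is most informative."""
    scored = []
    for task_id, trajectories in grouped_rollouts.items():
        returns = [sum(step.reward for step in traj) for traj in trajectories]
        variance = statistics.pvariance(returns) if len(returns) > 1 else 0.0
        scored.append((variance, task_id, trajectories))

    scored.sort(key=lambda item: item[0], reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return {task_id: trajs for _, task_id, trajs in scored[:keep]}
```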
Rollout quality is crucial
The characteristics of the ‘rollouts’ (simulated interaction sequences used for training) heavily influence learning. The key factors identified are task diversity, an appropriate action budget per turn, and rollout freshness, and maintaining all three is essential for stable training.
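Concretely, these factors tend to surface as a handful of training knobs. The configuration below is a hypothetical illustration; the names and defaults are not RAGEN's actual settings.

```python
from dataclasses import dataclass


@dataclass
class RolloutConfig:
    """Hypothetical knobs for the rollout factors the study highlights."""
    prompts_per_batch: int = 16        # task diversity: distinct starting states per update
    rollouts_per_prompt: int = 4       # multiple attempts per task, so outcomes can be compared
    max_actions_per_turn: int = 5      # action budget: room to plan without rambling
    resample_every_n_updates: int = 1  # freshness: keep rollouts aligned with the current policy
```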
Reasoning requires careful reward design
Merely prompting models to ‘think’ doesn't ensure that meaningful reasoning emerges, particularly in multi-turn tasks. The study indicated that ordinary trajectory-level rewards (often sparse and outcome-based) are inadequate to elicit it.
“Without fine-grained, reasoning-aware reward signals, agent reasoning scarcely emerges through multi-turn RL.”
The researchers suggest that future work should investigate rewards that explicitly assess the quality of intermediate reasoning steps, perhaps through format-based penalties or by rewarding explanation quality, rather than scoring final outcomes alone.
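A reasoning-aware reward of the sort being suggested could, for example, shape the sparse outcome reward with cheap structural checks on the model's reasoning trace. The <think> tag convention and the heuristics below are illustrative assumptions, not the paper's reward function.

```python
import re


def reasoning_aware_reward(response, outcome_reward,
                           format_bonus=0.2, min_words=20):
    """Shape the sparse outcome reward with simple proxies for reasoning
    quality: a format check and a minimum-substance check on the trace."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    shaped = outcome_reward
    if match is None:
        shaped -= format_bonus                  # format-based penalty: no visible reasoning
    elif len(match.group(1).split()) < min_words:
        shaped -= format_bonus / 2              # reasoning present, but too thin to credit
    else:
        shaped += format_bonus                  # well-formed, substantive reasoning trace
    return shaped
```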
The RAGEN system and StarPO framework represent progress towards training LLM agents capable of reasoning and adapting through interaction in complex, unpredictable environments.
This research emphasizes the unique stability challenges posed by multi-turn RL and offers specific strategies—like StarPO-S's filtering and stabilization techniques—to mitigate them. It also highlights the critical role of rollout generation strategies and the need for more sophisticated reward mechanisms to nurture genuine reasoning, rather than superficial strategies or misconceptions.
Announcing the work, the researchers framed the question bluntly: “Why does your RL training often collapse? In our new RAGEN paper, we explore what breaks during the training of LLM agents with multi-turn reinforcement learning—and how to possibly fix it.”
Despite acknowledging limitations—such as the need to test on larger models and to adapt the approach to domains lacking easily verifiable rewards—the research points to a scalable and principled path for building AI systems in areas like theorem proving, software engineering, and scientific discovery, where complex interaction and verifiable outcomes are essential.