RAGEN: AI Framework Addresses Instability in LLM Agents

Friday, Apr 25, 2025

Researchers have developed RAGEN, an AI framework designed to address the instability LLM agents show when handling complex, multi-step scenarios.

Training these AI agents involves substantial challenges, especially when decisions require multiple steps and face unpredictable environmental feedback. Although reinforcement learning (RL) has been successful in static tasks such as solving math problems or generating code, its use in dynamic, multi-turn agent training has not been widely explored.

To bridge this gap, a team from institutions including Northwestern University, Stanford University, Microsoft, and New York University introduced StarPO (State-Thinking-Actions-Reward Policy Optimization).

StarPO offers a comprehensive method for training agents at the trajectory level, optimizing the complete sequence of interactions rather than individual steps.
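To make the distinction concrete, the following is a minimal sketch, not drawn from the paper's code, of a trajectory-level policy-gradient update in which a single return for the whole interaction weights every step of it; the function name and values are illustrative assumptions.

import torch

# Minimal sketch (illustrative, not the exact StarPO objective): one return for
# the complete multi-turn trajectory weights the log-probability of every step.
def trajectory_level_loss(step_log_probs, trajectory_return):
    # step_log_probs: 1-D tensor of log pi(a_t | s_t) for each step of one trajectory
    # trajectory_return: scalar reward assigned to the full interaction sequence
    return -(trajectory_return * step_log_probs.sum())

# Example: three steps share a single outcome reward, so credit flows to all of them.
log_probs = torch.tensor([-0.7, -1.2, -0.4], requires_grad=True)
loss = trajectory_level_loss(log_probs, trajectory_return=1.0)
loss.backward()  # gradients reach every step, not just the last one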

Alongside this is RAGEN, a modular system created to implement StarPO. This system allows for the training and evaluation of LLM agents, emphasizing their reasoning abilities under RL. RAGEN provides the essential infrastructure for rollouts, reward assignment, and optimization within multi-turn, stochastic (randomly determined) environments.
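A rollout loop for such a system might look roughly like the sketch below; the agent and env objects and their methods are hypothetical stand-ins for illustration, not RAGEN's actual API.

import random

def collect_rollout(agent, env, max_turns=10, seed=None):
    # Run one multi-turn episode in a stochastic environment and return the
    # trajectory (thought, action, reward per turn) plus the total reward.
    rng = random.Random(seed)
    state = env.reset(rng)
    trajectory, total_reward = [], 0.0
    for _ in range(max_turns):
        thought, action = agent.act(state)           # reasoning step plus chosen action
        state, reward, done = env.step(action, rng)  # stochastic environment transition
        trajectory.append((thought, action, reward))
        total_reward += reward
        if done:
            break
    return trajectory, total_reward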

To separate core learning challenges from confounding factors like vast pre-existing knowledge or task-specific engineering, the researchers evaluated LLMs using RAGEN in three controlled, minimalistic symbolic gaming environments.

These environments enable a clear evaluation of how agents develop decision-making policies solely through interaction.

The study highlighted three important findings regarding the training of self-evolving LLM agents:

The ‘Echo Trap’ and the need for stability

A frequent issue found during multi-turn RL training was called the “Echo Trap.” Agents initially showed improvement but later experienced performance decline, overfitting to locally rewarded reasoning patterns.

The collapse was marked by shrinking reward variance, falling output entropy (less randomness in the model's outputs), and sudden gradient spikes, with the drops in reward standard deviation and entropy serving as early warning signs of the instability.
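Those two warning signals are simple to monitor during training. The check below is an illustrative sketch with made-up thresholds, not the paper's diagnostic:

import statistics

def collapse_warning(recent_rewards, recent_entropies,
                     reward_std_floor=0.05, entropy_floor=0.5):
    # Flag a likely "Echo Trap" when rewards have become nearly uniform and
    # the policy's outputs nearly deterministic. Thresholds are illustrative.
    reward_std = statistics.pstdev(recent_rewards)
    mean_entropy = statistics.fmean(recent_entropies)
    return reward_std < reward_std_floor and mean_entropy < entropy_floor

# Example: rewards barely vary and entropy has dropped, so the check fires.
print(collapse_warning([0.81, 0.80, 0.80, 0.81], [0.31, 0.29, 0.30]))  # True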

To address this, the team created StarPO-S, a stabilized version of the framework that adds trajectory filtering and related stabilization techniques.

StarPO-S consistently delayed collapse and enhanced final task performance compared to the standard StarPO.
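The filtering idea can be sketched as follows; this is a hedged illustration of variance-based trajectory filtering in the spirit described above, not the authors' exact procedure:

import statistics

def filter_rollout_groups(groups, keep_fraction=0.25):
    # groups: list of (task_id, [rewards of several rollouts for that task]).
    # Keep only the tasks whose rollout rewards still vary, since groups with
    # uniform rewards carry little learning signal. Details are illustrative.
    ranked = sorted(groups, key=lambda g: statistics.pstdev(g[1]), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

groups = [("task_a", [1.0, 1.0, 1.0]),   # no variance: dropped
          ("task_b", [0.0, 1.0, 0.0]),   # high variance: kept
          ("task_c", [0.2, 0.8, 0.5]),
          ("task_d", [0.0, 0.0, 0.0])]
print([task for task, _ in filter_rollout_groups(groups)])  # ['task_b']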

Rollout quality is crucial

The characteristics of the ‘rollouts’ (the simulated interaction sequences used for training) heavily influence learning. The key factors identified were task diversity, an appropriate action budget per turn, and the freshness of the rollouts, and maintaining all three proved essential for stable training.
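As a rough illustration, those three factors map onto a handful of training-configuration knobs; the field names and values below are assumptions, not RAGEN's actual settings.

from dataclasses import dataclass

@dataclass
class RolloutConfig:
    tasks_per_batch: int = 16          # task diversity: many distinct prompts per update
    rollouts_per_task: int = 8         # several responses per prompt for comparison
    max_actions_per_turn: int = 5      # action budget: room to act without meandering
    resample_every_n_updates: int = 1  # freshness: regenerate rollouts with the current policy

print(RolloutConfig())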

Reasoning requires careful reward design

Merely prompting models to ‘think’ doesn't ensure that meaningful reasoning emerges, particularly in multi-turn tasks; the study found that reasoning traces tended to fade or give way to superficial patterns as training progressed.

This implies that ordinary trajectory-level rewards (often sparse and outcome-based) are inadequate.

“Without fine-grained, reasoning-aware reward signals, agent reasoning scarcely emerges through multi-turn RL,” the researchers write.

The researchers suggest that future work should investigate rewards that explicitly assess the quality of intermediate reasoning steps, perhaps through format-based penalties or rewards for explanation quality, rather than relying solely on final outcomes.
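As a concrete but hypothetical illustration of that suggestion, the reward below keeps the outcome signal while applying a format-based penalty when the reasoning trace is missing or empty; the tag format, weights, and function are assumptions, not the paper's reward.

import re

def shaped_reward(response: str, outcome_reward: float,
                  format_penalty: float = 0.5) -> float:
    # Penalize responses whose <think>...</think> block is absent or empty,
    # so the reward reflects more than the final answer alone.
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    has_reasoning = bool(match and match.group(1).strip())
    return outcome_reward - (0.0 if has_reasoning else format_penalty)

print(shaped_reward("<think>move right, then up</think><answer>up</answer>", 1.0))  # 1.0
print(shaped_reward("<answer>up</answer>", 1.0))                                    # 0.5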

The RAGEN system and StarPO framework represent progress towards training LLM agents capable of reasoning and adapting through interaction in complex, unpredictable environments.

This research emphasizes the unique stability challenges posed by multi-turn RL and offers specific strategies—like StarPO-S's filtering and stabilization techniques—to mitigate them. It also highlights the critical role of rollout generation strategies and the need for more sophisticated reward mechanisms to nurture genuine reasoning, rather than superficial strategies or misconceptions.

“Why does your RL training often collapse?” the researchers asked in a thread announcing the paper. “In our new RAGEN paper, we explore what breaks during the training of LLM agents with multi-turn reinforcement learning—and how to possibly fix it.”

While the researchers acknowledge limitations, including the need to test on larger models and to adapt the approach to domains without easily verifiable rewards, the work charts a scalable and principled path for building AI systems in areas such as theorem proving, software engineering, and scientific discovery, where complex interaction and verifiable outcomes are essential.

(Image by Gerd Altmann)
