Large language models (LLMs) are rapidly being transformed into autonomous agents capable of performing complex tasks that require reasoning, decision-making, and adaptability. These agents are deployed for web navigation, personal assistance, and software development. To act effectively in real-world settings, they must handle multi-turn interactions that span several steps or decision points. This creates a need for training methods that go beyond simple response generation and instead optimize the entire trajectory of the interaction. Reinforcement learning (RL) has emerged as a compelling approach to training such agents, improving decision-making based on long-term rewards.
Despite this promise, LLM-based agents struggle with multi-turn decision-making. The main challenge is assigning appropriate credit to actions taken in the early stages of an interaction, which influence later outcomes. Traditional training methods rely on next-token prediction or imitate high-probability actions, and neither accounts for long-term dependencies or cumulative objectives. As a result, these methods fail to address the high variance and inefficiency of long-horizon tasks, especially in collaborative scenarios where understanding human intent and reasoning across multiple steps is critical.
Various reinforcement learning techniques have been adapted to fine-tune LLMs, mostly in single-turn human feedback scenarios. Methods such as PPO, RAFT, and DPO have been explored, but they exhibit major limitations when applied to sequential interactions. They often fail at effective credit assignment across turns, which reduces their effectiveness on multi-turn decision-making tasks. The benchmarks used to evaluate these methods also lack the diversity and complexity needed to robustly assess performance in collaborative, real-world settings. Value-based learning is another alternative, but its need for custom value heads and large amounts of task-specific fine-tuning data limits its generalization capabilities.
Researchers from Meta and UC Berkeley proposed a new reinforcement learning method called Sweet-RL (RL with step-wise evaluation from training-time information). They also introduced a benchmark called CollaborativeAgentBench, or ColBench. This benchmark is central to the work and provides over 10,000 training tasks and over 1,000 test cases across two domains: backend programming and frontend design. ColBench simulates real collaboration between an AI agent and a human partner, in which the agent must ask questions, refine its understanding, and provide iterative solutions. In the programming tasks, the agent must write Python functions, asking clarifying questions to fill in missing specifications. In the frontend tasks, the agent must generate HTML code that matches a visual target through feedback-driven revisions. Each task is designed to stress the agent's reasoning ability and to mimic real-world constraints, such as a limited interaction budget capped at 10 turns per session.
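To make the interaction protocol concrete, here is a minimal sketch of a ColBench-style collaboration episode. The helper names (`agent.act`, `human_simulator.respond`, `task.score`) are hypothetical stand-ins rather than the benchmark's actual API; only the 10-turn cap and the scoring signals (unit tests for code, visual similarity for design) come from the article.

```python
MAX_TURNS = 10  # interaction cap per session, as described for ColBench

def run_episode(agent, human_simulator, task):
    """Run one collaboration episode and return the final task score (sketch)."""
    history = [task.initial_prompt]
    for _ in range(MAX_TURNS):
        action = agent.act(history)  # either a clarifying question or a final solution
        if action.is_final_solution:
            # Backend tasks: unit-test pass rate; frontend tasks: cosine similarity
            return task.score(action.content)
        reply = human_simulator.respond(history, action.content)
        history.extend([action.content, reply])
    return 0.0  # turn budget exhausted without a submitted solution
```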
Sweet-RL is built around an asymmetric actor-critic structure. The critic has access to additional information during training, including the correct solution, which the actor cannot see. This information allows the critic to evaluate each decision the agent makes at a much finer resolution. Instead of training a value function that estimates the overall reward, Sweet-RL uses a Bradley-Terry optimization objective to model the advantage function directly at each turn. The advantage function indicates how much better a particular action is than an alternative, helping the agent learn precise behavior. For example, an action that better matches the human partner's expectations receives a higher advantage score. This approach simplifies credit assignment and aligns better with the pre-training of LLMs, which relies on token-level prediction.
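The turn-level Bradley-Terry objective can be illustrated with a simplified PyTorch-style sketch. Here `critic` is assumed to be any callable that maps the privileged context (which can include the hidden reference solution) plus a candidate turn to a scalar advantage score; the actual parameterization used in the paper may differ.

```python
import torch.nn.functional as F

def bradley_terry_turn_loss(critic, privileged_context, preferred_turn, rejected_turn):
    """Preference loss over turn-level advantages (simplified illustration)."""
    # The critic sees training-time information the actor never does,
    # e.g., the reference solution carried in `privileged_context`.
    adv_preferred = critic(privileged_context, preferred_turn)
    adv_rejected = critic(privileged_context, rejected_turn)
    # Bradley-Terry: maximize P(preferred beats rejected) = sigmoid(adv_pref - adv_rej)
    return -F.logsigmoid(adv_preferred - adv_rejected).mean()
```

During actor training, these per-turn advantage scores can then be used to weight or rank candidate actions, which is how credit reaches early turns without an explicit value function.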
Sweet-RL achieved a 6% absolute improvement over other multi-turn reinforcement learning methods on both the programming and design tasks. On backend programming, it passed 48.0% of unit tests and achieved a success rate of 34.4%, compared with 28.2% for multi-turn DPO and 22.4% for zero-shot performance. On frontend design, it reached a cosine similarity score of 76.9% and a win rate of 40.4%, versus 38.6% for DPO and 33.8% for fine-tuning. Even against top proprietary models such as GPT-4o and o1-mini, Sweet-RL closed the performance gap significantly, allowing the open-source Llama-3.1-8B model to match or exceed GPT-4o's 40.4% frontend win rate.
This study shows that effective training of interactive agents hinges on precise, turn-by-turn feedback rather than general value estimation or broad supervision. Sweet-RL significantly improves credit assignment by leveraging training-time information and an optimization approach aligned with the model's architecture. It promotes generalization, reduces training variance, and scales well, producing better results as more data is added. The algorithm remains effective even when applied to off-policy datasets, highlighting its practicality in real-world scenarios with imperfect data. By introducing ColBench as a benchmark tailored to realistic multi-turn tasks, the researchers also created a meaningful evaluation framework. In combination with Sweet-RL, it provides a strong foundation for developing agents that can reason, adapt, and collaborate effectively over extended interactions.
Some important takeaways from this study are:
- Sweet-RL improved the backend programming success rate from 28.2% (multi-turn DPO) to 34.4%, and the frontend win rate from 38.6% to 40.4%.
- It enabled Llama-3.1-8B to match the performance of GPT-4o, reducing dependence on proprietary models.
- The critic uses training-time information (e.g., the correct solution) that the actor cannot see, creating an asymmetric training setup.
- ColBench tasks are capped at 10 turns per session and include over 10,000 procedurally generated training examples.
- ColBench measures outcomes using unit-test pass rates (for code) and cosine similarity (for web design), providing reliable evaluation; a minimal similarity sketch follows this list.
- Sweet-RL learns a turn-wise advantage function directly, improving credit assignment without an intermediate value function.
- The method scales effectively with more data and works well even with off-policy datasets collected from weaker models.
- Compared with traditional fine-tuning methods, Sweet-RL delivers better performance with less overfitting and greater generalization.
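As a small illustration of the frontend metric, cosine similarity between feature vectors of the generated page and the target design could be computed as below. The choice of representation (e.g., embeddings of rendered screenshots) is an assumption for this sketch, not a detail stated in the article.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors (e.g., page embeddings)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```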
Check out the paper, GitHub page, and dataset. All credit for this research goes to the researchers of this project.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is constantly researching applications in fields such as biomaterials and biomedicine. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.