It’s great to see the community actively engaging with the new MiniMax M2, with many highlighting its impressive performance on complex agent tasks. This is especially exciting for me, because my work focused on the agent capability side of post-training. In this post, I’d like to share some of the key insights and lessons we learned along the way.
The real agent problem: benchmarks or reality?
If you’ve ever used an LLM agent, you’ve probably felt this pain. The same model may look great in one framework but be useless in another. Agents can use tools to beat leaderboards but still fail spectacularly at simple real-world tasks. The gap between benchmark performance and real-world usability is one of the biggest challenges in this field.
When we designed M2, we knew we had to tackle this issue head-on. That meant pursuing two central, and sometimes conflicting, goals.
1. Excel on open benchmarks. Benchmarking is essential for measuring “pure” capabilities. Benchmarks like BrowseComp, for example, test advanced search skills. Users rarely ask unnatural questions like “Find a paper where the third letter of the nth author’s name is ‘x’,” but a model that can solve them clearly has strong fundamental capabilities.
2. Generalize robustly to the real world. This is the harder and more important part. A good agent should run reliably with unfamiliar tools, IDEs/CLIs, agent scaffolds, and user setups. It can’t be a one-trick pony; it has to generalize.
So which one do we optimize for? The answer is both. We build capabilities against benchmarks, but we ultimately serve users by making those capabilities work everywhere.
How we climb benchmarks is a deep topic for another day; here I want to focus on the second, harder goal: how to train agents for real-world environments.
The need for interleaved thinking
Early on in the project, we hit a frustrating wall: the agent was performing inconsistently, and I had a hard time diagnosing why. After many discussions, especially with Professor @Junxian He and @Wenhu Chen, we reached our first major conclusion: agents need interleaved thinking.
This means that the agent’s internal monologue (its “thoughts”) can and should occur at any point during the task, not just once at the beginning as in standard reasoning models. This design is important for two reasons:
1. Staying focused on long-horizon tasks. Complex agent tasks involve very long contexts, and a single initial thought process is not enough to keep following instructions and stay consistent throughout.
2. Adapting to external perturbations. This is the crucial difference. Agent tasks introduce constant, unpredictable perturbations from the external world (i.e., tool outputs). The model must be robust enough to handle these perturbations, diagnose errors, and extract the useful information. Interleaved thinking lets the model continually re-evaluate and adapt to new information from the environment.
This principle became the basis of M2’s effectiveness.
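To make this concrete, here is a minimal sketch of the difference. The message layout and field names below are purely illustrative (they are not M2’s actual chat format); the point is simply where the thinking happens.

```python
# Illustrative only: roles and fields ("thinking", "tool_call", ...) are placeholders,
# not M2's real message schema.

# Standard reasoning model: one up-front thought, then a fixed plan.
single_thought_trajectory = [
    {"role": "assistant", "thinking": "Plan: read the file, patch it, run the tests."},
    {"role": "assistant", "tool_call": {"name": "read_file", "args": {"path": "app.py"}}},
    {"role": "tool", "content": "FileNotFoundError: app.py"},
    # No further thinking: the agent keeps executing a stale plan.
    {"role": "assistant", "tool_call": {"name": "apply_patch", "args": {"path": "app.py"}}},
]

# Interleaved thinking: the model re-evaluates after every external observation.
interleaved_trajectory = [
    {"role": "assistant", "thinking": "Plan: read the file, patch it, run the tests."},
    {"role": "assistant", "tool_call": {"name": "read_file", "args": {"path": "app.py"}}},
    {"role": "tool", "content": "FileNotFoundError: app.py"},
    {"role": "assistant", "thinking": "The file is missing; the entry point may live elsewhere. "
                                      "List the directory before patching anything."},
    {"role": "assistant", "tool_call": {"name": "list_dir", "args": {"path": "."}}},
]
```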
Pro tip for M2 users: M2 relies on interleaved thinking, so its context is its memory. For best performance, keep the complete session history, including the thinking steps, across turns. We’ve noticed that much of the community feedback about performance gaps comes from accidentally discarding this context, which is easy to do in simpler inference setups.
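In practice, this just means replaying the full message list, including whatever reasoning field your endpoint returns, on every request. Here is a rough sketch against an OpenAI-compatible chat API; the endpoint, model id, tool definition, and the assumption that the reasoning field survives a round-trip through the client are all illustrative, so adapt it to your actual setup.

```python
from openai import OpenAI  # any OpenAI-compatible client; the endpoint below is a placeholder

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

TOOLS = [{  # minimal illustrative tool schema
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

def run_tool(call):
    # Stub executor; wire this to your real sandbox or shell.
    return f"(pretend output of {call.function.name})"

messages = [{"role": "user", "content": "Fix the failing test in this repo."}]

for _ in range(8):  # crude step cap for the sketch
    resp = client.chat.completions.create(model="MiniMax-M2", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message

    # Append the assistant turn exactly as returned, so any reasoning/thinking field
    # (the field name varies by provider) and the tool calls are replayed on the next
    # request instead of being silently discarded.
    messages.append(msg.model_dump(exclude_none=True))

    if not msg.tool_calls:
        break
    for call in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool(call),
        })
```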
True generalization is about perturbations
Our initial theory was simple: scaling tools leads to agent generalization.
We started with a minimal set of tools (a Python interpreter, a search engine, and a browser) to build a baseline of tool-calling capability. The roadmap seemed clear: as we expanded the number and diversity of tools, the ability to generalize to unseen tools would naturally follow.
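For reference, the baseline tools were just ordinary JSON-schema-style function definitions, along the lines of the sketch below. The names and fields are illustrative, not our exact internal schemas.

```python
# Illustrative baseline toolset; only meant to show the shape of the definitions.
BASELINE_TOOLS = [
    {
        "name": "python",
        "description": "Execute Python code in a sandboxed interpreter and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
    {
        "name": "search",
        "description": "Query a web search engine and return a ranked list of results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}, "top_k": {"type": "integer", "default": 5}},
            "required": ["query"],
        },
    },
    {
        "name": "browser_open",
        "description": "Fetch a URL and return the page content as text.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
]
```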
This worked fine at first, and benchmark scores climbed to a respectable level. But as I dug deeper, I realized we were solving the wrong problem. The model passed our tests, yet any change to the environment, such as swapping in a different scaffolding framework, caused a sudden drop in performance. We were still far from the “practical” model we were aiming for.
This led to a second, deeper realization. Agent generalization is not just about adapting to new tools. It’s about adapting to perturbations across the model’s operational space.

It sounds abstract, so let’s break it down. Think about all the things that can change in a single agent task:
1. The tool descriptions and the available toolset.
2. The system prompt that defines the agent’s persona and rules.
3. The user prompt and its specific goal.
4. The environment itself (files, codebase, APIs).
5. The tool responses returned at each step.
Our old “tool scaling” approach only addressed the first item; perturbations in every other part of the process were ignored.
Based on this new understanding, our team built a comprehensive data pipeline designed for full-trajectory generalization: the generated data trains the model to stay stable under perturbations at every step (see the toy sketch below).
The results were incredibly encouraging. In internal testing, we dropped M2 into an obscure “cold start” scaffold, a framework we had barely considered before, and its performance exceeded our expectations. Both tool calling and instruction following generalized successfully.
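To make “perturbing every axis” concrete, here is the toy sketch mentioned above. It is not our actual pipeline; the task fields and perturbations are invented purely to illustrate that every component of a task, not just the toolset, gets varied.

```python
import random

# Toy, self-contained sketch: perturb every axis of a task spec, not just the tools.
# Field names and perturbations are illustrative only.
def perturb_task(task: dict, rng: random.Random) -> dict:
    t = dict(task)

    # 1) Toolset: rename a tool so the model cannot memorize exact names.
    if rng.random() < 0.5 and t["tools"]:
        i = rng.randrange(len(t["tools"]))
        t["tools"] = list(t["tools"])
        t["tools"][i] = t["tools"][i] + "_v2"

    # 2) System prompt: swap in a different persona / rule set.
    if rng.random() < 0.5:
        t["system"] = rng.choice([
            "You are a terse CLI assistant. Never apologize.",
            "You are a careful pair programmer. Explain each step.",
        ])

    # 3) User prompt: rephrase or add a constraint.
    if rng.random() < 0.5:
        t["user"] += " Keep the diff minimal."

    # 4) Environment: shuffle the file layout the agent will see.
    if rng.random() < 0.5:
        t["files"] = {f"src/{name}": body for name, body in t["files"].items()}

    # 5) Tool responses: inject a transient error the agent must recover from.
    if rng.random() < 0.5:
        t["inject_tool_error"] = "first call to any tool returns a timeout"

    return t

seed = {
    "tools": ["python", "search", "browser_open"],
    "system": "You are a helpful coding agent.",
    "user": "Fix the failing unit test.",
    "files": {"app.py": "...", "test_app.py": "..."},
}
print(perturb_task(seed, random.Random(0)))
```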
What’s next?
Our work on M2 taught us a tremendous amount about agents, generalization, and data, but it also raised more questions than it answered. Many of our ideas are still on the whiteboard. Over the coming months, we can’t wait to explore these frontiers more deeply and bring you the next generation of powerful, truly useful models.
Get involved
Use the model: We sincerely hope you’ll test M2. You can access it through our official channels or grab the open-source release and do your own research.
Join our team: If challenges like this excite you, we’re hiring. We’re always looking for passionate people to join us in our mission to build AGI. Send us your resume!

