Give the community the authority to research agents

By versatileai | September 23, 2025

In an ideal world, AI agents become trusted assistants. Given a query, an agent would smoothly handle ambiguous instructions, develop a step-by-step plan, correctly identify the resources it needs, execute that plan without going off track, and adapt to unexpected events. However, building agents and testing these behaviors is no small feat. If you have ever tried to debug your own agent, you have probably noticed how tedious and frustrating it can be. Existing evaluation environments are tightly coupled to the task being evaluated, lack real-world flexibility, and do not reflect the messy reality of open-world agents: simulated pages never fail to load, and events never occur spontaneously.

That is why we are introducing GAIA2, a follow-up to the agent benchmark GAIA that enables the analysis of considerably more complex behaviors. GAIA2 is released within the open Meta Agents Research Environments (ARE) framework for running, debugging, and evaluating agents, which can simulate complex, real-world-like conditions and customize agent behavior for further research. The GAIA2 dataset is released under a CC BY 4.0 license, and ARE under an MIT license.

GAIA2: Agent evaluation on realistic assistant tasks

GAIA is an agent benchmark published in 2023, with three levels of information-seeking questions requiring tools, web browsing, and reasoning to solve. Two years later, it is time for a brand-new agent benchmark: the easiest level has become easy for current models, and the community is getting close to solving the most difficult questions!

GAIA2 goes far beyond this in the range of capabilities it surfaces and investigates!

Where GAIA was read-only, GAIA2 is a read-and-write benchmark, focused on interactive behavior and complexity management. Agents are now evaluated not only on search, but also on following ambiguous or time-sensitive instructions, in a noisy environment with controlled disturbances that reflects real-world conditions better than other mock environments. We want to test how agents manage tools and APIs that fail, plan sequences of actions within a precise time frame, and adapt to new events. It is a whole new range of complexity.

To do this, it uses the following task groups (across 1,000 new human-created scenarios):

  • Execution: following multi-step instructions and using tools (e.g., updating contacts)
  • Search: gathering information across sources (e.g., finding friends' cities from WhatsApp)
  • Ambiguity handling: clarifying conflicting requests (e.g., scheduling conflicts)
  • Adaptability: reacting to changes in the simulation (e.g., updating an email with follow-up information)
  • Time: time-sensitive instructions (e.g., acting within a given time frame)
  • Collaboration: communicating with other agents without direct API access
  • Noise tolerance: robustness to API failures and environment instability

In the spirit of GAIA, the scenarios require no specialized knowledge: a human should in principle be able to score 100%. This makes debugging easy for model developers.

Want to explore the benchmark? Take a look at the dataset; the demo will give you an even better view.

How does GAIA2 run?

GAIA2 runs on ARE, which gives the chosen agent access to a combination of applications and their associated pre-populated data.

For GAIA2, we created a mock smartphone environment simulating what humans use in their daily lives. It includes realistic applications for messaging (email, chat) and utilities (calendar, contacts, shopping, file system, etc.), all accessible to agents via tool calls. Last but not least, the demo also includes the simulated persona's conversation history and app interactions.

All agent interactions are automatically recorded as structured traces for deep dives and post-hoc analysis. These include tool calls, API responses, model thoughts, timing metrics (such as response latency), user interactions, and more, and can all be exported as JSON.
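Consuming such an exported trace might look like the following sketch; the JSON field names below are illustrative assumptions, not the actual ARE schema.

```python
import json

# Illustrative trace export; the real ARE JSON schema may differ.
trace = json.loads("""
{
  "steps": [
    {"type": "model_thought", "content": "I should look up the contact first."},
    {"type": "tool_call", "tool": "Contacts.search", "latency_ms": 120},
    {"type": "api_response", "tool": "Contacts.search", "status": "ok"},
    {"type": "tool_call", "tool": "Email.send", "latency_ms": 340},
    {"type": "api_response", "tool": "Email.send", "status": "ok"}
  ]
}
""")

# Summarize tool usage and timing from the structured steps.
tool_calls = [s for s in trace["steps"] if s["type"] == "tool_call"]
total_latency = sum(s["latency_ms"] for s in tool_calls)
print(f"{len(tool_calls)} tool calls, {total_latency} ms total latency")
```

Because the trace is plain JSON, the same approach works on any step-structured export: filter by step type, then aggregate whichever metric you care about.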


Results

For reference, we compare Llama 3.3 70B Instruct, Llama 4 Maverick, GPT-4o, Qwen3-235B MoE, Grok 4, Kimi K2, Gemini 2.5 Pro, Claude 4 Sonnet, and GPT-5 across all their reasoning modes.

All models are evaluated in the same setup (a uniform ReAct loop for consistency, a temperature of 0.5, and a 16K-token generation limit), using each model as-is for the task at hand. All 101 tools (and a general description of the environment) are provided in the system prompt.


Among the models evaluated, the overall highest-scoring model as of September 2025 is GPT-5 with high reasoning, and the best open-source model is Kimi K2.

Some capabilities already appear to be solved in the best models: simple tool invocation and instruction following (the execution split), and search overall (as could be inferred from current GAIA results). The ambiguity, adaptability, and noise splits remain challenging for all models, and it is interesting to see that performance on what used to be considered complex agent tasks (instruction following and search) is not a good proxy for performance on these near-real-world tasks. Last but not least, the hardest split for all models at this time is time: models currently struggle to handle time-sensitive actions properly (though this might be mitigated by specialized tools and better temporal reasoning). A detailed analysis of these results is provided in the paper.

However, we believe it is important to push reporting beyond raw scores. If a model reaches the correct solution but spends thousands of tokens or runs for hours to get there, it is not "as good" as a model that succeeds several orders of magnitude faster. We therefore normalize scores by cost, quantified as the average number of LLM calls and output tokens (both define cost-performance Pareto frontiers). The paper reports score versus monetary cost and time.
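Computing such a cost-performance Pareto frontier from (cost, score) pairs is straightforward; the numbers below are made up for illustration, not taken from the paper.

```python
# Each entry: (model, avg LLM calls, avg output tokens, score).
# The values are invented for illustration only.
runs = [
    ("model-a", 12, 4000, 0.62),
    ("model-b", 30, 15000, 0.65),
    ("model-c", 10, 3500, 0.55),
    ("model-d", 45, 30000, 0.64),
]

def pareto_frontier(runs):
    """Keep runs not dominated on (lower cost, higher score); cost = output tokens here."""
    frontier = []
    for name, calls, tokens, score in sorted(runs, key=lambda r: r[2]):
        # A run joins the frontier only if it beats every cheaper run's score.
        if not frontier or score > frontier[-1][3]:
            frontier.append((name, calls, tokens, score))
    return frontier

print([r[0] for r in pareto_frontier(runs)])  # → ['model-c', 'model-a', 'model-b']
```

The frontier makes the trade-off explicit: model-d is dropped because a cheaper model already scores higher.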

(Figure: cost-performance Pareto frontiers)

Compare your favorite models! Evaluating on GAIA2

If you want to evaluate your model on GAIA2, follow these steps:

First, install Meta Agents Research Environments in the Python environment of your choice (uv, conda, virtualenv, …):

pip install meta-agents-research-environments

Next, run the benchmark for each configuration: execution, search, adaptability, time, and ambiguity. Don't forget to upload all the results to the Hub using the hf_upload argument!

are-benchmark run --hf meta-agents-research-environments/gaia2 --split validation --config CONFIGURATION --model YOUR_MODEL --model_provider YOUR_PROVIDER --agent default --max_concurrent_scenarios 2 --scenario_timeout 300 --output_dir ./monitored_test_results --hf_upload

Then run the judge to get the aggregated scores file:

are-benchmark judge --hf meta-agents-research-environments/gaia2 --split validation --config CONFIGURATION --agent default --max_concurrent_scenarios 2 --scenario_timeout 300 --output_dir ./monitored_test_results --hf_upload your_hub_dataset_to_save_results

Finally, add all the relevant information about your model to the README, and share it on the leaderboard, where GAIA2 traces are centralized!

Beyond GAIA2: Researching agents

You can use the GAIA2 apps and content beyond the benchmark scenarios, for example to check whether models can correctly solve non-verifiable tasks such as reading emails, writing follow-ups, adding events to the calendar, and booking meetings.

You can also easily customize the environment by 1) connecting your own tools (via MCP or directly) to test your agent, and 2) implementing your own scenarios, including definitions of triggers and timed events (e.g., after 2 minutes the email app receives a new email from a contact), to see how well agents adapt to an evolving environment.
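A toy sketch of such timed events, in the spirit of ARE triggers but not its actual scenario format: events are scheduled at simulation timestamps and delivered once the clock passes them.

```python
import heapq

# Toy timed-event queue ("after 2 minutes, the email app receives a new
# email"); the real ARE scenario format differs.
events = [
    (120, "email", "New email from a contact arrives"),
    (300, "calendar", "Meeting reminder fires"),
]
heapq.heapify(events)

def run_simulation(events, until):
    """Deliver every event scheduled at or before `until` seconds."""
    fired = []
    while events and events[0][0] <= until:
        event = heapq.heappop(events)
        fired.append(event)  # in ARE, this is where the agent would react
    return fired

fired_events = run_simulation(events, until=180)
print(fired_events)  # → [(120, 'email', 'New email from a contact arrives')]
```

At 180 simulated seconds only the two-minute email has fired; the calendar reminder stays queued, which is exactly the kind of mid-scenario change an adaptive agent has to handle.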

(The agent is a JSON agent by default, so it cannot harm your machine unless, of course, you connect it to external apps with unsafe permissions. So proceed with caution when adding your own apps or using untrusted MCP servers.)

Here are some of the use cases we explored:

  • Vibe-checking agents with real or simulated data, investigating different setups with your own rules, tools, content, and validation
  • Testing agents' tool-calling and orchestration capabilities using local apps or MCP tools
  • Probing limits in noisy environments (with API timeouts and ambiguity)
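For the noisy-environment use case, one simple pattern is to wrap a tool so that a fraction of calls time out, and give the agent a retry policy. The helper names here are illustrative, not part of the ARE API.

```python
import random

# Noise injection for robustness testing: a wrapper that makes a tool fail
# with simulated timeouts, plus a simple retry policy on the caller's side.
def flaky(tool, failure_rate, rng):
    def wrapped(*args):
        if rng.random() < failure_rate:
            raise TimeoutError("simulated API timeout")
        return tool(*args)
    return wrapped

def call_with_retries(tool, *args, retries=3):
    for _ in range(retries):
        try:
            return tool(*args)
        except TimeoutError:
            continue  # an agent could also back off or re-plan here
    raise TimeoutError("gave up after retries")

rng = random.Random(0)  # seeded so the injected noise is reproducible
search = flaky(lambda q: f"results for {q}", failure_rate=0.5, rng=rng)
print(call_with_retries(search, "friend's city"))
```

Sweeping the failure rate (and the retry budget) gives a quick, controlled picture of how gracefully an agent degrades as its tools get less reliable.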

We recorded three videos showcasing some of these use cases (though of course we hope the community gets creative 🤗). These videos use the default demo mentioned above, which includes the simulated life of Linda Renne, a PhD student in machine learning.

1) Test your agent with simple tasks: Event Organisation

Let's plan a birthday party to test how good the default model is at event organisation!

First, we ask the agent to text everyone in the Renne family about the user's 30th birthday party on November 7th. The default universe has 21 contacts, including five Renne family members: Linda, the simulation's "owner"; her parents George and Steffy; her sister Anna; and her grandfather Morgan. The agent successfully goes through the contact list, finds the four family members, and sends them a text message.

Next, we ask the agent to create a calendar invitation and add the family as invitees. The agent remembers the previous context: it creates a calendar event with the correct date and correctly adds the family members.


2) Understand your agent: Dig deep into traces

You can also inspect the traces behind the actions the agent took. Open the agent log tool on the left to see the system prompt, chain of thought, multi-step actions with their tool calls, and results, all as a neatly organized log. If you want to consult it offline, you can export everything as JSON!


3) Play with and extend the demo: Connect your agent to an MCP server

In this last example, we spice things up: we connect the agent to a remote robot arm through MCP and have it answer yes-or-no questions by shaking the arm. This is what it looks like.


These examples are just very simple starting points; we really aim to see what you build! (More advanced users can also install and edit the ARE codebase directly.)

Conclusion

GAIA2 is a new research tool that enables simple experimentation, gives anyone access to realistic evaluations, and makes it easier to build more reliable and adaptive AI agents by improving trust through transparent, reproducible benchmarks and debuggable traces.

We'd love to see what you'll do with this project!
