Yesterday, OpenAI released Deep Research, a system that browses the web to summarize content and answer questions based on that summary. The system is impressive and blew our minds the first time we tried it.
One of the main results of the blog post is a significant improvement in performance on the General AI Assistants benchmark (GAIA), a benchmark we've been playing with recently, where they reached nearly 67% correct answers on average in one shot, and 47.6% on the especially challenging "Level 3" questions that involve multiple steps of reasoning and tool usage (see below for a presentation of GAIA).
DeepResearch is composed of an LLM (which can be selected from the current list of LLMs provided by OpenAI: 4o, o1, o3, etc.) and an internal "agentic framework" that guides the LLM to use tools such as web search and to organize its actions in steps.
While powerful LLMs are now freely available as open source (see e.g. the recent DeepSeek R1 model), OpenAI has not revealed much about the agentic framework underlying Deep Research…
So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!
The clock is ticking, let's go! ⏱️
What is an Agent Framework?
An agentic framework is a layer on top of the LLM that orchestrates its operations into a series of steps, such as browsing the web or reading PDF documents. For a quick intro to agents, check out this great interview with Andrew Ng and our introductory blog post on the smolagents library. For a more detailed dive into agents, you can subscribe to our agents course that starts in a few days: link here.
Most people have already experienced how powerful LLMs can be simply by playing with chatbots. But what not everyone realizes yet is that by integrating these LLMs into agentic systems, we can give them real superpowers.
Here is a recent example comparing the performance of a frontier LLM with and without an agentic framework (in this case the simple smolagents library).
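To get a feel for what that integration looks like in practice, here is a minimal sketch using smolagents. It relies on the library's documented entry points (CodeAgent, DuckDuckGoSearchTool, HfApiModel); the default model and the example question are placeholders, so adapt them to your setup:

```python
# Minimal sketch: turn a plain LLM into a web-searching agent with smolagents.
# Assumes `pip install smolagents duckduckgo-search` and a Hugging Face API token.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

model = HfApiModel()  # defaults to a hosted instruction-tuned model
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model)

# The agent plans in steps: it writes code actions that call the search tool,
# observes the results, and iterates until it can return a final answer.
print(agent.run("How many seconds would it take a leopard at full speed to run through Pont des Arts?"))
```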
In fact, OpenAI highlighted how dramatically better Deep Research performs than standalone LLMs on the knowledge-intensive "Humanity's Last Exam" benchmark.
So, what happens when you integrate the current top LLMs into an agentic framework and work towards an open DeepResearch?
A quick note: we benchmark our results on the same GAIA challenge, but keep in mind that this is work in progress. DeepResearch is a massive achievement, and its open replication will take time. In particular, full parity will require improved browser use and interaction like the one provided by OpenAI's Operator, i.e. going beyond the text-only web interaction we explore in this first step.
First, let's understand the scope of the task: GAIA.
The GAIA benchmark
GAIA is arguably the most comprehensive benchmark for agents. Its questions are very difficult and hit on many challenges of LLM-based systems. Here is an example of a hard question:
Which of the fruits shown in the 2008 painting "Embroidery from Uzbekistan" were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film "The Last Voyage"? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o'clock position. Use the plural form of each fruit.
You can see that this question involves several challenges:
- Answering in a constrained format.
- Using multimodal capabilities to extract the fruits from the image.
- Gathering several pieces of information, some depending on others: identifying the fruits in the painting, finding which ocean liner was used in "The Last Voyage", and then finding the October 1949 breakfast menu for that ocean liner.
- Chaining the problem-solving trajectory together in the correct order.
Solving this requires both high-level planning abilities and rigorous execution, which are two areas where LLMs struggle when used alone.
An excellent test set for agent systems!
On GAIA's public leaderboard, GPT-4 does not even reach 7% on the validation set when used without any agentic setup. On the other side of the spectrum, Deep Research allowed OpenAI to reach a score of 67.36% on the validation set, an order of magnitude better! (Though we don't know how they would actually fare on the private test set.)
Let’s see if open source tools can do better!
Building an open Deep Research
Using a CodeAgent
The first improvement over traditional AI agent systems that we worked on is to use a so-called "code agent." As shown by Wang et al. (2024), having agents express their actions in code has several advantages, most notably that code is specifically designed to express complex sequences of actions.
Consider the example given by Wang et al. in their paper. It highlights several advantages of using code:
- Code actions are much more concise than JSON. Need to run 4 parallel streams of 5 consecutive actions? In JSON, you would need to generate 20 JSON blobs, each in its own step; in code, it is a single step. On average, the paper shows that code actions require 30% fewer steps than JSON, which amounts to an equivalent reduction in generated tokens. Since LLM calls are often the dominant cost of an agent system, this means your agent system runs roughly 30% cheaper.
- Code lets you reuse tools from common libraries.
- Code gives better performance in benchmarks, for two reasons: it is a more intuitive way to express actions, and LLMs have seen a great deal of code in training.
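To illustrate the conciseness point, here is a toy contrast. The visit_page function below is a stub we define purely for illustration, not the actual browsing tool of our agent:

```python
# Toy illustration of JSON tool calls vs. a single code action.

def visit_page(url: str) -> str:
    """Stub tool: pretend to fetch a page and return its text."""
    return f"<contents of {url}>"

# A JSON tool-calling agent emits one blob per call, each in its own step:
#   {"name": "visit_page", "arguments": {"url": "https://example.com/page1"}}
#   ...repeated five times, i.e. five model round trips.

# A code agent can express the same work as one generated action:
pages = [visit_page(f"https://example.com/page{i}") for i in range(1, 6)]
print(f"{len(pages)} pages fetched in a single step")
```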
The above advantages were confirmed by experiments with agent_reasoning_benchmark.
One notable additional benefit that we found while building smolagents is better handling of state, which is especially useful for multimodal tasks. Need to store this image/audio/other object for later use? No problem: just assign it to a state variable and you can reuse it four steps later if needed. In JSON, you would have to make the LLM name it in a dictionary key and trust that the LLM will later understand it is still available.
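Here is a small sketch of that state-handling advantage. The download_image function is a stub and the URL is made up; the point is only that intermediate results live in ordinary Python variables:

```python
# Sketch: in a code agent, intermediate results persist as plain variables.
# Requires Pillow (`pip install pillow`) to run.
from PIL import Image

def download_image(url: str) -> Image.Image:
    """Stub tool: a real agent tool would actually fetch the image."""
    return Image.new("RGB", (256, 256))

# One step of a code action stores the image in a variable...
painting = download_image("https://example.com/embroidery_from_uzbekistan.jpg")

# ...and several steps later the same variable can be passed to another tool,
# without round-tripping the object through the LLM as a named dictionary key.
print(painting.size)
```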
Making the right tools 🛠️
Next, we need to give the agent the right set of tools.
1. A web browser. To reach full performance we will eventually need full-fledged browser interactions like Operator's, but for this first proof of concept we started with an extremely simple text-based web browser. You can find the code here.
2. A simple text inspector that can read a variety of text file formats. Find it here.
These tools were taken from the excellent Magentic-One agent by Microsoft Research, kudos to them! We didn't change them much, since our goal was to get as high a performance as possible with the lowest possible complexity.
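For a flavor of what a text-based browsing tool can look like, here is a minimal sketch written as a smolagents tool. The markdownify dependency and the 10,000-character truncation are assumptions made for this illustration; the actual browser we borrowed is considerably more capable:

```python
# Minimal sketch of a text-based "visit page" tool (not the exact tool we use).
# Assumes `pip install smolagents requests markdownify`.
import requests
from markdownify import markdownify
from smolagents import tool

@tool
def visit_webpage(url: str) -> str:
    """Fetch a web page and return its contents converted to markdown-ish text.

    Args:
        url: Full URL of the page to visit.
    """
    response = requests.get(url, timeout=20)
    response.raise_for_status()
    # Convert HTML to markdown and truncate so it fits in the agent's context.
    return markdownify(response.text)[:10_000]
```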
Here is a short roadmap of improvements that we think would really boost these tools' performance (feel free to open a PR and contribute!):
- Extending the number of file formats that can be read.
- Proposing finer-grained handling of files.
- Replacing the web browser with a vision-based one, which we have started doing here.
Results 🏅
In our 24h+ reproduction sprint, we have already steadily improved the performance of our agent on GAIA!
We quickly rose past the previous SOTA with an open framework (around 46% for Magentic-One) to 55.15% on the validation set.
This performance bump is in large part thanks to having the agent write its actions in code! Indeed, switching to a standard agent that writes its actions in JSON instead of code immediately degrades the performance of the same setup to an average of 33% on the validation set.
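For reference, switching between the two action formats is roughly a one-line change in smolagents, assuming the library's CodeAgent and ToolCallingAgent classes; this sketch is illustrative rather than our exact evaluation harness:

```python
# Sketch: same model and tools, two action formats (Python code vs. JSON tool calls).
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel, ToolCallingAgent

model = HfApiModel()
tools = [DuckDuckGoSearchTool()]

code_agent = CodeAgent(tools=tools, model=model)         # actions written as Python snippets
json_agent = ToolCallingAgent(tools=tools, model=model)  # actions written as JSON tool calls

question = "Which ocean liner was used as a floating prop in the film 'The Last Voyage'?"
for agent in (code_agent, json_agent):
    print(type(agent).__name__, "->", agent.run(question))
```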
This is the final agent system.
We have set up a live demo here, go try it out!
But this is just the beginning, and there is a lot to improve! Our open tools can be made better, the smolagents framework can be tuned further, and we would also love to explore how better open models can support the agent.
We welcome the community to join us in this effort, so that together we can leverage the power of open research to build a great open-source agentic framework! It would let anyone run a DeepResearch-like agent at home, with their favorite models, using a completely local and customized approach.
Community reproductions
While we were working on this and focusing on GAIA, other excellent open implementations of Deep Research emerged from the community.
Each of these implementations uses different libraries for indexing data, browsing the web, and querying LLMs. In this project, we would like to reproduce the benchmarks presented by OpenAI (pass@1 average score), and to benchmark and document our findings when switching to open LLMs (such as DeepSeek R1) and to the code-native agent approach.
Most important next steps
OpenAI's Deep Research is probably boosted by the excellent web browser they introduced with Operator.
So we are tackling that next! In more general terms, we are going to build GUI agents, i.e. agents that can view your screen and act directly with mouse and keyboard. If you are excited about this project and want to help make such cool capabilities accessible to everyone through open source, we would love to have your contribution!
We are also hiring full-time engineers to help us work on this.