Versa AI hub
How we achieved cutting-edge technology

By versatileai · December 8, 2025

Research agents are rapidly becoming one of the most important applications of AI. Research is a fundamental knowledge-work task: gathering, reading, and synthesizing information underpins everything from writing and decision-making to coding itself. Human-driven research, however, is constrained by memory, reading speed, and time. AI research agents, in contrast, can process vast amounts of information, integrate insights instantly, and scale easily. For this reason, research agents are emerging as a top use case for AI today and will soon become a core subcomponent of broader agent workflows across content generation, coding, sales, and more. In this post, we share the technical and philosophical lessons we learned building a cutting-edge research agent, and where we believe the field is headed.

Building for the future

Agent harness

The task of building an agent harness is to create a software layer that enhances the model's runtime execution through context management, tool invocation, loop control, orchestration, and error handling. But building applications on top of rapidly improving models is a distinctly modern engineering challenge: how do you design software today that absorbs the performance improvements of future model releases?

Doing so requires predicting how models will evolve, being optimistic about their progress, limiting assumptions, and avoiding premature manual optimization.
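The harness responsibilities above (context management, tool invocation, loop control, error handling) can be sketched as a single loop. This is a minimal, framework-agnostic sketch, not any specific implementation; `call_model` and the `tools` dict are hypothetical stand-ins:

```python
# Minimal sketch of an agent harness loop. `call_model` and `tools` are
# assumed interfaces, not part of any particular framework.

def run_agent(task: str, call_model, tools: dict, max_iters: int = 10) -> str:
    """Drive the model through a tool-call loop with basic error handling."""
    context = [{"role": "user", "content": task}]   # context management
    for _ in range(max_iters):                      # loop control: hard cap
        reply = call_model(context)
        if reply.get("tool") is None:               # model gave a final answer
            return reply["content"]
        name, args = reply["tool"], reply.get("args", {})
        try:
            result = tools[name](**args)            # tool invocation
        except Exception as exc:                    # error handling: feed the
            result = f"tool '{name}' failed: {exc}"  # failure back to the model
        context.append({"role": "tool", "name": name, "content": str(result)})
    return "max iterations reached"
```

Keeping the loop this thin is the point: the less orchestration logic the harness hard-codes, the more of a future model's improved tool-calling it can absorb unchanged.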

We learned this the hard way seven months ago, when we had to abandon our first attempt at deep research and rebuild the entire system from scratch. The initial architecture was complex and sophisticated (which we thought was a good thing), but that very sophistication became a bottleneck when the next generation of models arrived.

Model

Over the past seven months, model capabilities have quietly but meaningfully evolved, particularly in tool invocation. Riding that single improvement let us move from workflows to agents. We believe future models will be trained to solve the current challenges facing agent developers: every model is ultimately consumed through a harness, so models must evolve alongside their harnesses. We expect improvements in summarization recall (for context compression), tool-call reliability, and conciseness of output.

Tools

Similarly, tools will need to evolve to support LLMs and widely adopted agent harnesses. The best tools perform some context engineering on the tool side, abstracted away from the agent: rather than dumping a large number of tokens into the context window, they return only the most relevant data. As a tools provider, we have invested heavily in advanced search capabilities with built-in context engineering, which reduces hallucinations and latency for downstream agent processes.
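As an illustration of tool-side context engineering, the sketch below returns only the chunks of a page most relevant to a query, rather than the full text. The keyword-overlap scorer is a deliberate simplification standing in for a real relevance model, and the function is an assumption for illustration, not a real tool's API:

```python
# Illustrative tool-side context engineering: score fixed-size chunks of a
# page against the query and return only the top few, instead of the whole
# page. A production system would use embeddings; keyword overlap is a stand-in.

def relevant_chunks(query: str, page_text: str,
                    chunk_size: int = 200, top_k: int = 3) -> list[str]:
    words = page_text.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    q_terms = set(query.lower().split())
    # rank chunks by how many query terms they contain
    scored = sorted(chunks,
                    key=lambda c: len(q_terms & set(c.lower().split())),
                    reverse=True)
    return scored[:top_k]
```

The agent's context then grows by a few hundred tokens per source instead of a few thousand, which is exactly the saving the linear series later in this post depends on.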

Takeaways

To build an agent that improves over time, we followed a few basic principles.

  • Simplify orchestration logic and emphasize autonomy.
  • Pay close attention to which models and tools are improving, and take advantage of their new capabilities.
  • Focus on context engineering (more on this in the next section).

Context engineering — a curation exercise

Long-horizon research tasks reveal a fundamental challenge in current agent design: maintaining a clean, optimized context window over time. If context curation is not something engineers pay close attention to, the agent is nearly doomed to fail. Below is an overview of our thinking on this concept in the deep-research setting.

Web search with context management

Using Tavily’s Advanced Search is a natural first step to overcoming this challenge, as it abstracts the processing of raw web content and returns only the most relevant chunks of content from each source. By leveraging this feature, you let Tavily Search do the heavy lifting and let Tavily Research reap the benefits, collecting the most valuable content in a latency-efficient manner.

Preventing agents from overfitting to a single research thread is the next step toward an effective context-collection pipeline. Global state persistence and source deduplication are of paramount importance here; in our case, they have three effects:

  • They ensure agents are only exposed to the latest information.
  • They let engineers recognize when the scope of information is narrowing and prompt agents to explore untapped, relevant domains.
  • They aid effective attribution later in the generation process.
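A minimal sketch of such global state with source deduplication follows. The class and method names are illustrative, not Tavily's actual implementation:

```python
# Illustrative global research state: persist each source once (dedup by URL)
# and expose the set of visited domains so a narrowing scope can be detected.

class ResearchState:
    def __init__(self) -> None:
        self.sources: dict[str, str] = {}   # url -> extracted content

    def add_source(self, url: str, content: str) -> bool:
        """Persist a source once; return False if the URL was already seen."""
        if url in self.sources:
            return False                    # dedup: agent sees only new info
        self.sources[url] = content
        return True

    def coverage(self) -> set[str]:
        """Domains visited so far, useful for spotting a narrowing scope."""
        return {url.split("/")[2] for url in self.sources if "//" in url}
```

The boolean return from `add_source` is what lets a supervising loop notice that the agent keeps rediscovering the same sources and nudge it toward unexplored domains. The stored `sources` map also supports attribution at generation time.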

At Tavily, our interactions with the web are our bread and butter. Building a sophisticated web search system designed for deep research was a fundamental component of the overall design of the deep research agent.

Modeling human-web interaction

Humans conduct research in an inherently unstructured, iterative manner. They start by defining the task: what they are trying to achieve and what information they need. They then collect data from sources, extract important insights into short-term memory, and use these distilled thoughts to guide subsequent actions.

This cycle of gathering information, extracting insights, and deciding what to do next repeats. Only after gathering enough understanding to create the final product do you go back to the original sources and use them as references to assemble it.

We believe deep research agents should be designed the same way: tool outputs should be distilled into reflections, and only the set of past reflections should be used as context for subsequent tool calls. As with humans, raw information should be provided as context only at the point when the agent begins preparing the final product, ensuring no information is lost where it matters.
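This reflection pattern can be sketched as follows, with `search` and `distill` as assumed callables rather than any specific API:

```python
# Sketch of the reflection pattern: each tool output is distilled into a short
# reflection, and only reflections (never raw outputs) feed the next tool
# call. Raw sources are retained separately and surface only at writing time.

def research(question: str, search, distill, max_steps: int = 5):
    reflections: list[str] = []   # short-term memory kept in context
    raw_sources: list[str] = []   # full content, kept OUT of the loop context
    for _ in range(max_steps):
        # the next action sees only the question plus distilled reflections
        result = search(question, context=reflections)
        raw_sources.append(result)
        reflections.append(distill(result))   # keep only the insight
    # raw sources re-enter the context only for final report generation
    return reflections, raw_sources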

Do more with less

This approach differs from context structuring in traditional ReAct-style agent architectures. There, tool calls and their outputs are propagated through the tool-call loop, so previously retrieved or generated tokens persist in the context window on every subsequent iteration. This pattern can be seen in LangChain's Open Deep Research agent implementation. From a token-consumption perspective, letting n denote the number of tokens added per tool-call iteration and m the number of iterations, it can be modeled by the following quadratic series:

n + 2n + 3n + ⋯ + mn = n · m(m+1)/2

In contrast, the context engineering method we propose removes this token propagation (the distilled reflections, even in aggregate, are negligible compared to the volume of tokens collected from the web) and can be modeled by the following linear series:

n + n + n + ⋯ + n = n · m


Comparing the two approaches, tokens are saved per agent by a factor of (m+1)/2. Extrapolating this to consumption at scale in a multi-agent system, the absolute number of tokens saved becomes even more significant.
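A quick numeric check of the two series and the savings factor, assuming for illustration n = 1,000 tokens per iteration and m = 10 iterations:

```python
# Numeric check: with n tokens per iteration and m iterations, propagating
# context costs n*m*(m+1)/2 tokens; passing only reflections costs n*m,
# a per-agent saving factor of (m+1)/2.

def propagated_tokens(n: int, m: int) -> int:
    return sum(k * n for k in range(1, m + 1))   # n + 2n + ... + mn

def reflection_tokens(n: int, m: int) -> int:
    return n * m                                  # n + n + ... + n

n, m = 1000, 10
assert propagated_tokens(n, m) == n * m * (m + 1) // 2   # 55,000 tokens
assert reflection_tokens(n, m) == 10_000                  # 10,000 tokens
assert propagated_tokens(n, m) / reflection_tokens(n, m) == (m + 1) / 2  # 5.5x
```

At ten iterations the reflection approach already costs 5.5x fewer tokens, and the factor keeps growing linearly with m.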

Through this methodology, we were able to achieve SOTA on DeepResearch Bench while reducing token consumption by 66% (compared to Open Deep Research). In other words, it’s the perfect intersection of quality and efficiency.


Agent production — an ongoing challenge

Building a production-grade agent requires a balance. We focused on autonomy to maximize performance and quality while meeting stringent latency, cost, and reliability requirements.

Engineering with non-determinism

We found that LLMs are inherently non-deterministic, and that giving them guardrailed freedom to reason and iterate yields the most powerful results. But unchecked autonomy can send an agent's behavior off track: tools may be called incorrectly, the LLM may overfit to subtopics, and expected inference patterns may break. No single safeguard solves all of these problems.

This requires a shift in engineering thinking: treat failure modes as a core design consideration rather than an afterthought. Simple guardrails like tool-call retries and model cascades help, but it is proactively anticipating anomalies, enforcing good prompting patterns, and testing edge cases that enable production-grade, long-running agents.
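Those two guardrails, tool-call retries and model cascades, can be sketched together. The model callables here are placeholders, not a specific provider's API:

```python
# Sketch of two simple guardrails: retry transient failures, then cascade to
# a fallback model if the primary keeps failing. Models are placeholder
# callables that take a prompt and return a string.

def call_with_guardrails(prompt: str, models: list, retries: int = 2) -> str:
    """Try each model in order, retrying each before cascading to the next."""
    last_error: Exception | None = None
    for model in models:                 # cascade: primary first, then backups
        for _ in range(retries + 1):     # retry transient failures in place
            try:
                return model(prompt)
            except Exception as exc:     # e.g. malformed tool call, timeout
                last_error = exc
    raise RuntimeError(f"all models failed: {last_error}")
```

In practice a cascade might pair a fast, cheap model as the primary with a slower, more capable model as the fallback, so most calls stay cheap while hard cases still succeed.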


Best tools — less is more

In our experience, it is better to expose a small but essential toolset to agents than a large, complex one. It was tempting to over-engineer by adding many tools that seemed useful in theory, but in practice this introduced new failure modes and made it harder for LLMs to consistently select the right tool and iterate effectively.

Evals

Although we used evals to guide our development process, we were also aware of their drawbacks. LLM-as-a-judge ratings are difficult to trust: current models are non-deterministic and their judgments are hard to interpret, which can become a bottleneck, especially for long-running agents where a single experiment takes several days to complete.

Rather than optimizing benchmark scores, we optimized for directional feedback. The central question was always: did this change make the agent more reliable and actually usable? Evals are not an optimization target but a tool for validating direction. Intuitive, careful monitoring of agent traces consistently provided higher-signal feedback than any single evaluation score. The highest numerical score is rarely the best result: for production systems, improvements in token usage, reliability, latency, and failure rates are worth more than a one-point gain on an evaluation.

If you are interested in experiencing the results of these findings first-hand, you can sign up for early access to Tavily Research here.

© 2025 Versa AI Hub. All Rights Reserved.
