How NVIDIA AI-Q reached #1 on DeepResearch Bench I and II

By versatileai · March 12, 2026 · 9 min read

Contributors: Raja Biswas, Divyansh Jain, Ivan Sorokin, Alessio Devoto, Chantal D Gama Rose, Ajay Thorve, David Austin, Jean-Francois Puget

NVIDIA AI-Q Deep Research Agent recently ranked #1 on both DeepResearch Bench (55.95) and DeepResearch Bench II (54.50), two leading benchmarks for evaluating deep research agents. This is a significant step toward open and portable deep research: a single composable stack leading both benchmarks shows that the models and tools developers already have access to can drive cutting-edge agent research.

What’s special about AI-Q? AI-Q is an open blueprint for building AI agents that reason over enterprise and web data and produce well-cited responses. It provides a completely open, modular architecture that enterprises can own, inspect, customize, and configure for each use case. Deep Researcher is one workflow within the larger AI-Q blueprint, which also includes intent routing, query disambiguation, and shallow research. Deep Researcher employs a multi-agent architecture consisting of a planner, researcher, and orchestrator built on the NVIDIA NeMo Agent Toolkit and a fine-tuned NVIDIA Nemotron 3 Super model, with optional ensembles and report refiners to maximize report quality. One stack, with a flexible design that can be adjusted to suit your needs.

Why beating both benchmarks is important

DeepResearch Bench I and II evaluate research agents in complementary ways.

The DeepResearch Bench score evaluates the quality of a report against a reference report along the dimensions of comprehensiveness, depth of insight, instruction following, and readability. Doing well here rewards a sophisticated, well-constructed narrative and strong overall report quality.

DeepResearch Bench II uses over 70 fine-grained binary rubrics per task to check whether the agent retrieves the correct information (information recall), integrates it into higher-level analysis (analysis), and presents the results clearly (presentation). Doing well here rewards detailed factual accuracy and analytical rigor.
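To make the rubric-style scoring concrete, here is an illustrative sketch of scoring a report as the fraction of binary rubrics passed per dimension. The dimension names mirror the three described above, but the aggregation (unweighted mean across dimensions) is an assumption, not the benchmark's actual formula.

```python
# Illustrative only: score a report against binary rubrics grouped by
# dimension, mimicking the recall/analysis/presentation split.
from collections import defaultdict

def rubric_score(results):
    """results: list of (dimension, passed) tuples, one per binary rubric."""
    by_dim = defaultdict(list)
    for dimension, passed in results:
        by_dim[dimension].append(passed)
    # Per-dimension pass rate, then an unweighted mean across dimensions
    # (an assumption for this sketch).
    dim_scores = {d: sum(v) / len(v) for d, v in by_dim.items()}
    overall = sum(dim_scores.values()) / len(dim_scores)
    return dim_scores, overall

results = [
    ("information_recall", True), ("information_recall", False),
    ("analysis", True), ("analysis", True),
    ("presentation", True), ("presentation", False),
]
dims, overall = rubric_score(results)
```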

Outperforming on both benchmarks means that AI-Q’s Deep Researcher produces well-cited, sophisticated reports and gets the underlying search and inference right.

Architecture overview

The AI-Q Deep Researcher architecture behind both results centers on three components: an orchestrator that coordinates the research loop, a planner that maps the information landscape and designs an evidence-based research plan, and a researcher that dispatches parallel experts to collect and synthesize evidence across multiple analytical lenses. Each agent can use a different LLM. An optional ensemble runs multiple agents in parallel and merges their output to maximize report quality and information coverage. Figure 1 shows the complete architecture.
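To make the division of labor concrete, here is a minimal sketch of the plan → research → report loop. The function bodies are stubs standing in for LLM-backed agents, and all names are hypothetical, not the blueprint's actual API.

```python
# Minimal sketch of the orchestrator's research loop. Planner, researcher,
# and report assembly are stubs standing in for LLM-backed agents.
def planner(question):
    # Real planner: scout the landscape, then design an evidence-based plan.
    return {"outline": ["Background", "Findings"],
            "tasks": [f"Search: {question} background",
                      f"Search: {question} recent results"]}

def researcher(task):
    # Real researcher: dispatch parallel experts, return a cited synthesis.
    return f"[synthesized evidence for '{task}']"

def orchestrator(question):
    plan = planner(question)
    briefs = [researcher(t) for t in plan["tasks"]]  # focused research tasks
    # Real orchestrator: also checks quality constraints and dispatches
    # gap-filling research before writing the long-form report.
    return "\n\n".join(f"## {section}\n{brief}"
                       for section, brief in zip(plan["outline"], briefs))

report = orchestrator("AI research agents")
```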


Figure 1. AI-Q Deep Researcher: Orchestrator, planner, and researcher pipeline (right) and optional ensemble (left).


Core stack: NVIDIA and Deep Research

The same underlying stack powers both leaderboard submissions. That means it’s open, reproducible, and built on:

  • NVIDIA NeMo Agent Toolkit for workflow wiring, feature registration, and evaluation. This open source library provides configuration-driven composition of LLMs and tools, plus the ability to plug in different agent graphs.
  • LangChain DeepAgents for the multi-phase planner, researcher, and orchestrator flows, with optional subagent middleware.
  • NVIDIA Nemotron 3 LLM to power the agent pipeline. Nemotron models can be fine-tuned for superior performance on research synthesis and long-horizon tool calling, and can be served via NVIDIA Build or NVIDIA NIM for model inference.

The focus throughout is on multi-stage research (planning → collection → synthesis), web search (Tavily) and scholarly article search (Serper), and citation-backed reporting. Optionally, ensemble layers and report refiners can be added on top to maximize report quality.

Main ingredients of AI-Q

Four factors were central to the results.

  • Evidence-based planning and an expert-researcher multi-agent architecture built on the NVIDIA NeMo Agent Toolkit and LangChain DeepAgents.
  • Fine-tuned NVIDIA Nemotron 3 Super: approximately 67,000 SFT trajectories, generated from a small seed dataset of research questions and filtered with a principle-based judge. This model powers the researcher and its subagents.
  • Custom middleware for long-horizon reliability: the NeMo Agent Toolkit and LangChain middleware were extended with components that improve reliability and robustness.
  • Ensemble researcher and report refiner (optional): parallel pipeline outputs merged by an LLM, with a post-hoc refiner to maximize report quality.

Each is explained in detail in the following sections.

Fine-tuned NVIDIA Nemotron 3 Super: Data and Training

The main driver of the results is the custom fine-tuned NVIDIA Nemotron-3-Super-120B-A12B model. We chose it for this workflow because it is well suited for multi-step agent inference, tool usage, and citation-based reporting. Fine-tuning on real search and synthesis trajectories lets it effectively fill the planner, researcher, and orchestrator roles at scale.

Trajectory generation

We collected research questions from multiple open source datasets: approximately 17,000 from OpenScholar, approximately 21,000 from ResearchQA, and 2,457 from Fathom-DeepResearch-SFT. We then used the open source GPT-OSS-120B model to generate approximately 80,000 trajectories across the complete workflow. Each trajectory covers the actions of the planner, researcher, and orchestrator. Notably, these trajectories contain real web search results from the Tavily and Serper APIs, so the model learns to navigate multi-step search and synthesis over real data.

Principle-based filtering

Many trajectories were discarded because they timed out or exceeded tool-call limits; those that completed with the expected output were further filtered using a judge model. Completed trajectories are scored with the nvidia/Qwen3-Nemotron-32B-GenRM-Principle judge to predict quality along aspects such as comprehensiveness, readability, accuracy, and relevance. After filtering, approximately 67,000 trajectories were retained for training.
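The filtering step can be sketched as a threshold over per-aspect judge scores. The aspect names come from the text above; the score scale, threshold, and "all aspects must pass" rule are illustrative assumptions.

```python
# Illustrative filter over completed trajectories: keep those whose judge
# scores clear a threshold on every quality aspect. The threshold and the
# all-aspects rule are assumptions for this sketch.
ASPECTS = ("comprehensiveness", "readability", "accuracy", "relevance")

def keep(trajectory, threshold=0.7):
    scores = trajectory["judge_scores"]  # e.g. produced by a judge model
    return all(scores[a] >= threshold for a in ASPECTS)

trajectories = [
    {"id": 1, "judge_scores": {a: 0.9 for a in ASPECTS}},
    {"id": 2, "judge_scores": {**{a: 0.9 for a in ASPECTS}, "accuracy": 0.4}},
]
retained = [t for t in trajectories if keep(t)]
```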

SFT training

Model: NVIDIA Nemotron-3-Super-120B-A12B
Setup: 1 epoch, 5,615 steps, approximately 25 hours on 16×8 NVIDIA H100 GPUs

AI-Q Deep Researcher

The AI-Q Deep Researcher employs a multi-agent architecture (orchestrator, planner, and researcher) with an iterative plan → collect → synthesize loop, citation management, and custom middleware for long-horizon reliability. Optional ensemble and report refinement layers can be enabled to maximize report quality. The multi-agent design also works as a long-context strategy: the orchestrator never sees raw tool responses, because each subagent operates within its own context window and returns only its synthesized output. This keeps the orchestrator’s context focused and prevents inference from being degraded by long, noisy search results.
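The long-context strategy can be sketched as follows: each subagent consumes long raw tool output locally and hands back only a short synthesis, so the orchestrator's context stays small. The stub search and summary format here are hypothetical.

```python
# Sketch of the long-context strategy: each subagent keeps raw tool output
# in its own local "context" and returns only a short synthesis, so the
# orchestrator never accumulates raw search results.
def subagent(task, search):
    raw_results = [search(q) for q in (task, task + " details")]
    # raw_results stays local to the subagent's context window.
    return f"summary({task}): {len(raw_results)} sources synthesized"

def fake_search(query):
    return "x" * 5000  # stands in for a long, noisy web result

orchestrator_context = [subagent(t, fake_search)
                        for t in ["topic A", "topic B"]]
total_chars = sum(len(m) for m in orchestrator_context)
```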

Orchestrator: Coordinates the entire research loop. It calls the planner to create an evidence-based research plan, then calls the researcher multiple times to carry out focused research tasks derived from that plan. Once the research is complete, the orchestrator reviews the plan’s quality constraints, dispatches targeted gap-filling research, and writes a long-form report. An optional refinement step revisits the raw researcher briefs in a fresh context window to edit the report (a second evidence-recovery point).

Planner: Runs in two phases. The scout subagent first maps the information landscape through broad searches. The architect subagent then designs a research plan, including a report outline, targeted search queries, and quality constraints, performing its own searches to validate the structural choices.

Evidence-based planning is key to producing reliable, high-quality reports. Our planner takes stock of the available information before committing to a structure, deciding where to investigate deeply or broadly based on what it actually discovers rather than on assumptions.

Researcher: Deploys multiple specialized subagents in parallel, each with a different lens.

  • Evidence gatherer: facts, statistics, and specific numbers from reliable sources
  • Mechanism explorer: causal explanations and theoretical frameworks
  • Comparator: benchmarks, direct data, and trade-off analysis
  • Critic: counterarguments, limitations, and failure stories
  • Horizon scanner: recent developments and emerging trends

Although the experts share the same search tools, each applies a different analytical framework. Diverse experts studying the same topic often uncover evidence that a single generalist would miss.
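The expert dispatch can be sketched as running one subagent per lens in parallel over a shared (here stubbed) search tool. The lens prompts and names below are paraphrases of the list above, not the blueprint's actual prompts.

```python
# Sketch of dispatching specialized experts in parallel; each uses the
# same (stubbed) search tool but applies a different analytical lens.
from concurrent.futures import ThreadPoolExecutor

LENSES = {
    "evidence_gatherer": "facts and statistics on",
    "mechanism_explorer": "causal explanation of",
    "comparator": "benchmarks and trade-offs for",
    "critic": "limitations and counterarguments to",
    "horizon_scanner": "recent developments in",
}

def expert(name, topic):
    query = f"{LENSES[name]} {topic}"  # lens-specific framing of the topic
    return name, f"findings for: {query}"

def research(topic):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(expert, name, topic) for name in LENSES]
        return dict(f.result() for f in futures)

findings = research("agentic RAG")
```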

The researcher synthesizes the experts’ findings into a unified, cited summary. An LLM then cross-checks this synthesis against the raw expert output in a fresh context window to recover any dropped information.

Configuration-driven flexibility
All components are replaceable: LLMs, tools, and agent graphs are configured through YAML, and different LLMs can be used for the planner, researcher, and orchestrator. For the benchmark submissions, the fine-tuned Nemotron 3 powers the researcher, which processes four times as many tokens as the planner and orchestrator combined.
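The per-role binding can be sketched with a plain dictionary standing in for the YAML file; the keys, model names, and `Agent` class below are hypothetical, not the NeMo Agent Toolkit's actual configuration schema.

```python
# Sketch of per-role model configuration, mimicking the YAML-driven setup:
# each agent role is bound to a model and budget without code changes.
ROLE_CONFIG = {  # would be loaded from a YAML file in the real stack
    "planner":      {"model": "nemotron-3-super", "max_tool_calls": 16},
    "researcher":   {"model": "nemotron-3-super-finetuned", "max_tool_calls": 32},
    "orchestrator": {"model": "nemotron-3-super", "max_tool_calls": 8},
}

class Agent:
    def __init__(self, role, config):
        self.role = role
        self.model = config["model"]
        self.budget = config["max_tool_calls"]

agents = {role: Agent(role, cfg) for role, cfg in ROLE_CONFIG.items()}
```

Swapping a role's model is then a one-line config change rather than a code change.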

Custom middleware for long-term reliability

Each agent and subagent interleaves LLM and tool calls over many steps (often 32 or more). At that scale, the system can fail in ways that never surface in short interactions. Our agent harness provides custom middleware to detect and mitigate these failures.

  • Tool-name sanitizing: LLMs can hallucinate tool names during execution. This middleware applies pattern-based cleaning, alias resolution, and fuzzy matching to recover the intended tool.
  • Reasoning-aware retries: reasoning LLMs may emit thought tokens without a tool call or final response, silently exiting the agent loop. The middleware detects this, keeps the reasoning in context, and retries.
  • Budget enforcement: each agent and subagent has its own tool-call limit. When the limit is reached, the middleware first prompts the LLM to synthesize, then removes the tools entirely and forces a text-only response.
  • Report validation: before returning output, the middleware checks minimum length and section structure. Incomplete reports are retried with a continuation prompt.
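The tool-name sanitizer is the easiest of these to sketch: normalize the name, try an alias table, then fall back to fuzzy matching. The tool and alias lists are invented for illustration.

```python
# Sketch of the tool-name sanitizing middleware: recover a hallucinated
# tool name via normalization, alias lookup, then fuzzy matching.
import difflib

TOOLS = ["web_search", "scholar_search", "write_report"]  # hypothetical
ALIASES = {"search_web": "web_search", "google": "web_search"}

def sanitize_tool_name(name):
    # Pattern-based cleaning: case, hyphens, and spaces.
    cleaned = name.strip().lower().replace("-", "_").replace(" ", "_")
    if cleaned in TOOLS:
        return cleaned
    if cleaned in ALIASES:          # alias resolution
        return ALIASES[cleaned]
    # Fuzzy match as a last resort; None means the call is rejected.
    matches = difflib.get_close_matches(cleaned, TOOLS, n=1, cutoff=0.6)
    return matches[0] if matches else None
```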

Each middleware addresses failure patterns observed in real agent traces. Together, they provide reliability over long horizons.

Ensemble
When enabled, N independent deep research pipelines run in parallel. An LLM reads all N outputs, chooses one as the structural base, and integrates unique content from the others. This ensemble gathers a broader scope of evidence than a single pipeline, directly increasing comprehensiveness and information recall. A proofreading pass removes process artifacts so the output reads as the work of a single author.
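The merge step can be sketched as: pick a base report, then append paragraphs from the other pipelines that the base does not already contain. In the real stack an LLM makes both decisions; here length and exact-match deduplication stand in as crude proxies.

```python
# Sketch of the ensemble merge: one report becomes the structural base and
# unique paragraphs from the other pipelines are folded in. Exact paragraph
# matching is a crude stand-in for the LLM's semantic judgment.
def merge_reports(reports):
    base = max(reports, key=len)        # real stack: an LLM picks the base
    merged = base.split("\n\n")
    seen = set(merged)
    for report in reports:
        for para in report.split("\n\n"):
            if para not in seen:        # integrate only unique content
                merged.append(para)
                seen.add(para)
    return "\n\n".join(merged)

out = merge_reports(["A\n\nB\n\nC", "A\n\nD"])
```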

Post-hoc refiner
An optional final refinement step runs structured instructions over the report to quantify ambiguous claims, deepen entity coverage, reduce hedging and grounding risk, create comparison tables, and strengthen causal reasoning. The rewrite prompts are derived via self-supervised meta-learning on reference reports generated by a pipeline using only a frontier LLM.

Takeaways

NVIDIA AI-Q took first place on both DeepResearch Bench and DeepResearch Bench II with a single stack: a multi-agent deep researcher built on the NVIDIA NeMo Agent Toolkit, a fine-tuned NVIDIA Nemotron 3 model, and custom middleware, with optional ensembles and refiners for maximum report quality. The stack is open, reproducible, and configurable to your needs, delivering cutting-edge results without sacrificing transparency or control.

Join us at NVIDIA GTC in San Jose the week of March 16, 2026 to learn more.

  • S81706 – Reputation-Driven Development: Best Practices for Building Reliable Agents
  • DLIT81725 – Developing Production Agents with Reputation-Driven Design
  • S81570 – From Data to Decisions: Enabling AI Agents with Business Knowledge
  • S81569 – Self-Coding Agents: Architecture, Data Flywheel, and Autonomous Code Repair
  • S81789 – Open Source AI: Shaping the Next Era of Intelligent Digital Workers
