Versa AI hub
Tools

Improving the accuracy of multimodal search and visual document retrieval using the Llama Nemotron RAG model

By versatileai · January 7, 2026 · 6 Mins Read



By Bo Liu

How to build accurate, low-latency visual document retrieval using small, ready-to-use Llama Nemotron models in standard vector databases.

In real-world applications, data is more than just text. It lives in PDFs full of charts, scanned contracts, tables, screenshots, and slide decks, so text-only search systems miss important information. Multimodal RAG pipelines change this by letting text, images, and layout be searched and reasoned over together, yielding more accurate and actionable answers.

This post describes two small Llama Nemotron models for multimodal search of visual documents.

Both models are:

  • Small enough to run on most NVIDIA GPUs
  • Compatible with standard vector databases (one dense vector per page)
  • Designed to reduce hallucinations by grounding generation in better evidence rather than longer prompts

Below we show how they perform in a benchmark of realistic documents.

Why multimodal RAGs need world-class search

The multimodal RAG pipeline combines a retrieval stage with a vision language model (VLM), so responses are grounded in both the retrieved page text and its visual content, not just the raw text prompt.

The embedding model controls which pages are retrieved and passed to the VLM. The reranking model then determines which of those pages are most relevant and therefore carry the most weight in the answer. If either step is inaccurate, the VLM is likely to hallucinate (often with high confidence). A multimodal embedding model, used together with a multimodal reranker, keeps generation grounded in the correct page images and text.

The cutting edge of commercial multimodal search

The llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2 models are designed for developers building multimodal question answering and search over large corpora of PDFs and images.

The llama-nemotron-embed-vl-1b-v2 model is a single-vector (dense) embedding model that efficiently condenses visual and textual information into a single representation. This design ensures compatibility with all standard vector databases and enables millisecond-latency search at enterprise scale.
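To make the one-vector-per-page design concrete, here is a minimal retrieval sketch. The `embed_page` function is a random stand-in for the real model (which would encode the page image and optionally its text), and the brute-force cosine search stands in for whatever vector database you use; everything here is illustrative, not the model's actual API.

```python
import numpy as np

DIM = 2048  # llama-nemotron-embed-vl-1b-v2 emits one 2048-d vector per page

def embed_page(page: dict, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for the embedding model: in practice this encodes the page
    image (and optionally its extracted text) into one dense vector."""
    vec = rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)  # unit norm, so dot product = cosine

rng = np.random.default_rng(0)
pages = [{"doc": "report.pdf", "page": i} for i in range(100)]
index = np.stack([embed_page(p, rng) for p in pages])  # one row per page

# A query embedding that happens to lie near page 42; any vector database
# holding dense vectors can perform this same top-k cosine search.
query = index[42] + 0.01 * rng.normal(size=DIM)
query /= np.linalg.norm(query)
top5 = np.argsort(-(index @ query))[:5]
```

Because every page is a single fixed-size vector, the index is just an (N, 2048) matrix, which is exactly the storage layout standard vector databases expect.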

llama-nemotron-rerank-vl-1b-v2 is a cross-encoder reranking model that reorders the top retrieved candidates by relevance, improving downstream answer quality without changing the storage or index format.
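As a sketch of what second-stage reranking does (and does not) touch: the function below reorders retrieved candidates by a relevance score and leaves the index alone. The `toy_score` term-overlap function is a deliberately simple stand-in for the cross-encoder, which would score the query and the page image plus text jointly.

```python
# Second-stage reranking: score each (query, candidate) pair jointly and
# reorder; the vector index and stored embeddings are never modified.

def rerank(query: str, candidates: list[dict], score_fn) -> list[dict]:
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)

def toy_score(query: str, page: dict) -> int:
    """Term-overlap stand-in for llama-nemotron-rerank-vl-1b-v2, which would
    score the query against the page image and text with a cross-encoder."""
    return sum(term in page["text"] for term in query.split())

candidates = [
    {"page": 1, "text": "quarterly revenue table"},
    {"page": 2, "text": "revenue and margin chart for Q3"},
    {"page": 3, "text": "cover page"},
]
ranked = rerank("Q3 revenue margin", candidates, toy_score)  # page 2 first
```

Because only the ordering of the already-retrieved top-k changes, the reranker can be dropped into an existing pipeline without any re-indexing.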

We evaluated llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2 on five visual document retrieval datasets: a realistic enterprise benchmark consisting of the popular ViDoRe V1, V2, and V3 (8 public datasets) together with 2 internal visual document retrieval datasets:

  • DigitalCorpora-10k: over 1,300 questions based on DigitalCorpora’s corpus of 10,000 documents, with a good mix of text, tables, and charts.
  • Earnings V2: an internal retrieval dataset of 287 questions over 500 PDFs, mostly earnings reports from large technology companies.

Visual document search (page search) benchmark

The table below shows the average retrieval accuracy (Recall@5) across five datasets, with a particular focus on commercially viable dense retrieval models.

We find that llama-nemotron-embed-vl-1b-v2 achieves better retrieval accuracy (Recall@5) on the image and image + text modalities than the previous llama-3.2-nemoretriever-1b-vlm-embed-v1, and also outperforms the small text embedding model llama-nemotron-embed-1b-v2 on the text modality. Finally, the VLM reranker llama-nemotron-rerank-vl-1b-v2 further improves retrieval accuracy by a relative 7.2%, 6.9%, and 6.0% on the text, image, and image + text modalities, respectively.

Note: Image + Text modality means that both the page image and its text (extracted using an ingestion library such as NV-Ingest) are fed to the embedding model as input for more accurate representation and retrieval.

Visual Document Retrieval Benchmark (Page Retrieval) – Average Recall@5 on DigitalCorpora-10k, Earnings V2, ViDoRe V1, V2, V3

Model | Text | Image | Image + Text
llama-nemotron-embed-1b-v2 | 69.35% | – | –
llama-3.2-nemoretriever-1b-vlm-embed-v1 | 71.07% | 70.46% | 71.71%
llama-nemotron-embed-vl-1b-v2 | 71.04% | 71.20% | 73.24%
llama-nemotron-embed-vl-1b-v2 + llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64%
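For reference, the Recall@5 metric reported in these tables is simply the fraction of queries whose relevant page appears among the top five retrieved results. A minimal implementation over hypothetical rankings:

```python
def recall_at_k(rankings: list[list[int]], relevant: list[int], k: int = 5) -> float:
    """Fraction of queries whose relevant page id appears in the top-k."""
    hits = sum(rel in ranked[:k] for ranked, rel in zip(rankings, relevant))
    return hits / len(relevant)

# Hypothetical results for three queries whose relevant page is id 7:
# it ranks 1st, 4th, and 7th, so two of three queries are hits at k=5.
rankings = [[7, 1, 2, 3, 4], [9, 8, 6, 7, 5], [0, 1, 2, 3, 4, 5, 7]]
r5 = recall_at_k(rankings, relevant=[7, 7, 7])  # 2/3
```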

The table below shows the accuracy evaluation of llama-nemotron-rerank-vl-1b-v2 compared to two other publicly available multimodal reranker models: jina-reranker-m0 and MonoQwen2-VL-v0.1. jina-reranker-m0 works well for image-only tasks, but its public weights are restricted to non-commercial use (CC-BY-NC). In contrast, llama-nemotron-rerank-vl-1b-v2 offers superior performance across text and image-text combination modalities, and its permissive commercial license makes it an ideal choice for enterprise deployments.

Model Text Image Image+Text llama-nemotron-rerank-vl-1b-v2 76.12% 76.12% 77.64% jina-reranker-m0 69.31% 78.33% NA MonoQwen2-VL-v0.1 74.70% 75.80% 75.98%

Architecture highlights and training methods

The llama-nemotron-embed-vl-1b-v2 embedding model is a transformer-based encoder with approximately 1.7B parameters. It is a fine-tuned version of the NVIDIA Eagle family of models, built on the Llama 3.2 1B language model and the SigLIP2 400M vision encoder. Embedding models for retrieval are typically trained with a bi-encoder architecture that encodes queries and documents separately. This model applies average pooling to the output token embeddings of the language model and emits a single 2048-dimensional embedding. We train it with contrastive learning to increase the similarity between a query and its relevant documents while decreasing its similarity to negative samples.
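The training setup described above can be sketched in a few lines: average pooling collapses token embeddings into one dense vector, and an InfoNCE-style contrastive loss pulls the query toward its positive document while pushing it away from in-batch negatives. The shapes, temperature, and random inputs below are illustrative assumptions, not the model's real training code.

```python
import numpy as np

def mean_pool(token_embs: np.ndarray) -> np.ndarray:
    """Average-pool (seq_len, dim) token embeddings into one unit vector."""
    v = token_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def info_nce(q: np.ndarray, docs: np.ndarray, pos: int, temp: float = 0.05) -> float:
    """Contrastive loss: raise similarity to docs[pos], lower it to the
    in-batch negatives (all other rows of docs)."""
    logits = docs @ q / temp
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[pos]))

rng = np.random.default_rng(1)
q = mean_pool(rng.normal(size=(16, 2048)))    # pooled 2048-d query embedding
docs = np.stack([mean_pool(rng.normal(size=(16, 2048))) for _ in range(8)])
docs[0] = q                                   # doc 0 is the true positive
loss_aligned = info_nce(q, docs, pos=0)       # small: positive matches query
loss_wrong = info_nce(q, docs, pos=1)         # large: negative treated as positive
```

Minimizing this loss over many (query, positive, negatives) triples is what drives relevant pages to score higher than distractors at retrieval time.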

llama-nemotron-rerank-vl-1b-v2 is a cross-encoder model with approximately 1.7B parameters, also a fine-tuned member of the NVIDIA Eagle family. The final-layer hidden states of the language model are aggregated with average pooling, and a binary classification head is fine-tuned for the ranking task. The model was trained with cross-entropy loss on publicly available, synthetically generated datasets.
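A rough sketch of the reranker's scoring and training objective, under the same caveats: mean-pool the final hidden states of the (query, page) pair, apply a linear classification head, and train the head's logit with binary cross-entropy. The synthetic `relevant`/`irrelevant` hidden states below are invented purely for illustration.

```python
import numpy as np

def rerank_score(hidden: np.ndarray, w: np.ndarray, b: float) -> float:
    """Mean-pool final-layer hidden states of the (query, page) pair, then
    apply the binary classification head; the logit is the rank score."""
    return float(hidden.mean(axis=0) @ w + b)

def bce_loss(logit: float, label: int) -> float:
    """Binary cross-entropy on the head's logit (the training objective)."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)))

rng = np.random.default_rng(2)
dim = 64
w, b = rng.normal(size=dim), 0.0
# Synthetic hidden states: relevant pairs nudged toward the head's weights,
# irrelevant pairs nudged away (purely illustrative data).
relevant = rng.normal(size=(10, dim)) + 0.1 * w
irrelevant = rng.normal(size=(10, dim)) - 0.1 * w

s_pos = rerank_score(relevant, w, b)    # higher logit for the relevant pair
s_neg = rerank_score(irrelevant, w, b)  # lower logit for the irrelevant pair
```

At inference time only the logit is needed: candidates are simply sorted by it, which is why the reranker plugs in behind any retriever.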

How organizations are using these models

Here are three examples of how organizations are applying the new Nemotron embedding and reranking models; each pattern can be adapted to your own systems.

Cadence: Design and EDA Workflow
Cadence models logic design assets such as microarchitectures, specifications, constraints, and verification materials as connected, multimodal documents. As a result, engineers can ask, “I want to extend my interrupt controller to support low-power states. Please indicate which section of the specification I need to change?” and instantly uncover the most relevant requirements. The system then suggests several alternative specification update strategies, compares their tradeoffs, and generates a specification edit that corresponds to the option selected by the user.

IBM: Domain-intensive storage and infrastructure documentation
IBM Storage treats each page of a long PDF (product guide, configuration manual, architecture diagram) as a multimodal document, embeds it, and uses a reranker to prioritize pages where domain-specific terms, acronyms, and product names appear in the correct context before sending it to downstream LLM. This improves the way AI systems interpret storage concepts and reason about complex infrastructure documentation.

ServiceNow: Chat using large sets of PDFs
ServiceNow uses multimodal embeddings to index pages from an organization’s PDFs and applies a reranker to select the most relevant pages for each user query in its Chat with PDF experience. By keeping high-scoring pages in context across turns, agents maintain a more consistent conversation and help users navigate large document collections more effectively.

Let’s get started

You can try the models directly.

Running llama-nemotron-embed-vl-1b-v2 with the vector database of your choice adds multimodal search over PDFs and images. Add llama-nemotron-rerank-vl-1b-v2 as a second-stage reranker over the top-k results to improve search quality without changing the index. If you need an end-to-end retrieval component for your agent, download the Nemotron RAG models. The models are not limited to standalone use; they can also be integrated into ingestion pipelines.

Connect your new model to your existing RAG stack or combine it with other open models on Hugging Face to build multimodal agents that understand PDFs as well as extracted text.

Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA News and following NVIDIA AI on the Nemotron channel on LinkedIn, X, YouTube, and Discord.
