Modern search systems are increasingly designed to handle heterogeneous document images that may include text, tables, graphs, diagrams, and other visual components. Accurately retrieving relevant information across these diverse modalities is a central challenge. Multimodal embedding models built on Vision Language Model (VLM) foundations map different content types into a shared representation space, enabling integrated search over text, images, and structured visual elements. Encoding an entire query or candidate document into a single vector is a common technique that prioritizes efficiency and low storage, but research is increasingly directed toward late-interaction embedding architectures that provide fine-grained multivector interactions between queries and documents, as exemplified by the recently released, commercially ready Llama-Nemotron-Embed-VL-1B. By retaining richer per-token representations, these models capture more detailed semantic relationships and achieve higher accuracy on a variety of multimodal benchmarks.
NVIDIA announces the Nemotron ColEmbed V2 family, a set of late-interaction embedding models available in three sizes (3B, 4B, and 8B) and designed for high-accuracy multimodal retrieval. These models take a unified approach to text and image retrieval and deliver state-of-the-art performance on the ViDoRe V1, V2, and V3 benchmarks.
Nemotron ColEmbed V2 Highlights (TL;DR)
nemotron-colembed-vl-8b-v2, nemotron-colembed-vl-4b-v2, and llama-nemotron-colembed-vl-3b-v2 are state-of-the-art late-interaction embedding models that rank 1st, 3rd, and 6th in their respective weight classes on the ViDoRe V3 benchmark (as of February 3, 2026), which evaluates visual document retrieval for enterprise use cases.

We extend the late-interaction mechanism introduced by ColBERT for multivector embedding matching to a multimodal setting, allowing fine-grained interaction between query tokens and document tokens, whether textual or visual. As shown in the diagram, each query token embedding interacts with all document token embeddings through the MaxSim operator, which selects the maximum similarity for each query token and sums these maxima to produce the final relevance score. This approach increases storage requirements because token embeddings must be stored for the entire document corpus. During inference, the query token embeddings are computed and matched against the stored document embeddings using the same MaxSim operation.
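The MaxSim scoring described above can be sketched in a few lines. This is an illustrative toy implementation with made-up tensors, not the models' actual inference code; it assumes token embeddings are already L2-normalized so dot products are cosine similarities.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late-interaction relevance score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token...
    per_token_max = sim.max(axis=1)      # (num_query_tokens,)
    # ...and sum those maxima into a single relevance score.
    return float(per_token_max.sum())

# Toy example: 2 query tokens, 3 document tokens, dim=2.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.7071, 0.7071], [0.0, 1.0]])
print(maxsim_score(q, d))  # 2.0 — each query token finds a perfect match
```

Because each query token independently picks its best document token, a document is rewarded for covering every part of the query, which is what makes the interaction fine-grained compared to a single-vector dot product.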
The Nemotron ColEmbed V2 family is aimed at researchers building visual document retrieval applications where accuracy is paramount, in contrast to the 1B single-vector model released last month, which was designed for commercial environments that require minimal storage and high throughput. The V2 models are useful for multimodal RAG systems in which text queries retrieve document page images containing text, charts, tables, and infographics. Each model outputs a multivector embedding of the input query or document. Potential applications include multimedia search engines, cross-modal retrieval systems, and conversational AI with rich input understanding.
ViDoRe V3 is a new benchmark designed to set the industry standard for multimodal enterprise document retrieval. It addresses a key challenge in production RAG systems: accurately extracting information from complex, visually rich documents. The nemotron-colembed-vl-8b-v2 model ranks first on the ViDoRe V3 leaderboard, setting a new standard for multimodal document retrieval accuracy.
Visual Document Retrieval Benchmark (Page Retrieval) – Average NDCG@10 on public and private tasks in ViDoRe V3.
Model architecture
llama-nemotron-colembed-vl-3b-v2 is a transformer-based multimodal embedding model whose VLM backbone combines google/siglip2-giant-opt-patch16-384 and meta-llama/Llama-3.2-3B. The nemotron-colembed-vl-8b-v2 and nemotron-colembed-vl-4b-v2 models were built from Qwen3-VL-8B-Instruct and Qwen3-VL-4B-Instruct, respectively.
Architecture changes:
Bidirectional attention – Our models use bidirectional self-attention instead of the causal self-attention of the original LLM decoder, allowing each token to learn rich representations from the entire input sequence.
ColBERT-style late-interaction output – For each input token, the model outputs an n-dimensional embedding vector of floating-point values, where n is determined by the model's hidden size.
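Conceptually, switching from causal to bidirectional self-attention amounts to replacing the lower-triangular attention mask with a full mask, so every token can attend to tokens on both its left and its right. A minimal sketch of that mask change (not the models' actual implementation):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Decoder default: token i may only attend to tokens 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len: int) -> np.ndarray:
    # Embedding variant: every token attends to the entire sequence,
    # so each output embedding reflects full left and right context.
    return np.ones((seq_len, seq_len), dtype=bool)

m = causal_mask(4)
# Under the causal mask, token 0 cannot see token 3; bidirectionally it can.
print(m[0, 3], bidirectional_mask(4)[0, 3])  # False True
```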
Training methodology
The nemotron-colembed-vl-8b-v2, nemotron-colembed-vl-4b-v2, and llama-nemotron-colembed-vl-3b-v2 models were each trained using a biencoder architecture, in which a pair of inputs (such as a query and a document) is encoded independently by the embedding model. Contrastive learning is used to maximize the late-interaction similarity between a query and the documents that contain the answer, while minimizing the similarity between the query and sampled negative documents that do not help answer it.
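The contrastive objective over late-interaction scores can be sketched as an InfoNCE-style loss. This is a toy illustration; the `maxsim` helper, the temperature value, and the example tensors are assumptions for demonstration, not the actual training code.

```python
import numpy as np

def maxsim(q: np.ndarray, d: np.ndarray) -> float:
    # Late-interaction score: sum of per-query-token maximum similarities.
    return float((q @ d.T).max(axis=1).sum())

def contrastive_loss(q, pos_doc, neg_docs, temperature=0.05) -> float:
    """-log softmax of the positive document's MaxSim score against all
    candidates: pushes the positive above the sampled negatives."""
    scores = np.array([maxsim(q, d) for d in [pos_doc] + neg_docs]) / temperature
    # Numerically stable log-sum-exp for the softmax denominator.
    m = scores.max()
    log_z = m + np.log(np.exp(scores - m).sum())
    return float(log_z - scores[0])

# Toy example: the positive matches the query exactly, the negative is opposite,
# so the loss is near zero.
q = np.eye(2)
loss = contrastive_loss(q, np.eye(2), [np.array([[-1.0, 0.0], [0.0, -1.0]])])
```

Minimizing this loss simultaneously raises the query-positive MaxSim score and lowers the query-negative scores, which is the behavior the paragraph above describes.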
The llama-nemotron-colembed-vl-3b-v2 model was trained in a two-stage pipeline: it was first fine-tuned on 12.5 million text QA pairs, then on text-image pairs. The nemotron-colembed-vl-8b-v2 and nemotron-colembed-vl-4b-v2 models were fine-tuned using only the text-image pairs (the second stage).
Our training dataset contains both text-only and text-image pairs, and we apply hard negative mining to improve retrieval performance, following the positive-aware hard negative mining technique presented in the NV-Retriever paper.
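A minimal sketch of positive-aware filtering in the spirit of NV-Retriever's approach: candidates that score nearly as high as the labeled positive are treated as likely false negatives and removed before selecting the hardest negatives. The function name, margin, and scores here are illustrative assumptions, not the paper's exact procedure.

```python
def mine_hard_negatives(pos_score, candidate_scores, margin=0.95, top_k=4):
    """Keep the highest-scoring candidates as hard negatives, but first drop
    any candidate whose score exceeds margin * pos_score -- such candidates
    are likely unlabeled positives rather than true negatives."""
    threshold = margin * pos_score
    kept = [(score, idx) for idx, score in enumerate(candidate_scores)
            if score < threshold]
    kept.sort(reverse=True)              # hardest (highest-scoring) first
    return [idx for _, idx in kept[:top_k]]

# Candidates 0 and 3 score almost as high as the positive (0.90), so they are
# filtered out; the hardest remaining negatives are returned.
negatives = mine_hard_negatives(0.90, [0.89, 0.50, 0.20, 0.88, 0.10], top_k=2)
print(negatives)  # [1, 2]
```

Filtering by a fraction of the positive's score, rather than an absolute threshold, adapts the cutoff to how confidently each query matches its positive document.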
✨ Main improvements over V1:
⚗️ Advanced model merging: We combine the strengths of multiple fine-tuned checkpoints through post-training model merging, delivering ensemble-level accuracy without adding inference latency.
🌍 Enhanced synthetic data: We significantly enriched our training mixture with diverse multilingual synthetic data to improve semantic consistency across languages and complex document types.

Start building with Nemotron ColEmbed V2
The Nemotron ColEmbed V2 models represent a major step forward in high-accuracy text and image retrieval, delivering state-of-the-art results on the ViDoRe V1, V2, and V3 benchmarks. The availability of 3B, 4B, and 8B variants provides a solid foundation for future research and advanced experimentation in multimodal retrieval applications.
To start using the Nemotron ColEmbed V2 models, download nemotron-colembed-vl-8b-v2, nemotron-colembed-vl-4b-v2, and llama-nemotron-colembed-vl-3b-v2 from Hugging Face. To learn more about the NVIDIA NeMo Retriever family of Nemotron RAG models, visit our product page or get the microservice containers from NVIDIA NGC, and explore state-of-the-art retrieval in your own applications and workflows.
Try the NVIDIA Enterprise RAG Blueprint with the Nemotron RAG models, which feature the same technology behind these top-ranking ViDoRe V3 results.

