We are pleased to release Holotron-12B, a multimodal computer-use model from H Company. Post-trained from the open NVIDIA Nemotron-Nano-2 VL model with H Company’s proprietary data mixture, Holotron-12B is the result of close collaboration between our laboratories to design a new type of model optimized primarily for scale and performance in production environments.
H Company is a member of the NVIDIA Inception program. The model is available now on Hugging Face.
Most current multimodal models are optimized primarily for static vision tasks or instruction following. Holotron-12B, like the Holo2 models before it, has a different goal: to serve as a policy model for computer-use agents that must perceive, decide, and act efficiently in interactive environments.
With Holotron-12B, we wanted a model that scales efficiently in production while maintaining strong performance on agent benchmarks and handling long, multi-image contexts. The NVIDIA Nemotron model provides a strong foundation on the inference side, and the development of Holotron-12B demonstrates how much further the model can go with additional training.
High-throughput inference with hybrid SSM architecture
Holotron-12B’s significant leap in inference efficiency comes from its underlying Nemotron architecture, which combines state-space model (SSM) layers with attention layers. Unlike purely transformer-based models, this hybrid design is optimized for high-throughput serving. State-space layers scale gracefully to long contexts because they avoid the quadratic computational cost of full attention, which particularly benefits agent workloads with multiple images or long interaction histories. From an inference perspective, the SSM’s main contribution is a large reduction in memory footprint: while vanilla attention stores K and V activations per token and per layer (the infamous KV cache), an SSM is a linear recurrent model that keeps only a constant-size state per layer for each generated sequence, regardless of sequence length.
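To make the memory argument concrete, here is a back-of-the-envelope comparison. This is a minimal sketch: the layer counts, head dimensions, and state sizes below are illustrative placeholders, not Holotron-12B’s actual configuration.

```python
# Back-of-the-envelope memory comparison: KV cache vs. constant SSM state.
# All architecture numbers below are illustrative placeholders, not the
# actual Holotron-12B configuration.

BYTES_BF16 = 2

def kv_cache_bytes(seq_len: int, layers: int = 40, kv_heads: int = 8,
                   head_dim: int = 128) -> int:
    """Full attention stores K and V per token, per layer: memory grows
    linearly with sequence length."""
    return seq_len * layers * kv_heads * head_dim * 2 * BYTES_BF16

def ssm_state_bytes(layers: int = 40, state_size: int = 128,
                    channels: int = 8192) -> int:
    """An SSM layer keeps a fixed-size recurrent state per sequence,
    independent of how many tokens have been generated."""
    return layers * state_size * channels * BYTES_BF16

for seq_len in (4_096, 32_768, 131_072):
    kv = kv_cache_bytes(seq_len) / 2**20
    ssm = ssm_state_bytes() / 2**20
    print(f"{seq_len:>7} tokens: KV cache ~{kv:8.1f} MiB | SSM state ~{ssm:6.1f} MiB")
```

The KV-cache line grows linearly with context length while the SSM state stays flat, which is exactly what allows larger effective batch sizes on long-context agent workloads.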
We evaluated this on a real-world multimodal agent workload derived from the WebVoyager benchmark, featuring long contexts, multiple high-resolution images, and high request concurrency (100 benchmark workers). Running on a single H100 GPU with vLLM and its latest SSM optimizations (v0.14.1), Holotron-12B achieved over 2x higher throughput than Holo2-8B. This makes Holotron-12B an attractive choice for throughput-constrained workloads such as data generation, annotation, and online reinforcement learning.
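For reference, here is a minimal offline-inference sketch using vLLM’s Python API. The Hugging Face repository id is an assumption on our part; check the model card for the exact id and any recommended serving flags.

```python
# Minimal vLLM offline-inference sketch. The repository id is a placeholder
# assumption -- consult the Hugging Face model card for the exact id and
# recommended flags (the checkpoint may also require trust_remote_code).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Hcompany/Holotron-12B",   # assumed repo id
    max_model_len=32768,             # long agent contexts; adjust to your VRAM
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Describe the next UI action for the given task."], params)
print(outputs[0].outputs[0].text)
```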

In a controlled experimental setup (see Figure 2), Holotron-12B continued to scale efficiently as concurrency increased, with total token throughput rising steadily to 8.9k tokens/sec at the maximum concurrency of 100. In contrast, Holo2-8B’s total token throughput plateaued much earlier, at 5.1k tokens/sec. This behavior highlights the main strengths of the Nemotron architecture: more efficient VRAM utilization and a smaller overall memory footprint, which allow larger effective batch sizes on the same hardware. Even at large batch sizes, Holotron-12B maintains strong throughput.
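The sketch below shows one way to run this kind of concurrency sweep against an OpenAI-compatible vLLM endpoint (e.g. started with `vllm serve`). The endpoint URL and model id are placeholder assumptions, and it counts only output tokens, so it is a rough proxy rather than our exact benchmark harness.

```python
# Concurrency-scaling sketch against an OpenAI-compatible vLLM endpoint.
# Endpoint URL and model id below are placeholder assumptions.
import asyncio, time
from openai import AsyncOpenAI

async def one_request(client: AsyncOpenAI) -> int:
    resp = await client.chat.completions.create(
        model="Hcompany/Holotron-12B",  # assumed model id
        messages=[{"role": "user", "content": "Plan the next browser action."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def measure(client: AsyncOpenAI, concurrency: int, total: int = 200) -> float:
    sem = asyncio.Semaphore(concurrency)

    async def guarded() -> int:
        async with sem:
            return await one_request(client)

    start = time.perf_counter()
    tokens = sum(await asyncio.gather(*(guarded() for _ in range(total))))
    return tokens / (time.perf_counter() - start)  # output tokens per second

async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    for c in (1, 10, 50, 100):
        print(f"concurrency {c:>3}: {await measure(client, c):.0f} output tok/s")

asyncio.run(main())
```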

Holotron-12B Training and Evaluation
Holotron-12B was trained in two stages. We started from Nemotron-Nano-12B-v2-VL-BF16, a multimodal base model released by NVIDIA, and then performed supervised fine-tuning on H Company’s proprietary localization and navigation data mix, focusing on screen understanding, grounding, and UI-level interactions. The final checkpoint was trained on roughly 14 billion tokens.
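As a rough illustration of the stage-2 setup, the sketch below loads the base checkpoint for supervised fine-tuning. The repository id and loading path are assumptions (the checkpoint may require trust_remote_code and a model-specific processor), and H Company’s actual data mix and training stack are not public.

```python
# Illustrative stage-2 SFT setup. Repo id and loading details are
# assumptions; H Company's actual data mix and training stack are private.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

BASE = "nvidia/Nemotron-Nano-12B-v2-VL-BF16"  # assumed repo id

processor = AutoProcessor.from_pretrained(BASE, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.gradient_checkpointing_enable()  # helps fit long multi-image contexts

# ... build a dataset of (screenshot(s), instruction, action) examples and
# run a standard SFT loop, masking the loss to the action tokens.
```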
Agent benchmark
On computer-use and navigation benchmarks, Holotron-12B shows significant improvements over the base Nemotron model and compares favorably with established agent models. WebVoyager performance increases from 35.1% to 80.5%, surpassing Holo2-8B on this benchmark and demonstrating the model’s ability to operate effectively in agent settings.

Localization benchmark
Holotron-12B also delivers significant improvements over the base Nemotron model on localization and grounding benchmarks such as OSWorld-G, GroundUI, and WebClick.

Holotron-12B shows that the NVIDIA Nemotron VL model, combined with the right training recipe and infrastructure work, provides a strong foundation for real-world multimodal agents.
The model delivers strong agent performance, significantly higher inference throughput, and a clear path to future improvements, especially around high-resolution vision training.
We look forward to seeing what others build with Holotron-12B. Models and checkpoints are available on Hugging Face under the NVIDIA Open Model License.
NVIDIA recently announced the release of Nemotron 3 Omni. Building on the success of Holotron-12B, we are preparing to post-train this next-generation multimodal model. By leveraging the enhanced hybrid SSM-attention and MoE architectural foundations of the Nemotron 3 family, we aim to deliver further leaps in inference efficiency and multimodal accuracy with the newly announced Nemotron 3 Omni. This evolution will push Holotron beyond research and into commercial applications, providing enterprises with the high-throughput, low-latency performance needed for large-scale autonomous computer-use deployments.

