Today, we are pleased to announce Granite 4.0 3B Vision, a compact vision language model (VLM) designed for enterprise document understanding and built to extract reliable information from complex documents, forms, and structured visuals. Granite 4.0 3B Vision excels at the following tasks:

- Table extraction: Accurately parse complex table structures (multi-row, multi-column, etc.) from document images.
- Graph understanding: Convert graphs and diagrams into structured, machine-readable formats, summaries, or executable code.
- Semantic key-value pair (KVP) extraction: Identify and pair semantically meaningful key-value fields across diverse document layouts.
This model ships as a LoRA adapter on Granite 4.0 Micro, a dense language model, modularizing vision and language to enable text-only fallback and seamless integration into mixed pipelines. It also continues to support general visual language tasks, such as generating detailed natural language descriptions of images (e.g., “Describe this image in detail”). The model can be used standalone or in conjunction with Docling to power your document processing pipeline with deep visual understanding capabilities.
How we built Granite 4.0 3B Vision
Granite 4.0 3B Vision’s performance is the result of three major investments: a dedicated chart-understanding dataset built with a new code-guided data augmentation approach; a new variant of the DeepStack architecture that enables the injection of high-detail visual features; and a modular design that keeps the model practical for enterprise deployments.
ChartNet: Teaching models to truly understand charts
Charts pose a particular challenge for VLMs because understanding them requires joint reasoning over visual patterns, numerical data, and natural language. Most VLMs handle this combination poorly, especially when spatial accuracy matters, such as reading precise values from a line graph. To fill this gap, we developed ChartNet, a million-scale multimodal dataset built for chart interpretation and reasoning, which will be detailed in an upcoming CVPR 2026 paper.
ChartNet uses a code-guided synthesis pipeline to generate 1.7 million diverse chart samples across 24 chart types and six plotting libraries (see Figure 1). What makes it unique is that each sample comprises five aligned components (plotting code, rendered image, data table, natural language summary, and QA pairs), giving the model a deep, cross-modal view of not only how a chart looks but also what it means. The dataset also includes a human-annotated real-world subset, filtered for visual fidelity, semantic accuracy, and diversity.
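To make the five-component alignment concrete, a single ChartNet-style sample might look like the toy record below. The field names and contents are our illustration, not the dataset’s actual schema.

```python
# Toy illustration of one aligned ChartNet-style sample.
# Field names and values are assumptions for illustration only.
sample = {
    "plotting_code": (
        "import matplotlib.pyplot as plt\n"
        "plt.bar(['Q1', 'Q2', 'Q3'], [120, 95, 140])\n"
        "plt.title('Quarterly revenue ($k)')\n"
        "plt.savefig('chart.png')\n"
    ),
    "image": "chart.png",                     # the rendered chart
    "data_table": [["quarter", "revenue_k"],  # the underlying data
                   ["Q1", 120], ["Q2", 95], ["Q3", 140]],
    "summary": "Revenue dipped in Q2 and recovered to a yearly high of $140k in Q3.",
    "qa_pairs": [("Which quarter had the lowest revenue?", "Q2")],
}
```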
The result of this design is a training resource that lets VLMs move beyond simply describing charts to genuinely understanding the structured information they encode, with consistent improvements across model sizes, architectures, and tasks.
Figure 1: ChartNet’s synthetic data generation pipeline.
DeepStack: Injecting smarter visual features
Most VLMs inject visual information into the language model at a single point, forcing the model to handle high-level semantics and fine spatial detail simultaneously. Granite 4.0 3B Vision takes a different approach with DeepStack injection: abstract visual features are routed to earlier layers for semantic understanding, while high-resolution spatial features are fed to later layers to preserve detail. The result is a model that understands both the content of a document and where that content sits on the page. This matters for tasks such as table extraction, graph understanding, and KVP parsing, where layout is as important as content. For a complete technical breakdown, see the Model Architecture section of the model card.
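As a rough intuition for multi-layer injection, here is a minimal, simplified sketch in PyTorch. The layer indices, the residual-addition fusion, and the assumption that image tokens occupy the start of the sequence are all illustrative choices of ours, not the model’s actual implementation; defer to the model card for the real architecture.

```python
# Minimal sketch of DeepStack-style multi-layer visual injection (illustrative only).
# Layer indices and the additive fusion are assumptions, not Granite's implementation.
import torch
import torch.nn as nn

class DeepStackInjector(nn.Module):
    def __init__(self, hidden_size: int, vision_size: int, inject_layers: list[int]):
        super().__init__()
        self.inject_layers = inject_layers
        # One projection per injection point, mapping vision features to LM hidden size.
        self.projections = nn.ModuleDict({
            str(layer): nn.Linear(vision_size, hidden_size) for layer in inject_layers
        })

    def forward(self, hidden_states: torch.Tensor, layer_idx: int,
                visual_features: dict[int, torch.Tensor]) -> torch.Tensor:
        # At designated layers, fuse that layer's visual feature level into the
        # language model's hidden states via residual addition.
        if layer_idx in self.inject_layers:
            feats = self.projections[str(layer_idx)](visual_features[layer_idx])
            num_visual = feats.shape[1]
            # Assumption for this sketch: image tokens sit at the front of the sequence.
            prefix = hidden_states[:, :num_visual] + feats
            hidden_states = torch.cat([prefix, hidden_states[:, num_visual:]], dim=1)
        return hidden_states
```

The idea the sketch captures: early injection points receive coarse, semantic features, while later ones receive high-resolution spatial features, so no single layer has to carry both.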
Modularity: one model, two modes
Granite 4.0 3B Vision is packaged as a LoRA adapter on Granite 4.0 Micro rather than as a standalone model. In practice, this means a single deployment can serve both multimodal and text-only workloads, automatically falling back to the base model when vision is not required. This makes enterprise integration straightforward without sacrificing performance, as the sketch below illustrates.
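A minimal sketch of the two-mode pattern using Hugging Face peft follows. The repository ids are placeholders inferred from the names in this post, and the sketch omits the vision encoder entirely; check the model card for the exact identifiers and the complete loading recipe.

```python
# Sketch: one base model serving vision and text-only workloads via a LoRA adapter.
# Repository ids are placeholders; consult the model card for the real names.
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the Granite 4.0 Micro base model, then attach the vision LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-4.0-micro")
model = PeftModel.from_pretrained(base, "ibm-granite/granite-4.0-3b-vision")

# Multimodal request: run with the adapter active (the default state).
# Text-only request: temporarily disable the adapter to recover base behavior.
with model.disable_adapter():
    ...  # generate with the plain Granite 4.0 Micro weights
```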
Performance
Charts: When evaluated on the human-verified ChartNet benchmark using an LLM as judge, Granite 4.0 3B Vision achieved the highest chart2summary score (86.4%) of all evaluated models, including significantly larger ones (see Figure 2). It also ranks second on chart2csv (62.1%), behind Qwen3.5-9B (63.4%), a model more than twice its size.

Figure 2: Performance of Granite 4.0 3B Vision on chart2csv and chart2summary compared to peer vision language models, using LLM-as-a-judge.
Tables: We evaluate table extraction in two settings: cropped tables (isolated table regions) and full-page documents (tables embedded in complex layouts) (see Figure 3). The benchmark suite includes TableVQA-extract (cropped table images), OmniDocBench-tables (full-page documents), and PubTables-v2 (both cropped and full-page configurations). The model is tasked with extracting tables in HTML format and scored using TEDS, a metric that captures both structural and content correctness. Granite 4.0 3B Vision achieved the strongest performance among all evaluated models across the suite, with top scores on PubTables-v2 in both the cropped (92.1) and full-page (79.3) settings, on OmniDocBench-tables (64.0), and on TableVQA-extract (88.1).

Figure 3: Granite 4.0 3B Vision table extraction performance across cropped and full-page benchmarks (TableVQA extraction, PubTables-v2, OmniDocBench tables) measured by TEDS.
Semantic KVP: VAREX is a benchmark specifically designed to differentiate small extraction models; it consists of 1,777 US government forms ranging from simple flat layouts to complex nested tabular structures. Models are evaluated using exact match (EM), a strict metric that requires the key-value pairs extracted by the model to match the ground truth exactly. Granite 4.0 3B Vision achieves 85.5% EM accuracy zero-shot.
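To illustrate what exact match means in this setting, here is a minimal EM scorer over extracted key-value pairs. The normalization step (whitespace and case folding) is our assumption for readability; strictness conventions vary by benchmark.

```python
# Minimal exact-match (EM) scorer for key-value extraction (illustrative).
# Whether to normalize case/whitespace is benchmark-specific; this is one assumption.
def exact_match(predicted: dict[str, str], gold: dict[str, str]) -> float:
    def norm(s: str) -> str:
        return " ".join(s.split()).lower()

    pred = {norm(k): norm(v) for k, v in predicted.items()}
    ref = {norm(k): norm(v) for k, v in gold.items()}
    if not ref:
        return 1.0 if not pred else 0.0
    # A pair counts only if both the key and its value match the ground truth.
    hits = sum(1 for k, v in ref.items() if pred.get(k) == v)
    return hits / len(ref)

print(exact_match({"Name": "Jane Doe"}, {"name": "jane doe"}))  # 1.0
```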
How to use
Granite 4.0 3B Vision can operate as a standalone visual information extraction engine or as part of a fully automated document processing pipeline using Docling. This model is designed to support scalable and accurate extraction across a variety of document types and visual formats.
1. Standalone images: Granite 4.0 3B Vision can be run directly on individual images, which is useful for applications that need targeted visual extraction without changing upstream systems. This makes it easy to integrate into existing automation workflows and well suited to lightweight, task-specific tools (form parsers, chart analyzers, etc.); a minimal inference sketch follows below.
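The sketch below shows standalone inference with Hugging Face transformers. The repository id and the exact chat format are assumptions based on how earlier Granite vision models are served; defer to the usage snippet on the model card.

```python
# Sketch: running the model directly on a single image.
# The model id and prompt format are placeholders; see the model card.
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "ibm-granite/granite-4.0-3b-vision"  # placeholder id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

image = Image.open("invoice.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract all key-value pairs from this form."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```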
2. Integrated document understanding pipeline with Docling: Granite 4.0 3B Vision can also be integrated seamlessly with Docling to support complete end-to-end document understanding. This mode (see the sketch after this list) allows you to:
- Process multi-page PDFs at scale
- Automatically detect figures, tables, and other visual elements with Docling, then segment, crop, and route the clean crops to the Granite Vision model for fine-grained extraction
- Run efficient workflows with lower overall computational cost and faster throughput
- Achieve greater accuracy, more reliable extraction, and significantly improved efficiency across large document collections
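A sketch of the Docling hand-off, assuming Docling’s DocumentConverter API and picture-image generation options; the final routing step to the vision model is left as a comment, since it would reuse the standalone snippet shown earlier.

```python
# Sketch: Docling detects and crops figures, then hands crops to the VLM.
# Assumes Docling's DocumentConverter API; option names may differ by version.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Ask Docling to render crops of detected pictures at 2x resolution.
opts = PdfPipelineOptions(generate_picture_images=True, images_scale=2.0)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("financial_report.pdf")

for idx, picture in enumerate(result.document.pictures):
    crop = picture.get_image(result.document)  # PIL.Image of the detected figure
    crop.save(f"figure_{idx}.png")  # then hand each crop to Granite 4.0 3B Vision,
    # e.g. with a chart2csv-style prompt, via the standalone snippet above
```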
Example use cases
- Form processing: Use the KVP capability to extract structured fields from invoices, forms, and receipts, or use the image2text capability to generate natural language descriptions of diagrams (e.g., “Describe this image in detail”).
- Financial report analysis: Use Docling to parse reports, discover diagrams, and crop visual elements. Process charts with Granite Vision’s chart2csv and chart2code capabilities, and tables with the tables_json capability, to transform them into structured, machine-readable data that enables actionable insights.
- Research document intelligence: Leverage Docling to handle OCR and layout parsing across dense academic PDFs, passing extracted figures to chart2summary and table crops to tables_html, so visual content and free-form text are processed in a single pipeline.
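The capability names above (chart2csv, chart2summary, tables_html, and so on) correspond to task-specific prompts. The phrasings below are hypothetical illustrations of how such prompts might look; the exact prompts the model expects are documented on the model card.

```python
# Hypothetical prompt phrasings for the capabilities named above; check the
# model card for the exact task prompts the model was trained to follow.
TASK_PROMPTS = {
    "image2text": "Describe this image in detail.",
    "kvp": "Extract all key-value pairs from this document.",
    "chart2csv": "Convert this chart into a CSV table.",
    "chart2code": "Write plotting code that reproduces this chart.",
    "chart2summary": "Summarize the key findings shown in this chart.",
    "tables_html": "Extract the table in this image as HTML.",
    "tables_json": "Extract the table in this image as JSON.",
}
```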
Try it now
Granite 4.0 3B Vision is released under the Apache 2.0 license and is available now on Hugging Face. Complete technical details, training methods, and benchmark results are available on the model card. We’d love to hear what you build with it; please share your feedback in the community tab.

