Modern documents mix text, tables, charts, and scanned pages. Many teams want to extract this content on local hardware for privacy reasons. IBM Granite 4.0 3B Vision is a compact vision-language model for this need.
It focuses on chart, table, and key-value extraction while keeping hardware demands moderate.
IBM Granite 4.0 3B Vision is a 3‑billion‑parameter vision-language model for document data extraction. A vision-language model, or VLM, can read images and text together and answer questions or output structured data.
This model targets enterprise documents with charts, tables, forms, and complex layouts. It is available as a LoRA adapter on top of the Granite 4.0 Micro language model and uses an Apache 2.0 license.
In Granite 4.0 3B Vision, LoRA layers cover attention and MLP blocks so the base model can serve both text and multimodal tasks. This design keeps deployment flexible while keeping the model size moderate for local use.
The model focuses on three main jobs:
- <chart2csv>, <chart2summary>, and <chart2code> trigger chart-specific outputs such as data tables, text summaries, or Python plotting scripts.
- <tables_json>, <tables_html>, and <tables_otsl> return structured table outputs with rows, columns, and merges.

Use a Linux or Windows machine with a recent NVIDIA GPU. For comfortable speed, aim for a GPU with 8–12 GB of VRAM, such as an RTX 3060. The model can run on CPU, but performance drops and is best only for tests. Ensure that Python 3.10 or newer and Git are installed.
Install the latest NVIDIA driver that matches your GPU. Install the CUDA toolkit and cuDNN that match your driver and PyTorch version. Follow NVIDIA’s platform guide for your OS to avoid version conflicts.
Create a virtual environment so Granite 4.0 3B Vision and its dependencies stay separate from other projects. For example, create a venv directory and activate it, then install core Python packages:
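For example, on a Unix-like shell (the directory name granite-venv is arbitrary; Windows users activate with granite-venv\Scripts\activate instead):

```shell
# Create an isolated environment and activate it
python3 -m venv granite-venv
. granite-venv/bin/activate
```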
```shell
pip install --upgrade pip
pip install vllm openai huggingface_hub pillow
```

vLLM is a high-performance inference engine that supports image and LoRA features for Granite 4.0 3B Vision. The openai package provides a simple client for the HTTP API.
Granite 4.0 3B Vision is hosted on Hugging Face under the ID ibm-granite/granite-4.0-3b-vision. You can let vLLM download weights on first run, so no manual clone is required. If you prefer a full local copy, run:
```shell
git lfs install
git clone https://huggingface.co/ibm-granite/granite-4.0-3b-vision
```

The repository includes example images and a vLLM integration script.
The Hugging Face repo includes a start_granite4_vision_server.py script that wires Granite 4.0 3B Vision into vLLM. This script exposes an OpenAI‑style endpoint on a configurable host and port. A typical start command looks like this:
```shell
python start_granite4_vision_server.py \
  --model ibm-granite/granite-4.0-3b-vision \
  --trust_remote_code --host 0.0.0.0 --port 8000 \
  --hf-overrides '{"adapter_path": "ibm-granite/granite-4.0-3b-vision"}'
```

This setup lets vLLM load the Granite 4.0 Micro base and apply the vision LoRA per request. Text-only prompts use the base model, while image prompts trigger the vision path.
IBM also maintains GGUF‑encoded variants of Granite models and conversion scripts for llama.cpp and Ollama.
GGUF is a file format for quantized models that fit on smaller GPUs or CPUs. For Granite 4.0 3B Vision, the IBM GGUF repository includes configuration entries and scripts for local testing with an Ollama server on macOS.
To use this route, convert the model to GGUF using the IBM tools and load it in an Ollama configuration file. Expect some feature gaps compared to the full vLLM path, especially around advanced multimodal features.
Granite 4.0 3B Vision exposes a chat completion style interface. Each request includes a list of messages with roles like user and assistant.
The user message contains two parts: an image and a control-tag text string. The image is passed as a URL or as base64 data, and the text holds a tag such as <chart2csv> or <tables_json>.
With the OpenAI Python client, a request has this structure:
- Point base_url to http://localhost:8000/v1 and use an arbitrary API key.
- Set content to a list with an image_url item and a text item that holds the tag.
- Call client.chat.completions.create() with model="ibm-granite/granite-4.0-3b-vision" and read the text from the first choice.

The Hugging Face model card provides a complete code example for chart and table tasks.
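The steps above can be sketched as follows. This is a minimal illustration, not the model card's official example: build_messages is our own helper, and the commented section assumes the vLLM server from the previous section is running on localhost:8000.

```python
import base64

def build_messages(image_bytes: bytes, tag: str) -> list:
    """Build the OpenAI-style message list: one image part plus one
    text part carrying the Granite control tag."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": tag},
        ],
    }]

# Actually sending the request needs a running server, so it is only
# sketched here:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
# with open("chart.png", "rb") as f:
#     messages = build_messages(f.read(), "<chart2csv>")
# resp = client.chat.completions.create(
#     model="ibm-granite/granite-4.0-3b-vision", messages=messages)
# print(resp.choices[0].message.content)
```

Swapping the tag string is the only change needed to move between chart, table, and other extraction modes.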
For chart extraction, Granite 4.0 3B Vision supports three main tags.
- <chart2csv>: The model reads the chart and outputs a CSV text table with headers and numeric values.
- <chart2summary>: The model returns a short, structured description of the main trends and values in the chart.
- <chart2code>: The model returns Python plotting code that recreates the chart, often with libraries like Matplotlib.

You can reuse the same chart image and change only the tag to get different outputs. This approach keeps prompts simple and removes the need to design complex natural-language instructions for each chart.
For tables, Granite 4.0 3B Vision supports three structured formats via tags.
- <tables_json>: Returns a JSON structure with table dimensions, cell content, and merge information.
- <tables_html>: Returns HTML <table> markup that standard tools can render and parse.
- <tables_otsl>: Returns OTSL, an intermediate markup language that captures spans, merges, and structure for further processing.

To use these tags, send a page image that contains one or more tables. The model segments the tables and fills the chosen format. Output can feed into pipelines that load JSON into databases or HTML into downstream parsers.
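Since <tables_html> returns standard <table> markup, the downstream-parser step can be as simple as the following sketch, which uses only Python's standard library (TableExtractor and parse_table_html are our own illustrative names):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect rows of cell text from <table> markup."""
    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = None      # row currently being filled
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

def parse_table_html(html_text: str) -> list:
    parser = TableExtractor()
    parser.feed(html_text)
    return parser.rows
```

Note that this simple sketch flattens merged cells; for tables with spans, the <tables_json> or <tables_otsl> outputs carry the merge information explicitly.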
Semantic key‑value extraction reads forms and pulls out fields such as name, date, and total amount. Instead of matching exact labels, the model uses descriptions from the prompt.
A typical prompt includes a short schema description like “Extract the following fields: invoice_number, issue_date, supplier_name, net_amount, tax_amount, total_amount. Return JSON.”
The model then scans the form image and returns a JSON object with keys and values.
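In practice, model replies sometimes wrap the JSON in prose or a code fence, so a small defensive parser helps. The helper below is our own sketch, not part of any Granite tooling:

```python
import json
import re

def extract_json_object(reply: str) -> dict:
    """Pull the first JSON object out of a model reply that may
    surround it with prose or Markdown fences."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model reply")
    return json.loads(match.group(0))
```

A reply like 'Here is the result: {"invoice_number": "INV-42"}' then parses cleanly even though it is not bare JSON.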
On the VAREX benchmark, which uses more than 1,700 US government forms, Granite 4.0 3B Vision reaches 85.5% exact‑match accuracy in zero‑shot mode. Exact‑match means that all key‑value pairs for a form must match ground truth to count as correct.
Granite 4.0 3B Vision focuses on single images per request, but it fits into larger pipelines with multiple pages.
For multi‑page PDFs, split pages into images and send them one by one with tags that match each task. Use a separate text‑only model, such as a Granite text model, for long‑form analysis or summarization of extracted content.
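A hedged sketch of that fan-out step, with placeholder names: page_images would come from whatever PDF rasterizer you use, and ask is any callable that sends one Granite request and returns the reply text.

```python
# Map each extraction task to its Granite control tag.
TASK_TAGS = {
    "chart_csv": "<chart2csv>",
    "chart_summary": "<chart2summary>",
    "table_json": "<tables_json>",
}

def extract_pages(page_images, task, ask):
    """Send one request per page image and collect replies in order."""
    tag = TASK_TAGS[task]
    return [ask(image, tag) for image in page_images]
```

Keeping the transport behind the ask callable makes the loop easy to test and lets the same pipeline target vLLM locally or a hosted endpoint later.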
The vLLM integration allows both text‑only and image requests on one endpoint. That keeps deployment simple for applications that mix RAG, question answering, and structured extraction.
The table below summarizes key Granite 4.0 3B Vision benchmark scores on chart, table, and form extraction tasks.
These scores show that Granite 4.0 3B Vision performs near the top of its parameter class. It often matches or outperforms larger vision models on chart, table, and form extraction tasks.
IBM created ChartNet, a dataset with about 1.7 million synthetic and real chart samples. Each sample includes plotting code, a rendered chart image, the underlying data table, a natural‑language chart summary, and question–answer pairs.
Granite 4.0 3B Vision was evaluated on a human‑verified ChartNet benchmark that tests Chart2Summary and Chart2CSV quality. An LLM‑as‑a‑judge procedure scores outputs on correctness and faithfulness to the chart.
On this benchmark, Granite 4.0 3B Vision achieves 86.4% for Chart2Summary and 62.1% for Chart2CSV, placing it first and second, respectively, among all tested models.
For tables, IBM built a unified evaluation suite spanning PubTablesV2, OmniDocBench, and TableVQA. PubTablesV2 measures table structure and content reconstruction with the TEDS metric.
OmniDocBench includes complex layouts with multiple tables and surrounding content. TableVQA tests whether models can answer questions about tables after extracting them.
Granite 4.0 3B Vision outputs HTML tables, which are then scored by TEDS. The model leads across this suite, with 92.1 on cropped PubTablesV2 tables, 79.3 on full‑page PubTablesV2 tables, 64.0 on OmniDocBench, and 88.1 on TableVQA.
These results indicate strong performance both on clean table crops and on full pages with surrounding text and figures.
The VAREX benchmark targets structured key‑value extraction from real US government forms. It includes 1,777 forms with diverse layouts, from simple flat structures to nested and tabular fields. Models must output key‑value pairs for a specified schema and are scored by exact‑match accuracy.
Granite 4.0 3B Vision obtains 85.5% exact‑match accuracy in a zero‑shot setting. Zero‑shot here means the model was not fine‑tuned on VAREX forms.
The score places it third among 2–4B parameter models as of March 2026 and makes it competitive with larger models.
Earlier Granite Vision models, such as Granite Vision 3.2 2B and Granite Vision 3.3 2B, already reached strong scores on document benchmarks like DocVQA and ChartQA.
For example, granite‑vision‑3.3‑2b reports ChartQA scores around 0.87 and DocVQA scores up to 0.91. Granite 4.0 3B Vision extends this focus on charts and tables with a modern DeepStack variant, ChartNet data, and stronger KVP extraction.
The table below compares Granite 4.0 3B Vision with related and competing vision models.
Granite 4.0 3B Vision itself is an open‑source model, but many teams mix local and hosted options. The table below summarizes realistic pricing paths based on current IBM watsonx.ai information and related Granite Vision models.
Prices may change, and regional differences apply, so always check the latest IBM pricing page before planning costs.
Granite 4.0 3B Vision stands out through its narrow but deep focus on document structure rather than broad natural image tasks. The ChartNet dataset and
DeepStack architecture together push chart summarization and chart‑to‑CSV extraction to the top of current benchmarks for models of similar or larger size. Its table extraction results on PubTablesV2, OmniDocBench, and TableVQA show that a 3B model can rival or beat much larger VLMs in this niche.
At the same time, the Apache 2.0 license and vLLM integration make it practical for local deployments where data cannot leave internal networks.
The chart below maps common needs to suitable options.
This example walks through a simple workflow: extract a chart, a table, and key fields from a page of a financial report using Granite 4.0 3B Vision running on vLLM.
A. Prepare the input image

Save the report page as report_q4_2025.png.

B. Start the Granite 4.0 3B Vision server

Launch the start_granite4_vision_server.py script. The server exposes an OpenAI-style endpoint at http://localhost:8000/v1.

C. Extract the chart as CSV

A script reads report_q4_2025.png, encodes it as base64, and sends a chat completion request with the tag <chart2csv>. The response contains CSV text with a header such as Quarter,Revenue and numerical values that match the chart.

D. Extract the chart summary

Send the same image again with the tag <chart2summary> to get a short text description of the chart's main trends.

E. Extract the table structure as JSON

Send the page image with the tag <tables_json> to receive rows, columns, and merges as structured JSON.

F. Extract key‑value pairs for a summary box

Send the page image with a short schema prompt that names the fields to pull from the summary box, and read the returned JSON object.

G. Integrate into an ETL pipeline
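Tied together, the extraction steps become one small function per page. This is an illustrative sketch only: send_request stands in for whatever client you use (for example, the OpenAI client pointed at the local vLLM server), and the field names in the schema prompt are made up for this example.

```python
import json

def extract_report_page(image_bytes, send_request):
    """Run chart, table, and key-value extraction on one page image.
    send_request(image_bytes, prompt) -> reply text."""
    results = {}
    results["chart_csv"] = send_request(image_bytes, "<chart2csv>")
    results["chart_summary"] = send_request(image_bytes, "<chart2summary>")
    results["tables"] = send_request(image_bytes, "<tables_json>")
    kvp_prompt = ("Extract the following fields: revenue, net_income, "
                  "reporting_period. Return JSON.")
    results["summary_box"] = json.loads(send_request(image_bytes, kvp_prompt))
    return results
```

From here, an ETL job can write chart_csv to flat files, load tables into a database, and index summary_box fields for search.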
This workflow mirrors how Granite 4.0 3B Vision was designed to operate: as a local, structured extraction engine for enterprise documents.
Granite 4.0 3B Vision offers a focused answer to chart, table, and form extraction needs in enterprise documents. The open Apache 2.0 license and vLLM integration give teams a clear path to private, on‑prem deployments.
For organizations that handle many complex PDFs and scanned forms, Granite 4.0 3B Vision is a strong candidate for the extraction layer in a broader document AI stack.
Training and documentation focus on English, and most benchmarks use English documents. It may work on other languages, but accuracy can drop, so careful evaluation is important.
The model includes strong OCR‑related abilities on tables, charts, and forms, similar to earlier Granite Vision models that score well on OCRBench. For simple text‑only scans, a classic OCR tool may still be faster, but Granite 4.0 3B Vision shines when layout and structure matter.
Community reports and third‑party tests suggest that GPUs with 8–12 GB VRAM, such as RTX 3060‑class cards, can run Granite 4.0 3B Vision at practical speeds. Larger GPUs improve throughput and batch size, especially in production.
Research on Granite Vision models shows that small Granite Vision variants can approach or match much larger proprietary models on document tasks like DocVQA and ChartQA. Granite 4.0 3B Vision extends this trend on chart and table extraction, but large proprietary models still lead on broad general‑purpose multimodal reasoning.
Granite 4.0 3B Vision is currently available first as an open model on Hugging Face. For teams that want cloud hosting, using Granite Vision 3.x on watsonx is the closest managed alternative until Granite 4.0 appears as a hosted option.