Baidu

Run Baidu Unlimited-OCR Locally: Transformers, vLLM & SGLang (2026 Guide)

Baidu's Unlimited-OCR parses entire multi-page PDFs in a single forward pass. Here's how to run the 3.3B open-weights model locally with Transformers, vLLM, or SGLang.

Published 05 Jul 2026 • Updated 05 Jul 2026 • 4 min read

Quick answer. Unlimited-OCR is Baidu's MIT-licensed 3.3B vision-language OCR model, released June 22, 2026. Its headline trick is one-shot multi-page parsing: it transcribes an entire PDF in a single forward pass instead of page-by-page. Run it locally via Hugging Face Transformers, the official vLLM Docker image (vllm/vllm-openai:unlimited-ocr), or SGLang — a single 12 GB+ consumer GPU is enough.

DeepSeek-OCR started the "optical compression" wave in 2025. Baidu's Unlimited-OCR — the team explicitly says it aims to push DeepSeek-OCR one step further — is the strongest follow-up so far: over one million Hugging Face downloads in its first two weeks, a top-3 trending slot, and one of the most upvoted r/LocalLLaMA threads of late June 2026.

This guide covers what the model actually does differently, the hardware you need, and three working local setups: plain Transformers, vLLM, and SGLang.

What is Unlimited-OCR?

Unlimited-OCR is a 3.3B-parameter image-text-to-text model from Baidu, released under the MIT license on June 22, 2026, with the paper (arXiv 2606.23050, "Unlimited OCR Works") following a day later. It converts document images — scans, screenshots, photographed pages, handwriting — into clean structured text and Markdown/LaTeX.

The release timeline moved unusually fast, which tells you how much backing it has:

June 22: weights on Hugging Face (MIT license)
June 23: paper on arXiv + ModelScope mirror
June 24: free Hugging Face Spaces demo
June 28: official vLLM support with prebuilt Docker images
July 3: hosted API on Baidu Cloud

Why is one-shot multi-page parsing a big deal?

Every practical OCR pipeline until now — DeepSeek-OCR included — processes documents page-by-page: split the PDF, run the model N times, stitch the outputs back together. That loses cross-page context (tables that continue across a page break, footnotes, running headers) and multiplies overhead.

Unlimited-OCR's "long-horizon parsing" mode accepts dozens of page images in a single forward pass and emits one coherent transcript. Community tests have shown it turning a handwritten calculus exam into clean LaTeX and parsing complex multi-page PDFs in one go — including non-Latin scripts (it's genuinely multilingual, with users confirming solid Cyrillic support).

What hardware do you need?

The weights are about 6.7 GB in bfloat16, so this is one of the easiest "serious" models to self-host in 2026:

Comfortable: any 12 GB+ NVIDIA card (RTX 3060 12GB, 4070, and up) — weights plus vision tokens and a 32K context fit with room to spare
Tested stack: Python 3.12, CUDA 12.9, torch 2.10.0, transformers 4.57.1
Batch/server use: the vLLM image targets CUDA 13.0 by default, with a separate -cu129 image for Hopper GPUs

How do you run it with Transformers?

The fastest path to a first result. Install the pinned requirements from the model card, then:

import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'baidu/Unlimited-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name, trust_remote_code=True,
    use_safetensors=True, torch_dtype=torch.bfloat16,
).eval().cuda()

# Single image — 'gundam' config (best quality/speed balance)
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='out/',
    base_size=1024, image_size=640, crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=128,
    save_results=True,
)

For multi-page documents, switch to infer_multi with the base config (image_size=1024, ngram_window=1024) and pass a list of page images. For PDFs, convert pages to PNGs first with PyMuPDF at 300 dpi — the model card ships a ready-made pdf_to_images() helper.

How do you serve it with vLLM?

Since June 28 there's an official recipe and prebuilt Docker images, which is the route to take for anything beyond experimentation:

# Default (CUDA 13.0)
docker pull vllm/vllm-openai:unlimited-ocr

# Hopper GPUs (CUDA 12.9)
docker pull vllm/vllm-openai:unlimited-ocr-cu129

Follow the official recipe at recipes.vllm.ai/baidu/Unlimited-OCR for launch flags. You get an OpenAI-compatible endpoint, so any existing OCR pipeline that talks to a chat-completions API can point at it directly.

How do you serve it with SGLang?

SGLang support currently ships as a dev wheel linked from the model card. Setup is uv-based:

uv venv --python 3.12 && source .venv/bin/activate
uv pip install wheel/sglang-*.whl kernels==0.11.7 pymupdf==1.27.2.2

python -m sglang.launch_server \
    --model baidu/Unlimited-OCR \
    --served-model-name Unlimited-OCR \
    --attention-backend fa3 \
    --context-length 32768 \
    --enable-custom-logit-processor \
    --host 0.0.0.0 --port 10000

One caveat: requests must attach the custom DeepseekOCRNoRepeatNGramLogitProcessor (ngram_size 35) that the model uses to suppress repetition loops — the model card has a complete streaming client example. Single images can use the higher-quality gundam image mode; multi-page parsing uses base mode with ngram_window=1024.

How does it compare to DeepSeek-OCR?

Baidu credits DeepSeek-OCR, DeepSeek-OCR-2 and PaddleOCR directly in the acknowledgements — this is an evolution of that line of work, not a rival paradigm. The practical differences:

Multi-page in one pass vs strictly page-by-page — the headline upgrade
Compact: 3.3B parameters with day-one vLLM/SGLang support
MIT license, same as DeepSeek-OCR — safe for commercial pipelines

If you're currently running DeepSeek-OCR, our DeepSeek-OCR local guide still applies to that model — but for multi-page document workloads, Unlimited-OCR is now the stronger default. For broader document-AI workloads that need UI understanding or video, see our Qwen3-VL 4B use-cases guide.

FAQ

Is Unlimited-OCR free for commercial use?

Yes. The model, code, and paper are all released under the MIT license, which permits commercial use, modification, and redistribution without a separate agreement.

Can Unlimited-OCR parse a whole PDF at once?

Yes — that's its defining feature. Convert the PDF pages to images (the model card ships a PyMuPDF helper) and pass them all to infer_multi or the vLLM/SGLang multi-image API in one request. It emits a single coherent transcript across pages.

What GPU do I need to run Unlimited-OCR locally?

The bf16 weights are roughly 6.7 GB, so a 12 GB consumer GPU (RTX 3060 12GB or better) runs it comfortably. The officially tested stack is Python 3.12 with CUDA 12.9 and torch 2.10.

Does it handle handwriting and non-English text?

Yes. It's trained multilingually — community tests confirm strong results on Cyrillic documents and even handwritten math converted to LaTeX. As with any OCR model, extremely low-quality scans still degrade output.

Is there a hosted option if I don't want to self-host?

Yes — a free Hugging Face Spaces demo for quick tests, and a hosted API on Baidu Cloud since July 3, 2026. For privacy-sensitive documents, local inference via vLLM remains the better choice.

Self-hosting more than just OCR? Our self-hosting LLMs complete guide covers the full stack — hardware sizing, serving engines, and quantization — for running open models in production.