ByteDance has released Dolphin v2, a revolutionary open-source universal document parsing model that represents a significant leap forward in document understanding technology.
Unlike generic vision language models, Dolphin v2 is specifically engineered for extracting structured data from documents—whether they're clean digital PDFs or distorted photographed scans.
With a 14.78-point (19.8%) gain in overall benchmark score over its predecessor and support for 21 element categories (up from 14), this lightweight 3-billion-parameter model built on Qwen2.5-VL delivers enterprise-grade parsing capabilities at near-zero cost.
For developers, content creators, and digital teams managing high-volume document processing, Dolphin v2 offers a compelling open-source alternative to expensive cloud-based solutions like AWS Textract and Google Document AI.
Dolphin v2 is an enhanced universal document parsing model designed to transform unstructured document images into structured, machine-readable data.
Unlike traditional OCR systems that focus purely on text recognition, Dolphin v2 understands document layout, element relationships, and reading order while simultaneously extracting text, formulas, tables, and code blocks with remarkable precision.
The model operates on a document-type-aware two-stage architecture that distinguishes between digitally-born PDFs (clean, perfect geometry) and photographed documents (with realistic distortions, skewing, and perspective changes).
This differentiation allows Dolphin v2 to apply optimized parsing strategies for each document type, resulting in superior accuracy across diverse real-world scenarios.
| Specification | Details |
|---|---|
| Base Architecture | Qwen2.5-VL-3B with Native Resolution Vision Transformer (NaViT) |
| Model Size | 3 billion parameters |
| Parameter Count vs Original | Increased from previous version for enhanced capability |
| Vision Encoder | NaViT (Native Resolution Vision Transformer) |
| Output Decoder | Autoregressive transformer for structured generation |
| Supported Element Categories | 21 (expanded from 14 in original Dolphin) |
| Output Formats | JSON, Markdown, HTML |
| Hardware Requirement (GPU) | 8-12 GB VRAM (tested on RTX 6000 48GB) |
| Processing Speed | ~0.1729 FPS (nearly 2× faster than comparable models) |
| Open Source | Yes, available on Hugging Face |
| License Type | Free for research and commercial use |
Dolphin v2's performance gains are substantial and measurable across every critical dimension:
| Metric | Dolphin v2 Score | Improvement vs Original | Benchmark Details |
|---|---|---|---|
| Overall Score | 89.45 | +14.78 points (+19.8%) | Comprehensive multi-task evaluation |
| Text Recognition (Edit Distance) | 0.054 | ↓ from 0.125 (-56.8%) | Lower is better; measures character-level accuracy |
| Formula Parsing (CDM) | 86.72 | ↑ from 67.85 (+27.8%) | Character Detection Matching; LaTeX generation |
| Table Structure (TEDS) | 87.02 | ↑ from 68.70 (+26.7%) | Tree Edit Distance Similarity for table cells |
| Table Structure (TEDS-S) | 90.48 | Significant improvement | Structural correctness metric |
| Reading Order (Edit Distance) | 0.054 | Maintains high precision | Correct element sequencing |
| Processing Speed | 0.1729 FPS | ~2× faster | Frames per second; measured on standard hardware |
A text recognition edit distance of 0.054 corresponds to roughly 5 character-level edits per 100 characters on average, down from about 12 to 13 per 100 for the original Dolphin (0.125).
The 87.02 TEDS score for table extraction is a tree-similarity measure rather than a percentage of correct cells: it scores how closely the predicted table tree (cell spanning, row/column relationships, and cell content) matches the ground truth, with 100 meaning an exact match. This matters most for financial documents, invoices, and research tables.
In this intelligent first stage, Dolphin v2 performs three simultaneous operations:
Document Type Classification: The model instantly determines whether the input is a clean digital document or a photographed/scanned version with distortions, shadows, or perspective skew. This classification triggers different optimization pathways in Stage 2.
Layout Analysis: Dolphin v2 analyzes the entire page to identify logical element boundaries and spatial relationships. Rather than processing text line-by-line, it understands document structure.
Reading Order Generation: Elements are sequenced in natural reading order (top-to-bottom, left-to-right for English), which is essential for maintaining semantic coherence when extracting from multi-column layouts.
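As a rough illustration of what reading order means on a multi-column page, the sketch below sorts bounding boxes column by column and then top to bottom. This is a simple geometric heuristic, not Dolphin v2's learned sequencing; the function name `reading_order`, the `(x0, y0, x1, y1)` box layout, and the equal-width column assumption are all hypothetical.

```python
def reading_order(boxes, page_width, columns=2):
    """Sort (x0, y0, x1, y1) boxes column by column, top to bottom in each column.

    Assumes equal-width columns; full-width elements such as titles are not
    handled specially and would be assigned to whichever column holds their center.
    """
    col_width = page_width / columns

    def key(box):
        x0, y0, x1, _ = box
        center_x = (x0 + x1) / 2
        # Clamp so a box touching the right edge still lands in the last column.
        col = min(int(center_x // col_width), columns - 1)
        return (col, y0, x0)

    return sorted(boxes, key=key)


boxes = [
    (310, 50, 590, 100),   # right column, top
    (10, 200, 290, 250),   # left column, bottom
    (10, 50, 290, 100),    # left column, top
    (310, 200, 590, 250),  # right column, bottom
]
ordered = reading_order(boxes, page_width=600)
```

A naive top-to-bottom sort would interleave the two columns here, which is why layout-aware sequencing matters for multi-column documents.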
The second stage applies type-specific parsing strategies:
For Digital Documents (PDFs): Employs element-wise parallel parsing—the model processes multiple document elements simultaneously, dramatically reducing inference time. Type-specific prompts guide extraction for text, tables, formulas, and code blocks independently.
For Photographed Documents: Uses holistic page-level parsing that considers the entire page context, accounting for perspective distortion, lighting variations, and partial occlusion. This approach is computationally more intensive but handles real-world degradation better.
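The dispatch between the two Stage 2 strategies can be pictured with a small sketch. Everything here is illustrative: `Element`, `parse_element`, `parse_page`, and the prompt strings are hypothetical stand-ins, and the real model calls are replaced by plain dictionaries.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Element:
    label: str   # e.g. "para", "tab", "formula", "code"
    bbox: tuple  # (x0, y0, x1, y1) in absolute pixels


# Hypothetical type-specific prompts; the actual prompt text is not given here.
PROMPTS = {
    "tab": "Parse the table.",
    "formula": "Read the formula as LaTeX.",
    "code": "Read the code block.",
}
DEFAULT_PROMPT = "Read the text."


def parse_element(element):
    """Stand-in for one Stage 2 model call on a single cropped element."""
    prompt = PROMPTS.get(element.label, DEFAULT_PROMPT)
    return {"label": element.label, "bbox": element.bbox, "prompt": prompt}


def parse_page(doc_type, elements):
    """Choose a parsing strategy from the Stage 1 document type."""
    if doc_type == "digital":
        # Element-wise parsing: elements are independent, so calls can overlap.
        # pool.map returns results in input order, preserving reading order.
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(parse_element, elements))
    if doc_type == "photographed":
        # Holistic parsing: a single call that sees the whole page.
        return [{"label": "page", "bbox": None, "prompt": "Parse the whole page."}]
    raise ValueError(f"unknown document type: {doc_type!r}")
```

The design point is that clean pages can be split into independent crops and parsed in parallel, while distorted pages keep their full context at the cost of one larger call.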
Specialized Parsing Modules:
Dolphin v2's expanded element support represents a fundamental improvement in document parsing capability:
| Element Type | Use Case | Format Output |
|---|---|---|
| Paragraph (para) | Body text, descriptions, content blocks | Plain text |
| Heading (head) | Section titles, document headings | Hierarchical markup |
| Title (title) | Document titles, main headings | Formatted text |
| Subheading (subhead) | Section subdivisions | Structured text |
| Table of Contents (catalogue) | TOC entries, navigation | Hierarchical list |
| Table (tab) | Data tables, comparison matrices | HTML with cell structure |
| Lists (list) | Ordered/unordered lists, bullet points | HTML list markup |
| Code Blocks (code) | Program code, technical snippets | Plain text with indentation |
| Formulas (formula) | Mathematical equations, notation | LaTeX (`\(...\)`) |
| Figures (fig) | Images, diagrams, charts | Bounding box coordinates |
| Captions (cap) | Figure captions, image labels | Associated text |
| Footnotes (fnote) | Reference notes, citations | Linked annotations |
| References (reference) | Bibliography, citations | Structured list |
| Headers/Footers (header/foot) | Page headers, footers | Marginal content |
| Watermarks (watermark) | Document watermarks | Detection and removal |
| Annotations (anno) | Handwritten notes, highlights | Localized content |
| Page Number (page_num) | Page numbering information | Numerical value |
| Footnote/Endnote Ref (fnote_ref) | Superscript references | Linked indicators |
| Key-Value Pairs (implicit) | Form fields, structured data | JSON key-value format |
| Metadata (implicit) | Author, dates, document properties | Structured fields |
This comprehensive categorization enables Dolphin v2 to handle diverse document types—academic papers, financial invoices, legal contracts, technical documentation, and more—without requiring model retraining or specialized variants.
Input Document: A technical PDF containing chapter 7 ("The Zeta Function and Prime Number Theorem") with mixed mathematical formulas, paragraphs, and code references.
Results:
GPU Memory Usage: 8.65 GB during inference
Processing Time: ~3-4 seconds for full-page extraction
Input Document: A data table comparing machine learning methods with columns for "Method", "Error %", and performance metrics.
Results:
Output Quality: 87%+ TEDS score on complex tables
Input Document: An AI-generated Indonesian driving license (PDF) with structured layout, photos, and organized fields.
Results:
Processing Speed: ~2 seconds (faster than page-level parsing due to simpler structure)
Input Document: A commercial invoice with line items, totals, and company details.
Results:
Accuracy: Spot-on for all critical fields
| Component | Requirement |
|---|---|
| GPU VRAM | 8 GB minimum (12 GB recommended) |
| System RAM | 16 GB minimum (32 GB for batch processing) |
| Storage | 10 GB free space for model weights |
| Python | 3.9 or higher |
| CUDA | 11.8 or higher (for NVIDIA GPUs) |
| GPU Supported | NVIDIA CUDA-compatible, AMD ROCm |
Step 1: Clone the Repository
```bash
git clone https://github.com/bytedance/Dolphin.git
cd Dolphin
```
Step 2: Create and Activate Conda Environment
```bash
mamba env create --file conda-env.yml
conda activate dolphin-env
```
Step 3: Install Dependencies and Dolphin
```bash
pip install -r requirements.txt
python -m pip install .
```
Step 4: Download Model Weights from Hugging Face
```bash
# Option 1: Via Hugging Face CLI
huggingface-cli download ByteDance/Dolphin-v2 --local-dir ./hf_model
```
```python
# Option 2: Via Python snapshot_download
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ByteDance/Dolphin-v2", local_dir="./hf_model")
```
Single Page Parsing:
```bash
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path /path/to/document.png
```
Batch Processing Multiple Documents:
```bash
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./documents_folder --max_batch_size 8
```
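If you drive the demo script from Python rather than a shell, a small helper keeps the arguments in one place. This is a sketch that only assembles the command shown above; `build_demo_command` is a hypothetical name, and it assumes you run it from the cloned repository so that `demo_page.py` is on the current path.

```python
import shlex


def build_demo_command(input_path, model_path="./hf_model",
                       save_dir="./results", max_batch_size=None):
    """Assemble the demo_page.py invocation as an argument list."""
    cmd = [
        "python", "demo_page.py",
        "--model_path", str(model_path),
        "--save_dir", str(save_dir),
        "--input_path", str(input_path),
    ]
    # Only pass the batch-size flag when the caller asks for one.
    if max_batch_size is not None:
        cmd += ["--max_batch_size", str(max_batch_size)]
    return cmd


cmd = build_demo_command("./documents_folder", max_batch_size=8)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when document paths contain spaces.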
Output Files Generated:
- page.json - Structured JSON representation with all extracted elements
- page.md - Markdown-formatted output for human readability
- page_layout.html - Visual layout diagram showing element positions
- figures/ - Directory containing extracted images
- elements/ - Directory with individual element extraction details

For systems with limited VRAM, adjust batch size:
```bash
--max_batch_size 4  # Reduces memory consumption
```
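On a multi-GPU machine, pin inference to a single device with `CUDA_VISIBLE_DEVICES`. This variable only controls which GPUs the process can see; it does not by itself enable CPU offloading:
```bash
# Expose only GPU 0 to the process
export CUDA_VISIBLE_DEVICES=0
```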
| Feature | Dolphin v2 | AWS Textract | Google Doc AI | LLaMA 3.2 Vision | Claude 3.5 Vision |
|---|---|---|---|---|---|
| Deployment | Open-source, local/cloud | AWS cloud only | Google Cloud only | Open-source, local | Proprietary API |
| Pricing | FREE | $1.50/1000 pages | $1.50/1000 pages | FREE (self-hosted) | $0.003 per image |
| Field Accuracy | 98%+ | 78% | 82% | 85-90% | 92% |
| Table Extraction | 87%+ TEDS | 82% | 70-75% | 65-75% | 80%+ |
| Formula Recognition | 86%+ CDM | Limited | Minimal | Moderate | Good |
| Code Block Parsing | Dedicated module | No | No | Moderate | Moderate |
| Processing Speed | 0.1729 FPS | 0.05 FPS | 0.08 FPS | Variable | Variable |
| Element Categories | 21 types | ~8 types | ~6 types | General categories | General categories |
| Specialized Modules | Yes (4 modules) | Integrated approach | Single pipeline | General VLM | General VLM |
| Privacy/Data Control | Local inference ✅ | Cloud processing | Cloud processing | Local or cloud | Cloud only |
| Custom Fine-tuning | Supported | Limited | Not user-accessible | Supported | Not accessible |
| Integration Complexity | Moderate | High (AWS SDK) | Moderate (GCP) | Moderate | Low (API) |
| Learning Curve | Steep (technical) | Moderate | Moderate | Steep | Low |
| Multi-language Support | English, Chinese | 140+ languages | Multiple | Multiple | Multiple |
| Batch Processing | Parallel/efficient | Sequential | Sequential | Flexible | Sequential |
| Free Trial | Yes (full features) | $100 credit | Free tier limited | N/A | N/A |
1. Cost-Effectiveness: Dolphin v2 is completely free and open-source. Process unlimited documents without paying per-page fees. For enterprises processing millions of pages annually, this represents 60-80% cost savings compared to AWS or Google.
2. Data Privacy: Run document parsing entirely on-premises without sending data to cloud services. Ideal for healthcare, legal, and financial institutions with strict data residency requirements.
3. Speed and Efficiency: At 0.1729 FPS, Dolphin v2 processes documents nearly 2× faster than comparable models while maintaining superior accuracy. The parallel processing architecture enables efficient batch processing.
4. Specialized Expertise: Unlike general vision language models that treat document parsing as just one capability, Dolphin v2's architecture is purpose-built for document understanding. Dedicated modules for formulas, code, tables, and paragraphs demonstrate this specialization.
5. Element Precision with Absolute Coordinates: Dolphin v2 uses absolute pixel coordinates for spatial localization, enabling precise bounding box extraction and downstream processing tasks.
6. No Vendor Lock-in: Being open-source under a permissive license, organizations maintain full control. No dependency on API availability, pricing changes, or policy modifications.
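To make the point about absolute pixel coordinates (item 5 above) concrete, here is a minimal sketch of cropping an element out of a page image from its bounding box. The `(x0, y0, x1, y1)` layout with exclusive right and bottom edges is an assumption for illustration, as are the helper names; the image is a plain list of pixel rows so the example runs without any imaging library.

```python
def clamp_bbox(bbox, width, height):
    """Clip an (x0, y0, x1, y1) box to the image and order its corners."""
    x0, y0, x1, y1 = bbox
    x0, x1 = sorted((max(0, min(x0, width)), max(0, min(x1, width))))
    y0, y1 = sorted((max(0, min(y0, height)), max(0, min(y1, height))))
    return (x0, y0, x1, y1)


def crop(image, bbox):
    """Return the sub-image covered by bbox; image is a list of pixel rows."""
    height = len(image)
    width = len(image[0]) if image else 0
    x0, y0, x1, y1 = clamp_bbox(bbox, width, height)
    return [row[x0:x1] for row in image[y0:y1]]


# A 4-row by 5-column test image whose pixel value encodes its position.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
patch = crop(image, (1, 1, 3, 3))
```

Because the coordinates are absolute, no knowledge of the model's internal resizing is needed to go from an extracted element back to the source pixels.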
Choose AWS Textract if:
Choose Google Document AI if:
Choose LLaMA 3.2 Vision if:
Choose Claude 3.5 Vision if:
Invoice and Receipt Processing: Automatically extract vendor information, line items, amounts, and tax data from supplier invoices for automated accounts payable workflows. Dolphin v2's accurate table extraction (87% TEDS) ensures correct line-item parsing even from multi-currency or complex invoices.
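Illustrative Example: At the $1.50 per 1,000 pages rate used in the comparison above, a mid-sized manufacturing company processing 50,000 invoices monthly would pay about $75 per month (roughly $900 per year) in AWS Textract fees. Self-hosting Dolphin v2 removes that per-page charge, so the net saving depends on GPU and operations costs.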
Contract Processing: Extract key contract terms, effective dates, parties, payment amounts, and special conditions from legal documents. The reading order precision ensures related information stays connected during extraction.
Regulatory Reporting: Automate extraction of structured data from compliance documents, financial statements, and regulatory filings.
Clinical Document Processing: Extract patient information, diagnoses, medications, and test results from medical records while maintaining HIPAA compliance through on-premise processing.
Insurance Claims: Automatically parse claim forms, medical records, and supporting documentation to accelerate claims processing.
Research Paper Processing: Extract research papers' structural elements—abstract, methodology, results, references—with dedicated formula recognition for mathematical content. Ideal for building academic databases and literature management systems.
Grade Sheets and Academic Records: Parse student records, transcripts, and grading documents with high accuracy.
Product Information Extraction: Parse product specification sheets, technical documentation, and supplier catalogs into structured formats for e-commerce catalogs.
Receipt Processing: Extract purchase details from digital and scanned receipts for expense tracking and business intelligence.
Property Documentation: Process lease agreements, property listings, inspection reports, and architectural plans.
Document Verification: Extract and verify key information from property deeds and land records.
Unlike traditional document parsing approaches that apply uniform strategies to all documents, Dolphin v2 intelligently detects whether a document is digitally-born or photographed, then applies optimized parsing logic. This architectural innovation directly translates to:
Competitive Advantage: No other open-source solution offers this degree of document-type intelligence. AWS Textract applies the same approach but charges per page.
Dolphin v2's 21 element categories aren't just enumeration—they're backed by specialized parsing modules:
This comprehensive categorization reduces the need for post-processing and model chaining.
Dolphin v2 changes the economics of document parsing. At the $1.50 per 1,000 pages rate quoted above, an organization processing 100,000 pages monthly would spend about $150 per month with AWS Textract. With Dolphin v2:
The 0.1729 FPS processing speed, enabled by parallel element parsing, means:
Benchmark results demonstrate Dolphin v2's superiority on specialized content:
Edit distance measures the minimum number of single-character edits (insertions, deletions, replacements) needed to transform recognized text into ground truth.
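The definition above can be sketched in a few lines of Python. The dynamic-programming routine is the standard Levenshtein algorithm; dividing by the longer string's length is one common normalization and is an assumption here, since the exact normalization behind the reported 0.054 is not stated.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, truth):
    """Edit distance scaled to [0, 1] by the longer string's length (assumed convention)."""
    if not pred and not truth:
        return 0.0
    return levenshtein(pred, truth) / max(len(pred), len(truth))


score = normalized_edit_distance("Dolphin v2 parses documnets", "Dolphin v2 parses documents")
```

Lower is better: 0 means the recognized text matches the ground truth exactly, and 1 means nothing lines up.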
Interpretation:
CDM is specialized for mathematical formula evaluation, considering both character-level accuracy and structural correctness.
Dolphin v2 Score: 86.72 (out of 100)
TEDS evaluates table parsing on two dimensions:
Dolphin v2 Score: TEDS 87.02, TEDS-S 90.48
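For reference, TEDS as it is commonly defined in the table-recognition literature treats the predicted table and the ground-truth table as HTML trees and normalizes their tree edit distance. In the formula below, T_a and T_b are the two trees, EditDist is the tree edit distance, and |T| is the number of nodes in a tree. Scores in this article are on a 0 to 100 scale, which corresponds to multiplying the value by 100.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

TEDS-S applies the same formula to the table structure only, ignoring cell content, which is why it is typically the higher of the two numbers.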
Dolphin v2: 0.1729 FPS
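A throughput figure in frames per second is easier to reason about as seconds per page. The short calculation below assumes one frame corresponds to one document page, which the benchmark description does not spell out.

```python
def throughput(fps):
    """Convert frames per second to (seconds per page, pages per hour)."""
    seconds_per_page = 1.0 / fps
    pages_per_hour = fps * 3600
    return seconds_per_page, pages_per_hour


sec, per_hour = throughput(0.1729)
print(f"{sec:.2f} s/page, {per_hour:.0f} pages/hour")  # 5.78 s/page, 622 pages/hour
```

At this rate a single GPU handles on the order of 15,000 pages per day when run continuously.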
Unlike traditional OCR that provides character positions, Dolphin v2 outputs absolute pixel coordinates for all extracted elements. This enables:
Trained on diverse multilingual corpora:
For Digital Documents:
For Photographed Documents:
Generate extraction in multiple formats suited to downstream processing:
```bash
--max_batch_size 8  # Process 8 elements simultaneously
--num_workers 4     # Use 4 CPU workers for I/O
```
Enables efficient processing of document collections with optimal GPU utilization.
Challenge: While overall performance is strong, Dolphin v2 shows occasional inconsistency on:
Mitigation: Preprocess documents to improve contrast and straighten pages using image enhancement techniques.
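As one example of the contrast preprocessing mentioned above, a min-max stretch rescales a washed-out grayscale scan to use the full 0 to 255 range. This is a generic image-processing step rather than anything specific to Dolphin v2, and the plain list-of-rows image format is chosen only to keep the sketch dependency-free; straightening a skewed page needs a proper imaging library and is not covered here.

```python
def stretch_contrast(gray):
    """Min-max contrast stretch for a grayscale image given as rows of 0-255 ints."""
    pixels = [p for row in gray for p in row]
    if not pixels:
        return []
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A flat image has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in gray]
    span = hi - lo
    # Integer arithmetic keeps results exact and within 0..255.
    return [[(p - lo) * 255 // span for p in row] for row in gray]


faded = [[100, 150], [125, 200]]
enhanced = stretch_contrast(faded)
```

In practice you would apply the same idea with an imaging library before passing the page to the parser.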
Challenge: The model lacks built-in confidence scores for extraction reliability. Developers can't programmatically determine which extractions to trust vs. manually review.
Mitigation: Implement custom confidence scoring by comparing extracted content against bounding box context.
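One way to approximate such a score is a plausibility check between the amount of extracted text and the size of its bounding box. The sketch below flags elements whose text density is far from the page median. It is a heuristic triage signal, not a calibrated confidence, and the element dictionary layout (`text`, `bbox`), the function name, and the default tolerance are all hypothetical.

```python
from statistics import median


def flag_suspicious(elements, tolerance=3.0):
    """Return indices of elements whose characters-per-pixel density is more than
    `tolerance` times above or below the page median."""
    densities = []
    for el in elements:
        x0, y0, x1, y1 = el["bbox"]
        area = max(1, (x1 - x0) * (y1 - y0))  # guard against zero-area boxes
        densities.append(len(el["text"]) / area)

    typical = median(densities) if densities else 0
    if typical == 0:
        return []

    flagged = []
    for i, density in enumerate(densities):
        ratio = density / typical
        if ratio > tolerance or ratio < 1 / tolerance:
            flagged.append(i)
    return flagged


page = [
    {"text": "x" * 100, "bbox": (0, 0, 100, 100)},
    {"text": "x" * 110, "bbox": (0, 100, 100, 200)},
    {"text": "x" * 95, "bbox": (0, 200, 100, 300)},
    {"text": "xx", "bbox": (0, 300, 100, 400)},  # large box, almost no text
]
to_review = flag_suspicious(page)
```

Flagged elements can then be routed to manual review while the rest pass through automatically.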
Challenge: Dolphin v2 doesn't specialize in handwritten text extraction, limiting applicability in documents like:
Mitigation: Use alternative models for handwritten content, then post-process combined results.
Challenge: Optimal performance requires a dedicated NVIDIA GPU (8 GB+ VRAM). CPU-only inference is possible but considerably slower than the 0.1729 FPS measured on GPU.
Mitigation: Use GPU rental services for batch processing without capital investment.
Challenge: Extremely complex documents with nested tables within tables, sidebars with their own tables, or multi-level headers occasionally confuse the reading order.
Mitigation: Validate extraction through sampling and implement feedback loops for high-stakes applications.
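Sampling for validation can be as simple as drawing a fixed fraction of processed documents for manual review. A minimal sketch, with a hypothetical function name and an arbitrary 5% default rate:

```python
import random


def sample_for_review(doc_ids, rate=0.05, seed=0, minimum=1):
    """Pick a reproducible random subset of documents for manual checking."""
    doc_ids = list(doc_ids)
    if not doc_ids:
        return []
    k = min(len(doc_ids), max(minimum, round(len(doc_ids) * rate)))
    # A seeded generator makes the audit sample repeatable.
    return random.Random(seed).sample(doc_ids, k)


batch = [f"invoice_{i:04d}.pdf" for i in range(200)]
review_set = sample_for_review(batch)
```

Tracking the error rate on each review sample over time gives an early warning when a new document layout starts to confuse the reading order.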
Challenge: While Dolphin v2 supports fine-tuning for custom domains, comprehensive guides for domain-specific adaptation are sparse.
Mitigation: Community contributions and documentation improvements are ongoing.
| Solution | Per-Page Cost | Annual Cost | Infrastructure | Setup Cost | Total Y1 |
|---|---|---|---|---|---|
| Dolphin v2 (Self-hosted) | $0 | $0 | GPU rental (~$200/month) | $500 | $2,900 |
| Dolphin v2 (On-premise) | $0 | $0 | Hardware (amortized) | $3,000 | $3,000 |
| AWS Textract | $1.50/1k | $150 | AWS account | $100 | $250 |
| Google Document AI | $1.50/1k | $150 | GCP account | $100 | $250 |
| Azure Document Intelligence | $1.50/1k | $150 | Azure account | $100 | $250 |
| Claude 3.5 Vision | $0.003/image | $300 | API access | $50 | $350 |
Break-even Analysis:
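A minimal sketch of the break-even arithmetic, using only the first-year figures from the table above: self-hosting carries a fixed cost with no per-page fee, while the cloud services have a small setup cost plus $1.50 per 1,000 pages. The function names are hypothetical, and staffing and maintenance costs are left out.

```python
def first_year_cost(pages, per_page=0.0, fixed=0.0):
    """Total first-year cost for a given annual page volume."""
    return fixed + pages * per_page


def break_even_pages(fixed_self_hosted, fixed_cloud, cloud_per_page):
    """Annual volume at which self-hosting and a per-page service cost the same."""
    return (fixed_self_hosted - fixed_cloud) / cloud_per_page


SELF_HOSTED_FIXED = 2900      # GPU rental row: $200/month * 12 + $500 setup
CLOUD_SETUP = 100             # setup cost listed for the cloud services
CLOUD_PER_PAGE = 1.50 / 1000  # $1.50 per 1,000 pages

pages = break_even_pages(SELF_HOSTED_FIXED, CLOUD_SETUP, CLOUD_PER_PAGE)
print(f"{pages:,.0f} pages/year")  # 1,866,667 pages/year
```

With these numbers, rented-GPU self-hosting becomes cheaper than per-page pricing above roughly 1.9 million pages per year, or about 156,000 pages per month; below that volume the per-page services cost less in the first year.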
| Solution | Free Tier | Free Limit | Free Trial Duration |
|---|---|---|---|
| Dolphin v2 | Unlimited | Unlimited (self-hosted) | Permanent |
| AWS Textract | Yes | 100 pages/month | 12 months |
| Google Document AI | Limited | 200 calls/month | — |
| Azure Document Intelligence | Limited | 200 calls/month | Free tier available |
| Claude 3.5 Vision | API only | $5 free credits | Varies by signup |
Symptom: torch.cuda.OutOfMemoryError: CUDA out of memory
Solution:
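```bash
# Reduce batch size
--max_batch_size 2
# Debugging aid, not a memory fix: makes CUDA kernel launches synchronous
# so the error is reported at the call that actually failed
export CUDA_LAUNCH_BLOCKING=1
# Use memory-efficient attention, if your version of the script exposes this flag
--use_flash_attention 2
```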
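The batch-size advice can also be automated by retrying with a smaller batch whenever an out-of-memory error occurs. The sketch below is generic: `run` is any callable you supply that processes documents at a given batch size, and `run_with_backoff` is a hypothetical helper. It relies on PyTorch's out-of-memory error being a `RuntimeError` whose message contains "out of memory".

```python
def run_with_backoff(run, batch_size, min_batch_size=1):
    """Call run(batch_size), halving the batch size after each out-of-memory error.

    Returns (result, batch_size_that_worked). Other errors are re-raised unchanged.
    """
    while True:
        try:
            return run(batch_size), batch_size
        except RuntimeError as exc:
            is_oom = "out of memory" in str(exc).lower()
            if not is_oom or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


def fake_run(batch_size):
    """Stand-in for a real parsing call; pretends only batches of 2 or fewer fit."""
    if batch_size > 2:
        raise RuntimeError("CUDA out of memory")
    return "ok"


result, used = run_with_backoff(fake_run, batch_size=8)
```

Starting high and backing off lets the same script run on both large and small GPUs without hand-tuning.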
Symptom: Interrupted download from Hugging Face
Solution:
bash# Resume download with cache
huggingface-cli download ByteDance/Dolphin-v2 --local-dir ./hf_model --resume-download# Alternative: Manual download and extract https://huggingface.co/ByteDance/Dolphin-v2/resolve/main/model.safetensors
wget
Symptom: Incorrectly parsed table structure or missing cells
Solution:
--table_format html vs. json
Symptom: Garbled output for non-English documents
Solution:
```bash
# Explicitly specify language
--language zh  # For Chinese
--language ja  # For Japanese
```
Based on ByteDance's GitHub repository and community feedback, anticipated improvements include:
ByteDance Dolphin v2 has arrived at a critical inflection point in document parsing technology. It democratizes enterprise-grade document understanding by making sophisticated, specialized capabilities freely available to anyone with moderate GPU resources.
For Individual Developers: Dolphin v2 provides a powerful, cost-free tool for building document processing features without vendor dependency or per-page fees.
For Startups: Building a document-centric SaaS business becomes economically viable. Infrastructure costs shift from per-customer API fees to one-time GPU investment.
For Enterprises: The combination of superior accuracy, complete data privacy, and dramatically lower costs justifies migration from cloud-based solutions despite increased operational complexity.
For Researchers: The open-source nature and modular architecture create opportunities for academic contributions and domain-specific optimizations.
The 14.78-point gain in overall benchmark score over its predecessor, combined with expanded element categories, demonstrates ByteDance's commitment to continuous refinement. While challenges exist (handwriting recognition, confidence scoring, complex nested structures), the roadmap suggests active development addressing known limitations.
ByteDance Dolphin v2 is an open-source universal document parsing model designed to extract structured data such as text, tables, code blocks, and formulas from PDFs and document images with high accuracy. It uses a document-type-aware two-stage architecture that first classifies the document type and layout, then applies specialized parsing modules for different element categories.
Dolphin v2 delivers very high accuracy across key document understanding tasks, including near-OCR-level text accuracy, strong table structure recognition, and reliable formula parsing. Its benchmark scores place it ahead of many generic vision-language models, making it suitable for production-grade use in finance, legal, and other data-sensitive industries.
For many structured document use cases, Dolphin v2 offers competitive or superior accuracy while giving you full control through local or self-hosted deployment. Unlike AWS Textract and Google Document AI, it does not charge per page, which can significantly reduce costs at scale, especially for startups and enterprises processing large document volumes.
To run Dolphin v2 efficiently, a modern NVIDIA GPU with at least 8–12 GB of VRAM is recommended, along with 16–32 GB of system RAM. While CPU-only inference is possible, it is much slower, so teams aiming for high throughput or batch processing will benefit from dedicated GPU hardware or cloud GPU instances.
Dolphin v2 is ideal for developers, SaaS builders, and enterprises that need accurate, large-scale document parsing without relying on third-party cloud APIs. Popular use cases include invoice and receipt extraction, contract and legal document analysis, medical and insurance document processing, research paper parsing, and large-scale PDF-to-structured-data conversion.
Rating: 9.2/10
ByteDance Dolphin v2 stands as the most compelling open-source document parsing solution available today. Its combination of specialized architecture, impressive benchmarks, zero cost, data privacy benefits, and rapid processing speed makes it the go-to choice for organizations serious about document automation.
The learning curve and infrastructure requirements prevent a perfect score, but for technical teams with GPU access, Dolphin v2 is unquestionably the superior choice over cloud-based alternatives.
Recommended For: Technical teams, document-heavy startups, enterprises with large-scale processing needs, research institutions, and organizations prioritizing data privacy.
Not Recommended For: Non-technical users, organizations with exclusively handwritten document workflows, or those requiring 99.99% SLA guarantees and enterprise support.