IBM has launched Granite-Docling-258M, an open-source (Apache-2.0) vision-language model designed specifically for end-to-end document conversion. The model targets layout-faithful extraction (tables, code, equations, lists, captions, and reading order), emitting a structured, machine-readable representation rather than lossy Markdown. It is available on Hugging Face with a live demo and an MLX build for Apple Silicon.
What’s new compared to SmolDocling?
Granite-Docling is the production-ready successor to SmolDocling-256M. IBM replaced the earlier backbone with a Granite 165M language model and upgraded the vision encoder to SigLIP2 (base, patch16-512), while retaining the Idefics3-style connector (pixel-shuffle projector). The resulting model has 258M parameters and shows consistent accuracy gains across layout analysis, full-page OCR, code, equations, and tables (see metrics below). IBM also addressed instability failure modes observed in the preview model (e.g., repetitive token loops).
Architecture and training pipeline
- Backbone: Idefics3-derived stack with SigLIP2 vision encoder → pixel-shuffle connector → Granite 165M LLM.
- Training framework: nanoVLM (a lightweight, pure-PyTorch VLM training toolkit).
- Representation: outputs DocTags, an IBM-authored markup designed for unambiguous document structure (elements + coordinates + relationships), which downstream tools convert to Markdown/HTML/JSON.
- Compute: trained on IBM’s Blue Vela H100 cluster.
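The pixel-shuffle connector above can be understood as a space-to-depth rearrangement that folds each small neighborhood of vision tokens into the channel axis, shrinking the token sequence the LLM must consume. The sketch below is illustrative only; the grid size, embedding dimension, and shuffle ratio are assumptions for demonstration, not Granite-Docling’s actual configuration.

```python
import numpy as np

def pixel_shuffle(tokens: np.ndarray, ratio: int = 2) -> np.ndarray:
    """Merge each ratio x ratio neighborhood of patch embeddings into the
    channel axis, reducing the visual token count by ratio**2
    (the Idefics3-style space-to-depth projection).

    tokens: (H, W, C) grid of vision-encoder patch embeddings.
    returns: (H // ratio, W // ratio, C * ratio**2) grid.
    """
    h, w, c = tokens.shape
    assert h % ratio == 0 and w % ratio == 0, "grid must divide evenly"
    x = tokens.reshape(h // ratio, ratio, w // ratio, ratio, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group each ratio x ratio neighborhood
    return x.reshape(h // ratio, w // ratio, c * ratio * ratio)

# Example: a 32x32 grid of 768-dim patch embeddings (1024 tokens) becomes
# a 16x16 grid of 3072-dim tokens (256 tokens) before the projector/LLM.
grid = np.random.rand(32, 32, 768)
out = pixel_shuffle(grid, ratio=2)
print(out.shape)  # (16, 16, 3072)
```

The token count drops 4x while no information is discarded; the connector’s linear projection (not shown) then maps the wider channels into the LLM’s embedding space.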
Quantified improvements (Granite-Docling-258M vs. SmolDocling-256M preview)
Evaluated with docling-eval, LMMS-Eval, and task-specific datasets:
- Layout: mAP 0.27 vs. 0.23; F1 0.86 vs. 0.85.
- Full-page OCR: F1 0.84 vs. 0.80; lower edit distance.
- Code recognition: F1 0.988 vs. 0.915; edit distance 0.013 vs. 0.114.
- Equation recognition: F1 0.968 vs. 0.947.
- Table recognition (FinTabNet @ 150 dpi): TEDS-structure 0.97 vs. 0.82; TEDS with content 0.96 vs. 0.76.
- Other benchmarks: MMStar 0.30 vs. 0.17; OCRBench 500 vs. 338.
- Stability: “avoids infinite loops more effectively” (a production-oriented fix).
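The edit-distance figures above are normalized scores where lower is better. A minimal sketch of one common normalization, Levenshtein distance divided by the longer string’s length (docling-eval’s exact variant may differ):

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length:
    0.0 is an exact match, 1.0 a complete mismatch."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))  # row for the empty prefix of pred
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n)

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```

On this scale, the reported drop from 0.114 to 0.013 on code recognition means predicted code blocks are now nearly character-exact against the reference.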
Multilingual support
Granite-Docling adds experimental support for Japanese, Arabic, and Chinese. IBM marks this as early-stage; English remains the primary target.
How the DocTags pathway changes document AI
Conventional OCR-to-Markdown pipelines lose structural information and complicate downstream retrieval-augmented generation (RAG). Granite-Docling emits DocTags, a compact, LLM-friendly structural grammar, which Docling converts into Markdown/HTML/JSON. This preserves table topology, inline and floating math, code blocks, captions, and reading order with explicit coordinates, improving index quality and grounding for RAG and analytics.
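To make the idea of a coordinate-bearing structural grammar concrete, here is a toy sketch. The tag vocabulary below is a simplified stand-in invented for illustration; the real DocTags grammar (its tag names, OTSL table encoding, and location tokens) is richer, and actual conversion is handled by Docling’s own tooling.

```python
import re

# Illustrative only: a toy subset of DocTags-like markup with <loc_*>
# coordinate tokens interleaved with structural elements.
DOCTAGS_SAMPLE = (
    "<title><loc_10><loc_12>Quarterly Report</title>"
    "<text><loc_10><loc_40>Revenue grew 12% year over year.</text>"
    "<code><loc_10><loc_80>print('hello')</code>"
)

def toy_doctags_to_markdown(doctags: str) -> str:
    """Walk structural elements in reading order, drop coordinate tokens,
    and map each element to its Markdown equivalent."""
    out = []
    for tag, body in re.findall(r"<(title|text|code)>(.*?)</\1>", doctags, re.S):
        body = re.sub(r"<loc_\d+>", "", body).strip()  # strip coordinates
        if tag == "title":
            out.append(f"# {body}")
        elif tag == "code":
            out.append(f"```\n{body}\n```")
        else:
            out.append(body)
    return "\n\n".join(out)

print(toy_doctags_to_markdown(DOCTAGS_SAMPLE))
```

The point of the intermediate form is that the lossy step (dropping coordinates and element types) happens last and on demand, so an index or RAG pipeline can keep the structured version while renderers emit Markdown or HTML.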
Inference and integration
- Docling integration (recommended): the docling CLI/SDK automatically pulls Granite-Docling and converts PDFs, office documents, and images to multiple formats. IBM positions the model as a component within Docling pipelines rather than a general-purpose VLM.
- Runtimes: works with Transformers, vLLM, ONNX, and MLX; a dedicated MLX build is optimized for Apple Silicon. A Hugging Face Space provides an interactive demo (ZeroGPU).
- License: Apache-2.0.
Why Granite-Docling?
For enterprise document AI, small VLMs that preserve structure reduce inference cost and pipeline complexity. Granite-Docling replaces several single-purpose models (layout, OCR, table, code, equations) with a single component that emits a richer intermediate representation, improving downstream retrieval and conversion fidelity. The measured gains in TEDS for tables, F1 for code and equations, and reduced instability make it a practical upgrade from SmolDocling for production workflows.
Summary
Granite-Docling-258M marks a significant advance in compact, structure-preserving document AI. By combining IBM’s Granite backbone, the SigLIP2 vision encoder, and the nanoVLM training framework, it delivers enterprise-ready performance across tables, equations, code, and multilingual text, all while remaining lightweight and open-source under Apache 2.0. With measurable gains over its SmolDocling predecessor and seamless integration into Docling pipelines, Granite-Docling provides a practical foundation for document conversion and RAG workflows where precision and reliability are essential.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.