Zhipu AI Introduces GLM-OCR: A 0.9B Multimodal OCR Model for Document Parsing and Key Information Extraction (KIE)

Why Does Document OCR Remain a Hard Engineering Problem?

What does it take to make OCR useful for real documents instead of clean demo images? And can a compact multimodal model handle parsing, tables, formulas, and structured extraction without turning inference into a resource bonfire?

That is the problem targeted by GLM-OCR, released by researchers from Zhipu AI and Tsinghua University. The research team presents GLM-OCR as a 0.9B-parameter compact multimodal model for document understanding. It combines a 0.4B CogViT visual encoder, a lightweight cross-modal connector, and a 0.5B GLM language decoder. The stated goal is to balance document recognition quality against lower latency and lower computational cost than larger multimodal systems.

Traditional OCR systems are often good at plain-text transcription, but they struggle when documents contain mixed layouts, tables, formulas, code blocks, seals, and structured fields. Recent multimodal large language models improve document understanding, but the research team argues that their size and standard autoregressive decoding make them expensive for edge deployment and large-scale production. GLM-OCR is positioned as a smaller system built for these deployment constraints, rather than a general-purpose vision-language model adapted to OCR as an afterthought.

A Compact Architecture Built for OCR Workloads

A key technical point of this work is the use of Multi-Token Prediction (MTP). Standard autoregressive decoding predicts one token at a time, which is not ideal for OCR-style tasks where outputs are often deterministic and locally structured. GLM-OCR instead predicts multiple tokens per step: the model is trained to predict 10 tokens per step and generates 5.2 tokens per decoding step on average at inference time, yielding roughly a 50% throughput improvement. To keep memory overhead manageable, the implementation uses a parameter-sharing scheme across the draft models.
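The mechanism can be sketched as a draft-and-verify loop. The snippet below is a toy illustration, not the GLM-OCR implementation: `draft_tokens` stands in for the shared draft heads proposing `k` tokens in one pass, and `verify` accepts the longest prefix that agrees with a token-by-token prediction (here a made-up `+1 mod 50` rule). In the real model, acceptance varies, which is why the average of 5.2 accepted tokens is below the 10 drafted.

```python
def draft_tokens(context, k):
    """Hypothetical draft heads: propose k next tokens in one forward pass."""
    return [(context[-1] + i + 1) % 50 for i in range(k)]

def verify(context, proposals):
    """Hypothetical verifier: accept drafts while each matches the
    standard autoregressive prediction (a toy +1 rule here)."""
    accepted, last = [], context[-1]
    for tok in proposals:
        if tok != (last + 1) % 50:
            break
        accepted.append(tok)
        last = tok
    return accepted

def mtp_decode(prompt, max_len=21, k=10):
    out, steps = list(prompt), 0
    while len(out) < max_len:
        accepted = verify(out, draft_tokens(out, k))
        # Fall back to a single standard token if no draft is accepted.
        out.extend(accepted or [(out[-1] + 1) % 50])
        steps += 1
    return out, steps
```

Because each step can emit several tokens, the number of decoding steps, the dominant latency cost, drops well below the output length.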

Two-Stage Layout Parsing Instead of Flat Page Reading

At the system level, GLM-OCR adopts a two-stage pipeline. The first stage uses PP-DocLayout-V3 for layout analysis, which detects structured regions on the page. The second stage performs parallel region-level recognition over these detected regions. This matters because the model is not simply reading a whole page left-to-right as a generic vision-language model might. It first breaks the page down into semantically meaningful regions, which improves efficiency and makes the system more robust on documents with challenging layouts.
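The two-stage flow can be sketched as follows. These are stand-in functions: `detect_regions` plays the role of PP-DocLayout-V3 and `recognize_region` the role of the GLM-OCR decoder; the region types, toy outputs, and thread-pool fan-out are illustrative assumptions, not the real API.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_regions(page):
    # Stage 1: layout analysis returns typed regions (toy output).
    return [
        {"type": "text",    "bbox": (0, 0, 100, 20),  "crop": "Title..."},
        {"type": "table",   "bbox": (0, 30, 100, 80), "crop": "| a | b |"},
        {"type": "formula", "bbox": (0, 90, 50, 100), "crop": "E=mc^2"},
    ]

def recognize_region(region):
    # Stage 2: region-level recognition; each region type maps to its
    # own structured output (Markdown text, table, LaTeX formula, ...).
    return {"type": region["type"], "content": region["crop"]}

def parse_page(page):
    regions = detect_regions(page)
    # Regions are independent, so recognition can fan out in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(recognize_region, regions))

results = parse_page("page.png")
```

The key design point is that stage 2 operates on independent crops, which is what makes the parallel recognition, and the robustness to unusual page layouts, possible.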

Document Parsing and KIE Use Different Output Paths

The architecture also separates two related document tasks. For document parsing, the pipeline uses layout detection and region processing to produce structured outputs such as Markdown and JSON. For Key Information Extraction (KIE), the research team describes a different path: the full document image is fed to the model with a task prompt, and the model directly generates JSON containing the extracted fields. That distinction matters because GLM-OCR is not presented as a single monolithic page-to-text model. It is a structured generation system with different operating modes depending on the task.
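A minimal sketch of the KIE path, under stated assumptions: `call_model` is an invented placeholder for a GLM-OCR inference call, and the prompt wording and field names are made up for illustration. The shape of the flow, whole image plus task prompt in, validated JSON out, is what the source describes.

```python
import json

def call_model(image, prompt):
    # Stand-in for a GLM-OCR inference call; returns a raw JSON string.
    return '{"invoice_no": "INV-001", "total": "42.00"}'

def extract_fields(image, fields):
    # The full document image goes to the model with a task prompt;
    # no layout-detection stage is involved on this path.
    prompt = f"Extract the fields {fields} from this document as JSON."
    raw = call_model(image, prompt)
    try:
        parsed = json.loads(raw)  # JSON validity also appears in the RL reward
    except json.JSONDecodeError:
        return None
    return {k: parsed.get(k) for k in fields}

result = extract_fields("invoice.png", ["invoice_no", "total"])
```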

A 4-Stage Training Pipeline with Task-Specific Rewards

The training recipe is split into four stages. Stage 1 trains the vision encoder on image-text pairs and grounding or retrieval data. Stage 2.1 performs multimodal pretraining on image-text, document parsing, grounding, and VQA data. Stage 2.2 adds the MTP objective. Stage 3 is supervised fine-tuning on OCR-specific tasks including text recognition, formula transcription, table structure recovery, and KIE. Stage 4 applies reinforcement learning using GRPO. The reward design is task-specific: Normalized Edit Distance for text recognition, CDM score for formula recognition, TEDS score for table recognition, and field-level F1 for KIE, along with structural penalties such as repetition penalties, malformed-structure penalties, and JSON validation constraints.
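Two of those reward signals are easy to make concrete. Below are textbook formulations of Normalized Edit Distance (for text recognition) and field-level F1 (for KIE); they illustrate the metrics named above, not the exact training code.

```python
def normalized_edit_distance(pred, ref):
    """Levenshtein distance divided by the longer string's length,
    so 0.0 is a perfect transcription and 1.0 a total miss."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))  # rolling DP row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = min(dp[j] + 1,            # deletion
                      dp[j - 1] + 1,        # insertion
                      prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev, dp[j] = dp[j], cur
    return dp[n] / max(m, n, 1)

def field_f1(pred, ref):
    """Exact-match precision/recall over extracted key-value pairs."""
    tp = sum(1 for k, v in pred.items() if ref.get(k) == v)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

In a GRPO setup, scores like these (plus the structural penalties) would be combined into a scalar reward per sampled output.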

Benchmark Results Show Strong Performance, With Important Caveats

On public benchmarks, GLM-OCR reports strong results across multiple document tasks. It scores 94.6 on OmniDocBench v1.5, 94.0 on OCRBench (Text), 96.5 on UniMERNet, 85.2 on PubTabNet, and 86.0 on TEDS_TEST. For KIE, it reports 93.7 on Nanonets-KIE and 86.1 on Handwritten-KIE. The research team notes that results for Gemini-3-Pro and GPT-5.2-2025-12-11 are shown only for reference and are excluded from the best-score ranking, which is an important detail when interpreting claims about model leadership.

https://arxiv.org/pdf/2603.10910

The benchmark story is strong, but it needs careful phrasing. GLM-OCR achieves the highest reported scores among the evaluated non-reference models on OmniDocBench v1.5, OCRBench (Text), UniMERNet, and TEDS_TEST. On PubTabNet, however, it does not lead overall; MinerU 2.5 reports 88.4 versus GLM-OCR's 85.2. For KIE, GLM-OCR outperforms the listed open-source competitors, but Gemini-3-Pro scores higher on both Nanonets-KIE and Handwritten-KIE in the reference column. So the results support a strong competitive claim, but not a blanket "best at everything" claim.

Deployment Details

The research team states that GLM-OCR supports vLLM, SGLang, and Ollama, and can be fine-tuned through LLaMA-Factory. They also report throughput of 0.67 images/s and 1.86 PDF pages/s under their evaluation setup. In addition, they describe a MaaS API priced at 0.2 RMB per million tokens, with example cost estimates for scanned images and simple-layout PDFs. These details suggest that GLM-OCR is being framed as both a research model and a deployable system.
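The pricing translates into simple back-of-envelope arithmetic. In the sketch below, only the 0.2 RMB per million tokens figure comes from the source; the tokens-per-page value is a made-up assumption purely for illustration, since real token counts depend on document density.

```python
PRICE_RMB_PER_MTOKEN = 0.2  # stated MaaS price

def estimate_cost_rmb(pages, tokens_per_page=1500):
    # tokens_per_page is an assumed placeholder, not a published figure.
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * PRICE_RMB_PER_MTOKEN

# e.g. 10,000 pages at the assumed 1,500 tokens/page -> 3.0 RMB
cost = estimate_cost_rmb(10_000)
```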

Key Takeaways

  • GLM-OCR is a compact 0.9B multimodal OCR model built from a 0.4B CogViT encoder and a 0.5B GLM decoder.
  • It uses Multi-Token Prediction (MTP) to improve decoding efficiency, reaching 5.2 tokens per step on average and roughly 50% higher throughput.
  • The model uses a two-stage pipeline: PP-DocLayout-V3 handles layout analysis, then GLM-OCR performs parallel region-level recognition.
  • It supports both document parsing and KIE: parsing outputs Markdown/JSON, while KIE directly generates JSON from the full document image.
  • Benchmark results are strong but not universal wins: GLM-OCR leads several reported non-reference benchmarks, but MinerU 2.5 is higher on PubTabNet, and Gemini-3-Pro is higher on the reference-only KIE scores.
