CNL-TN-2026-049 Technical Note

olmOCR-2 as the Canonical PDF Ingestion Engine

Michael P. Hamilton, Ph.D.

Published: April 10, 2026

Version: 1

Abstract

The first-light OCR evaluation on April 10, 2026 (see REPORT.md in `tests/ocr_bench/`) compared two text extraction pipelines against four PDFs: PyMuPDF `fitz` for text-layer extraction and `glm-ocr` via local Ollama for vision-model OCR. That study recommended a standalone `ocr_document` worker wrapping glm-ocr as the Phase 1 build for the Collaboratory document pipeline. Before committing to that build, an evening follow-up benchmark was run against a third candidate: allenai olmOCR-2, the flagship OCR model from the Allen Institute, in the `mlx-community/olmOCR-2-7B-1025-5bit` port running natively on Apple Silicon through `mlx_vlm.server` and the `olmocr` CLI.

The follow-up benchmark added a second, harder corpus of three PDFs: a theconversation.com science article, a nine-page 2017 Elsevier review on microbial mat ecosystems with 101 references and dense chemistry notation, and a twenty-four-page OSU Extension gardening guide with a forty-five-row regional planting calendar. olmOCR-2 delivered clean academic apparatus (bibliographic headers, ToC with original pagination, hierarchical section numbering, inline citations, full references with DOI URLs, Unicode subscripts in chemistry notation), reconstructed the centerpiece forty-five-row by ten-column planting calendar as structured HTML, and correctly classified decorative stock photography as non-semantic (stripping the images while retaining their text attribution credits). Steady-state processing time held at approximately 28.7 seconds per page for text-dense content and 5.6 seconds per page for sparse slide decks. A single batch invocation processed all 37 pages with zero failures.

The v0.2 recommendation is superseded. **olmOCR-2 is the canonical PDF-to-markdown path for the Collaboratory document pipeline.** Phase 1 is retargeted from a glm-ocr wrapper to a `read_document_olmocr` worker. Phase 2 (hybrid fitz-plus-glm-ocr orchestration with selective trigger logic) is no longer needed; olmOCR-2 subsumes both pipelines. This note documents the benchmark methodology, the evidence supporting the decision, the known caveats that remain open, and the forward work required to ship olmOCR-2 as a production worker.
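As a rough planning aid, the per-page figures reported above can be turned into a wall-clock estimate for a batch. The sketch below is illustrative only: the function name `estimate_batch_seconds`, the content-type labels, and the example page mix are assumptions introduced here, not part of the benchmark, and the estimate assumes the two measured rates transfer to other documents of similar density.

```python
# Per-page processing times reported in the abstract (seconds per page).
# The dictionary keys are hypothetical labels chosen for this sketch.
SECONDS_PER_PAGE = {"text_dense": 28.7, "sparse_slides": 5.6}


def estimate_batch_seconds(page_counts: dict[str, int]) -> float:
    """Estimate wall-clock seconds for a batch from page counts per content type."""
    unknown = set(page_counts) - set(SECONDS_PER_PAGE)
    if unknown:
        raise ValueError(f"no measured rate for: {sorted(unknown)}")
    return sum(SECONDS_PER_PAGE[kind] * n for kind, n in page_counts.items())


# Hypothetical workload for illustration; not a breakdown of the 37 benchmark pages.
example_mix = {"text_dense": 100, "sparse_slides": 20}
total = estimate_batch_seconds(example_mix)
print(f"{total / 60:.1f} minutes for {sum(example_mix.values())} pages")
```

At these rates, text-dense pages dominate any estimate, so a worker's time budget is better sized from dense page counts than from total page counts.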

---

AI Collaboration Disclosure

Claude (Anthropic): Analysis

This technical note was developed collaboratively with Claude (Anthropic, claude-opus-4-6) via Cowork. Claude conducted the olmOCR-2 installation walkthrough, ran the smoke tests and the 37-page batch benchmark, performed the structural analysis of the resulting markdown outputs, and drafted this note. The author staged the test corpus, approved each installation step, and takes full responsibility for the content, accuracy, and conclusions.

Human review: full

Cite This Document

Michael P. Hamilton, Ph.D. (2026). "olmOCR-2 as the Canonical PDF Ingestion Engine." Canemah Nature Laboratory Technical Note CNL-TN-2026-049. https://canemah.org/archive/CNL-TN-2026-049

BibTeX

@techreport{hamilton2026olmocr,
  author = {Hamilton, Michael P.},
  title = {olmOCR-2 as the Canonical PDF Ingestion Engine},
  institution = {Canemah Nature Laboratory},
  year = {2026},
  number = {CNL-TN-2026-049},
  month = apr,
  url = {https://canemah.org/archive/document.php?id=CNL-TN-2026-049},
  abstract = {The first-light OCR evaluation on April 10, 2026 (see REPORT.md in `tests/ocr\_bench/`) compared two text extraction pipelines against four PDFs: PyMuPDF `fitz` for text-layer extraction and `glm-ocr` via local Ollama for vision-model OCR. That study recommended a standalone `ocr\_document` worker wrapping glm-ocr as the Phase 1 build for the Collaboratory document pipeline. Before committing to that build, an evening follow-up benchmark was run against a third candidate: allenai olmOCR-2, the flagship OCR model from the Allen Institute, in the `mlx-community/olmOCR-2-7B-1025-5bit` port running natively on Apple Silicon through `mlx\_vlm.server` and the `olmocr` CLI.

  The follow-up benchmark added a second, harder corpus of three PDFs: a theconversation.com science article, a nine-page 2017 Elsevier review on microbial mat ecosystems with 101 references and dense chemistry notation, and a twenty-four-page OSU Extension gardening guide with a forty-five-row regional planting calendar. olmOCR-2 delivered clean academic apparatus (bibliographic headers, ToC with original pagination, hierarchical section numbering, inline citations, full references with DOI URLs, Unicode subscripts in chemistry notation), reconstructed the centerpiece forty-five-row by ten-column planting calendar as structured HTML, and correctly classified decorative stock photography as non-semantic (stripping the images while retaining their text attribution credits). Steady-state processing time held at approximately 28.7 seconds per page for text-dense content and 5.6 seconds per page for sparse slide decks. A single batch invocation processed all 37 pages with zero failures.

  The v0.2 recommendation is superseded. **olmOCR-2 is the canonical PDF-to-markdown path for the Collaboratory document pipeline.** Phase 1 is retargeted from a glm-ocr wrapper to a `read\_document\_olmocr` worker. Phase 2 (hybrid fitz-plus-glm-ocr orchestration with selective trigger logic) is no longer needed; olmOCR-2 subsumes both pipelines. This note documents the benchmark methodology, the evidence supporting the decision, the known caveats that remain open, and the forward work required to ship olmOCR-2 as a production worker.}
}

Permanent URL: https://canemah.org/archive/document.php?id=CNL-TN-2026-049