olmOCR-2 as the Canonical PDF Ingestion Engine
A Three-Way Benchmark Selecting the Document-to-Markdown Path for the Macroscope Collaboratory
Canemah Nature Laboratory Technical Note Series
Document ID: CNL-TN-2026-048
Version: 0.1 (First draft, same-day post-benchmark)
Date: April 10, 2026
Author: Michael P. Hamilton, Ph.D.
Affiliation: Canemah Nature Laboratory, Oregon City, Oregon
AI Assistance Disclosure: This technical note was developed collaboratively with Claude (Anthropic, claude-opus-4-6) via Cowork. Claude conducted the olmOCR-2 installation walkthrough, ran the smoke tests and the 37-page batch benchmark, performed the structural analysis of the resulting markdown outputs, and drafted this note. The author staged the test corpus, approved each installation step, and takes full responsibility for the content, accuracy, and conclusions.
Abstract
The first-light OCR evaluation on April 10, 2026 (see REPORT.md in tests/ocr_bench/) compared two text extraction pipelines against four PDFs: PyMuPDF fitz for text-layer extraction and glm-ocr via local Ollama for vision-model OCR. That study recommended a standalone ocr_document worker wrapping glm-ocr as the Phase 1 build for the Collaboratory document pipeline. Before committing to that build, an evening follow-up benchmark was run against a third candidate: allenai olmOCR-2, the flagship OCR model from the Allen Institute, in the mlx-community/olmOCR-2-7B-1025-5bit port running natively on Apple Silicon through mlx_vlm.server and the olmocr CLI.
The follow-up benchmark added a second, harder corpus of three PDFs: a theconversation.com science article, a nine-page 2017 Elsevier review on microbial mat ecosystems with 101 references and dense chemistry notation, and a twenty-four-page OSU Extension gardening guide with a forty-five-row regional planting calendar. olmOCR-2 delivered clean academic apparatus (bibliographic headers, ToC with original pagination, hierarchical section numbering, inline citations, full references with DOI URLs, Unicode subscripts in chemistry notation), reconstructed the centerpiece forty-five-row by ten-column planting calendar as structured HTML, and correctly classified decorative stock photography as non-semantic (stripping the images while retaining their text attribution credits). Steady-state throughput held at approximately 28.7 seconds per page for text-dense content and 5.6 seconds per page for sparse slide decks. Zero failures across 37 pages in a single batch invocation.
The v0.2 recommendation is superseded. olmOCR-2 is the canonical PDF-to-markdown path for the Collaboratory document pipeline. Phase 1 is retargeted from a glm-ocr wrapper to a read_document_olmocr worker. Phase 2 (hybrid fitz-plus-glm-ocr orchestration with selective trigger logic) is no longer needed; olmOCR-2 subsumes both pipelines. This note documents the benchmark methodology, the evidence supporting the decision, the known caveats that remain open, and the forward work required to ship olmOCR-2 as a production worker.
1. Background
1.1 What the first-light test established
The April 10 afternoon benchmark (tests/ocr_bench/REPORT.md, sections 1 through 9) compared fitz and glm-ocr over ten pages of four PDFs covering synthetic, academic, and slide-deck content. Raw metrics favored fitz by roughly a factor of 490 in speed and by character count on clean text-layer documents. Page-by-page inspection told a different story: glm-ocr reconstructed tables that fitz flattened into linear streams, normalized typographic ligatures that fitz propagated verbatim, repaired damaged text-layer artifacts that fitz dutifully copied (most notably a miscoded "5.5070" where the source page read "5.5%"), reflowed line-broken paragraphs into natural sentences, and filtered publisher boilerplate. On one axis glm-ocr lost: bibliographic reference list fidelity, where it captured only one DOI-like string versus fitz's five on a reference page.
The conclusion was a two-phase build. Phase 1: a standalone ocr_document worker wrapping glm-ocr for callers that already knew they wanted OCR quality. Phase 2: a hybrid pdf_ingest wrapper with per-page selective invocation of glm-ocr on pages where ligatures, detected tables, or sparse text signaled that fitz's output was inadequate.
1.2 Why a third candidate
Three facts motivated evaluating olmOCR-2 before committing to Phase 1. First, the Allen Institute released olmOCR-2 in late 2025 as their flagship OCR system specifically targeted at academic document ingestion, with particular attention to reference lists, tables, and scientific notation. Second, the mlx-community team published a 5-bit quantized MLX port (mlx-community/olmOCR-2-7B-1025-5bit) that runs natively on Apple Silicon through mlx-vlm, eliminating any need for an external API or a CUDA host. Third, the §4.5 reference-list gap in the first-light study was the single quality axis on which glm-ocr lost to fitz; if olmOCR-2 closed that gap, it would be a strictly better candidate than the Phase 1 path.
The evaluation was conducted in a narrowly scoped session focused on three questions: does olmOCR-2 install cleanly in a dedicated conda environment on Data, does the mlx_vlm.server plus olmocr CLI plumbing produce valid markdown output on representative PDFs, and is the output quality distinguishable from glm-ocr on the same corpus.
2. Environment
The olmOCR-2 evaluation environment was assembled from scratch:
- conda env olmocr2 at /Users/mikehamilton/miniforge3/envs/olmocr2/, Python 3.11.15
- mlx-vlm 0.4.4 (PyPI), providing the OpenAI-compatible /v1/chat/completions endpoint via python -m mlx_vlm.server
- olmocr 0.4.27 (PyPI), providing the workspace-based CLI with done_flags/, worker_locks/, results/ (JSONL), and markdown/ (mirroring absolute source paths)
- mlx-community/olmOCR-2-7B-1025-5bit model, 6.2 GB snapshot, downloaded via huggingface_hub.snapshot_download
- torch 2.11.0 and torchvision 0.26.0, required to instantiate the torch-backed Qwen2VLVideoProcessor that transformers' AutoProcessor walks from the model's video_preprocessor_config.json
The torch dependency was not obvious from the documentation and surfaced as an ImportError on the first mlx_vlm.server startup: transformers' AutoProcessor walks every sub-processor declared in the model config, and because olmOCR-2 is built on Qwen2.5-VL it inherits a video preprocessor config that requires torch even though olmOCR-2 itself is a still-image pipeline. An 80 MB arm64 torch wheel plus a 1.9 MB torchvision wheel resolved it; the second server startup logged "Model and processor loaded successfully." and began serving requests.
The server was started with its default binding of 0.0.0.0:8080, which made it reachable over the home LAN and Tailscale. This is acceptable for a supervised smoke test but will be changed to --host 127.0.0.1 before the server is persisted as a launchd user agent.
3. Corpus
Five PDFs were evaluated across the afternoon and evening sessions. Two were reruns of the first-light corpus for apples-to-apples comparison with fitz and glm-ocr; three were staged specifically to stress-test academic apparatus, figure classification, and layout-heavy extraction.
| Label | File | Pages | Character of PDF |
|---|---|---|---|
| test1_baseline | Archive/Documents/test-pdfs/test1.pdf | 1 | Synthetic 1-page report with 5x5 data table |
| electrify_now_slidedeck | Archive/Documents/PDF/ARC-PDF-000002/pdf00002.pdf | 17 | Sparse slide deck, image-heavy, low text density |
| first_contact_article | tests/ocr_bench/olmocr2/test-pdf/'First contact' ... .pdf | 4 | theconversation.com science article, Asgard archaea, 2026-04-09 |
| microbial_mats | tests/ocr_bench/olmocr2/test-pdf/main 10.pdf | 9 | 2017 Elsevier review, dense academic apparatus, 101 refs |
| growing_your_own | tests/ocr_bench/olmocr2/test-pdf/growing-your-ownedr2small.pdf | 24 | OSU Extension magazine-layout gardening guide |
The three new PDFs were chosen to cover capabilities the first-light corpus did not exercise: a dense academic review with an exhaustive reference list (to close the §4.5 reference fidelity gap), a magazine-layout extension guide with a centerpiece multi-region planting calendar (to stress table reconstruction beyond the synthetic 5x5 grid), and a current news article to probe the truncation-under-ambiguous-endings case.
4. Results
4.1 Throughput
Two distinct throughput regimes emerged, driven by output token count rather than input token count:
- Sparse content, ~5.6s/page. The 17-page Electrify Now slide deck completed in 94.77 seconds wall-clock, producing 7,422 characters of output (roughly 437 characters per page). Slides with one headline and a single illustration do not require olmOCR-2 to emit much markdown, so per-page cost is dominated by model setup and the vision-encoder pass rather than the autoregressive decoder.
- Text-dense content, ~28.7s/page. The 37-page evening batch (4 + 9 + 24 pages) completed in 1062.15 seconds, averaging 28.71 seconds per page. Academic prose, reference lists, and layout-heavy guide pages all sit in this regime. The decoder is the bottleneck: olmOCR-2 is emitting substantial markdown per page and autoregressive decoding scales linearly with output length.
Zero failures across all 37 pages. Zero fallback pages (olmocr's internal mechanism for when the model refuses a page). All pages completed on their first attempt.
The olmocr JSONL metadata reports total-input-tokens: 0 and total-output-tokens: 0 across all runs. This is a cosmetic reporting gap: olmocr's metrics hook expects vllm's native Prometheus format, which mlx_vlm.server does not emit. Inference is unaffected; we simply lack token-level accounting in the logs.
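Until the metrics gap is closed, seconds-per-page arithmetic from wall-clock totals is the only per-page cost signal available. The two regimes above fall straight out of the reported numbers:

```python
# Per-page cost recomputed from the wall-clock totals reported above.
sparse_s_per_page = 94.77 / 17    # Electrify Now slide deck (sparse regime)
dense_s_per_page = 1062.15 / 37   # evening batch, 4 + 9 + 24 pages (dense regime)

print(round(sparse_s_per_page, 2))  # 5.57
print(round(dense_s_per_page, 2))   # 28.71
```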
4.2 Academic apparatus
The microbial mats paper was the key evidence for §4.5 reference-list fidelity, which was the one axis on which glm-ocr had lost to fitz in the first-light study. olmOCR-2 captured:
- Full bibliographic header: authors, affiliations, corresponding-author email, article history (received/accepted/available online)
- Keywords block verbatim
- Abstract verbatim
- Table of contents preserving the original journal pagination (pages 49 through 55)
- Hierarchical section numbering intact through three levels (1, 1.1, 2, 2.1, 2.1.1)
- All 101 inline [N] citations preserved in running text
- All 101 references reconstructed at the end of the document with full DOI URLs
This closes the §4.5 gap by a wide margin. fitz had captured five DOI-like strings on a reference page; olmOCR-2 captured 101 complete references.
4.3 Unicode scientific notation
The microbial mats paper contains chemistry notation in running text: O₂, H₂, CH₄, and (CH₂O)ₙ. A grep for Unicode subscript characters returned 8 hits in the output, zero mojibake. olmOCR-2 is emitting actual Unicode subscripts rather than flattening to O2 or H2O. For any downstream pipeline that indexes scientific literature or feeds an LLM trying to reason about stoichiometry, this preserves information that would otherwise require chemistry-aware post-processing to recover.
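The subscript check is easy to automate for future corpus runs. A minimal sketch (the character set and function name are illustrative, not part of the benchmark harness):

```python
SUBSCRIPT_CHARS = "₀₁₂₃₄₅₆₇₈₉ₙ"  # digit subscripts plus the 'n' used in (CH₂O)ₙ

def count_subscripts(text: str) -> int:
    """Count Unicode subscript characters, e.g. to confirm chemistry
    notation survived OCR instead of flattening to 'O2' or 'CH4'."""
    return sum(text.count(c) for c in SUBSCRIPT_CHARS)

print(count_subscripts("O₂, H₂, CH₄, and (CH₂O)ₙ"))  # 5
```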
4.4 Figure anchors and classification
Two figure-handling behaviors are worth calling out.
On the microbial mats paper, Figure 1 is an informational diagram illustrating mat structure types. olmOCR-2 emitted a proper figure anchor with the complete caption text retained in the alt-text slot and an accompanying narrative reference to the figure in the running prose.
On the OSU Extension gardening guide, zero image anchors were emitted despite 13 photographs appearing in the source PDF. Initial reading of the metrics suggested content loss. Inspection showed the opposite: the 13 Photo: <credit> attribution lines were retained as text, and the images themselves were correctly identified as decorative stock photography (clover, compost bins, vegetables) with no semantic information. This is the correct call for a text-extraction pipeline. olmOCR-2 is evidently classifying figures on an "is this informational?" axis rather than blanket-emitting every bitmap anchor. The classification choice is not documented in the olmOCR-2 README but the behavior is consistent across the corpus: informational figures got anchors, decorative figures did not, attribution credits were retained either way.
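The anchor-versus-credit accounting that surfaced this behavior can be scripted for future runs. A sketch, assuming markdown image syntax for anchors and the OSU guide's "Photo:" attribution style (both names here are illustrative):

```python
import re

def figure_stats(markdown: str) -> tuple[int, int]:
    """Return (image_anchors, photo_credits) for a markdown document,
    to distinguish 'images dropped' from 'images classified as decorative
    while their attribution lines were retained'."""
    anchors = len(re.findall(r"!\[[^\]]*\]\([^)]*\)", markdown))
    credits = len(re.findall(r"(?im)^photo:\s", markdown))
    return anchors, credits

sample = "![Figure 1: mat structure](fig1.png)\nPhoto: J. Smith\nPhoto: OSU\n"
print(figure_stats(sample))  # (1, 2)
```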
4.5 Table reconstruction at scale
The first-light study showed that glm-ocr could reconstruct a 5x5 synthetic data table while fitz could not. The evening benchmark pushed this test considerably harder with the OSU Extension "Dates for planting vegetables in Oregon" table, which spans the full centerfold of the gardening guide and is the booklet's primary data asset.
olmOCR-2 reconstructed it as a 45-row by 10-column HTML <table> spanning 508 lines of output. Columns: Vegetable, Start plants indoors this long before planting date, Planting dates, Region 1 (Coast, Astoria to Brookings), Region 2 (Western valleys, Portland to Roseburg), Region 3 (High elevations, mountains, and plateaus of central and eastern Oregon), Region 4 (Columbia and Snake valleys, Hermiston, Pendleton, Ontario), Amount to plant for family of four, Distance between rows, Distance apart in the row. The heavy abbreviation in date-range cells ("Aug.–Oct., May–June"), the region-specific "not suitable" cells, and the mix of numeric and textual entries all came through structured.
A second, smaller table (raised-bed nitrogen application rates, 4 product rows × 2 columns) was reconstructed with its footnote attached.
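Shape claims like "45 rows by 10 columns" can be verified mechanically against the emitted HTML. A minimal stdlib sketch (class and function names are illustrative; it assumes one table per fragment):

```python
from html.parser import HTMLParser

class TableShape(HTMLParser):
    """Count the rows and the widest row of an HTML <table> fragment."""
    def __init__(self):
        super().__init__()
        self.rows = 0
        self._cols = 0
        self.max_cols = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1
            self._cols = 0
        elif tag in ("td", "th"):
            self._cols += 1
            self.max_cols = max(self.max_cols, self._cols)

def table_shape(html_fragment: str) -> tuple[int, int]:
    parser = TableShape()
    parser.feed(html_fragment)
    return parser.rows, parser.max_cols

print(table_shape("<table><tr><td>a</td><td>b</td></tr></table>"))  # (1, 2)
```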
4.6 Comparison to first-light baselines
On the two reruns from the first-light corpus:
- test1_baseline: olmOCR-2 produced 1329 characters including a figure anchor and a full HTML <table> with the 5x5 data grid. fitz had produced 719 characters with a flattened linear data stream. glm-ocr had produced 850 characters with a markdown table. olmOCR-2 matches glm-ocr on structure and adds the figure anchor.
- electrify_now_slidedeck: olmOCR-2 processed all 17 pages in one pass, producing 7,422 characters with one image reference (a rebate map) and two HTML Watt Diet Calculator tables. fitz and glm-ocr in the first-light run had processed only 3 of 17 pages (max_pages=3), making a direct per-page comparison unavailable. The 17-page olmOCR-2 output covers the deck end-to-end at 5.57 seconds per page, which is acceptable for an interactive workflow.
5. Caveats
Three known issues remain open.
5.1 First Contact article page 4 possible truncation
The four-page theconversation.com article (4,662 characters of output) appears to end mid-section at a "Weaving western science with Indigenous knowledge" header with no body content following it. The total word count (~900 words for four pages) is below the expected 1200-1500 word density for the publication's long-form science articles. Two hypotheses are possible: olmOCR-2 dropped the final page's body content, or the article's page 4 was in fact a thin wrap-up containing only the header and a call-to-action that olmOCR-2 correctly omitted. A side-by-side check against the source PDF is required before this can be closed. The test should be repeated on two or three other news-article PDFs to determine whether this is a class issue or a one-off.
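A cheap screening heuristic for the follow-up runs could flag suspiciously thin outputs for manual review. The 300-words-per-page floor below is an assumption derived from the 1200-1500 word expectation for a four-page article, not a measured constant:

```python
def looks_truncated(markdown: str, pages: int, min_words_per_page: int = 300) -> bool:
    """Flag outputs whose word count falls below a per-page floor, prompting
    a side-by-side check against the source PDF (heuristic, not proof)."""
    return len(markdown.split()) < pages * min_words_per_page

# The First Contact output: ~900 words over 4 pages -> flagged for review.
print(looks_truncated("word " * 900, 4))  # True
```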
5.2 Token counts report as zero
As noted in §4.1, olmocr's logs report zero input and output tokens because mlx_vlm.server does not emit Prometheus metrics in the format olmocr's metrics hook expects. This is cosmetic and does not affect inference, but it deprives downstream consumers of a per-page cost accounting signal. A patch to the mlx_vlm.server metrics endpoint would close this, but it is out of scope for the Collaboratory integration.
5.3 Server default binding
mlx_vlm.server binds to 0.0.0.0:8080 by default, which exposes it across the home LAN and Tailscale. For a supervised interactive smoke test this is acceptable; for a persisted launchd agent it is not. The plist for com.macroscope.olmocr_server should pass --host 127.0.0.1 --port 8080 explicitly, and the read_document_olmocr worker should connect to http://127.0.0.1:8080/v1/chat/completions.
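Once the server is loopback-bound, the worker should confirm it is up before dispatching a batch. A minimal probe sketch using only the stdlib; the endpoint and model name are the ones from this note, and the payload shape is the generic OpenAI-compatible chat format rather than anything olmocr-specific:

```python
import json
import urllib.request

ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"
MODEL = "mlx-community/olmOCR-2-7B-1025-5bit"

def health_payload() -> bytes:
    # Minimal OpenAI-compatible chat payload; the message content is a stub.
    return json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }).encode()

def probe(url: str = ENDPOINT, timeout: float = 5.0) -> bool:
    # True if the loopback-bound server answers the chat endpoint.
    # Requires mlx_vlm.server to be running; raises URLError otherwise.
    req = urllib.request.Request(
        url, data=health_payload(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status == 200
```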
6. Decision
olmOCR-2 is selected as the canonical PDF-to-markdown path for the Macroscope Collaboratory document ingestion pipeline.
The v0.2 Phase 1 recommendation (a glm-ocr wrapper worker) is superseded. The v0.2 Phase 2 recommendation (a hybrid fitz-plus-glm-ocr orchestration with per-page selective trigger logic) is no longer needed; olmOCR-2 subsumes both pipelines by handling clean text layers, damaged text layers, scanned content, tables, figures, and academic apparatus in a single pass at acceptable throughput on local hardware.
glm-ocr remains useful for ad-hoc single-page vision checks via the existing Ollama integration but is no longer recommended for primary document ingestion. fitz remains the right tool for pure text-layer extraction when speed dominates and structure does not matter — for example, when a caller needs only a full-text search index and will not attempt to reconstruct tables, figures, or reading order.
7. Forward Work
Four concrete next steps, in dependency order:
1. Scaffold a read_document_olmocr worker at ~/Macroscope/Projects/Live/workers/workers/read_document_olmocr/worker.py. A thin wrapper following the echo and web_search worker patterns: it accepts a PDF path (and optionally a page range) as its job payload, shells out to the olmocr CLI against a locally running mlx_vlm.server, points olmocr at a job-scoped workspace under the worker framework's scratch directory, waits for the done flag, and returns the resulting markdown file path and metadata block. Target size: under 200 lines. No changes to existing workers.
2. Persist mlx_vlm.server as a launchd user agent. ~/Library/LaunchAgents/com.macroscope.olmocr_server.plist with --host 127.0.0.1 --port 8080 --model mlx-community/olmOCR-2-7B-1025-5bit, RunAtLoad true, KeepAlive true, and StandardOutPath/StandardErrorPath under /var/log/macroscope/olmocr/. This will be the first persistent mlx-vlm process on Data and will require a short stress test (dispatch a batch of three to five PDFs, observe memory behavior) before it is left running.
3. Re-verify First Contact article page 4. Before the worker ships, run a side-by-side comparison of the olmOCR-2 output against the source PDF to confirm whether the apparent truncation is content loss or correct behavior. Repeat on two or three additional news-article PDFs to calibrate.
4. Defer the scanned-PDF and handwriting corpus gaps from the first-light §6 to a follow-up evaluation specifically scoped to olmOCR-2. The first-light study called out three corpus gaps: a truly scanned PDF with no text layer, handwritten content, and reference-list fidelity at scale. The third is now closed by the microbial mats benchmark. The first two remain open and should be tested against olmOCR-2 before the worker is declared stable for archival ingestion of the CNL document collection.
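The worker's core can be sketched in a few lines. The olmocr module path, CLI flags, and markdown-mirroring rule shown here are assumptions from the benchmark session and would need to be confirmed against the real workspace layout before shipping:

```python
import pathlib
import subprocess

def markdown_path(workspace: str, pdf_path: str) -> pathlib.Path:
    # olmocr mirrors source paths under <workspace>/markdown/; keying on the
    # PDF stem is an assumption to verify against the actual workspace.
    return pathlib.Path(workspace) / "markdown" / (pathlib.Path(pdf_path).stem + ".md")

def read_document_olmocr(pdf_path: str, workspace: str) -> pathlib.Path:
    """Shell out to the olmocr CLI against a locally running mlx_vlm.server
    and return the mirrored markdown path. The invocation below is a sketch,
    not the confirmed CLI surface."""
    subprocess.run(
        ["python", "-m", "olmocr.pipeline", workspace, "--pdfs", pdf_path],
        check=True,
    )
    return markdown_path(workspace, pdf_path)

print(markdown_path("/tmp/ws", "/docs/paper.pdf"))  # /tmp/ws/markdown/paper.md
```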
8. Artifacts
The benchmark outputs are preserved at:
- ~/Macroscope/Projects/Workbench/Collaboratory/tests/ocr_bench/results/test1_baseline_olmocr2.md
- ~/Macroscope/Projects/Workbench/Collaboratory/tests/ocr_bench/results/electrify_now_slidedeck_olmocr2.md
- ~/Macroscope/Projects/Workbench/Collaboratory/tests/ocr_bench/results/first_contact_olmocr2.md
- ~/Macroscope/Projects/Workbench/Collaboratory/tests/ocr_bench/results/microbial_mats_olmocr2.md
- ~/Macroscope/Projects/Workbench/Collaboratory/tests/ocr_bench/results/growing_your_own_olmocr2.md
The raw olmocr workspace (JSONL results, done flags, worker locks, markdown mirroring source paths) is preserved at:
~/Macroscope/Projects/Workbench/Collaboratory/tests/ocr_bench/olmocr2/workspace/
The three new test PDFs are staged at:
~/Macroscope/Projects/Workbench/Collaboratory/tests/ocr_bench/olmocr2/test-pdf/
The three-way comparison narrative (fitz vs glm-ocr vs olmOCR-2) is documented in ~/Macroscope/Projects/Workbench/Collaboratory/tests/ocr_bench/REPORT.md §10.
End of CNL-TN-2026-048 v0.1.
Permanent URL: https://canemah.org/archive/document.php?id=CNL-TN-2026-048