CNL-SP-2026-050 Specification

Macroscope Collaboratory Librarian: Document Ingestion Pipeline Specification

Michael P. Hamilton, Ph.D.
Published: April 11, 2026

Version: 1

Cite This Document

Michael P. Hamilton, Ph.D. (2026). "Macroscope Collaboratory Librarian: Document Ingestion Pipeline Specification." Canemah Nature Laboratory Specification CNL-SP-2026-050. https://canemah.org/archive/CNL-SP-2026-050

BibTeX

@manual{hamilton2026macroscope,
  author = {Hamilton, Michael P., Ph.D.},
  title = {Macroscope Collaboratory Librarian: Document Ingestion Pipeline Specification},
  institution = {Canemah Nature Laboratory},
  year = {2026},
  number = {CNL-SP-2026-050},
  month = {april},
  url = {https://canemah.org/archive/document.php?id=CNL-SP-2026-050},
  abstract = {The Macroscope Collaboratory Librarian is the fourth panel of the I3 shell (Instruments, Investigations, Intelligence, Librarian) and serves as the research surface for a unified scientific document archive. This specification describes the worker-agent pipeline that ingests PDFs from a raw filesystem into the `macroscope.documents` table and the unified archive layout, transforming approximately 5,300 legacy PDFs (post-deduplication from an initial corpus of 8,515 files) into a searchable, metadata-rich, human-AI collaborative resource. The pipeline comprises three stages: a filesystem preprocessing layer that flattens, deduplicates on SHA-256, and rename-normalizes incoming PDFs into a stable Input directory; a two-worker ingestion stage in which `read\_document\_basic` converts PDF to Markdown and extracts figures using PyMuPDF and pdfplumber while `classify\_document` runs a three-tier escalation ladder to generate bibliographic metadata; and a persistence stage that writes normalized records to the documents table with content-hash deduplication and archive-layout file placement. The classification ladder escalates from deterministic regex (Tier 1, zero cost) through Ollama gemma4:31b-cloud (Tier 2, local-daemon cloud inference) to Anthropic Claude Haiku with optional Sonnet escalation (Tier 3, paid API) before flagging unresolved documents for manual review (Tier 4). A dedicated `ingestion\_batch\_jobs` table tracks per-document state to provide crash resumability and a progress surface for the Librarian UI. This specification defines worker contracts, the v2.0 metadata schema, database extensions, escalation policy, cost envelope, and the downstream interface to the Librarian web panel.},
  keywords = {document ingestion pipeline; pdf text extraction; llm classification; metadata generation; three-tier escalation; ollama gemma; anthropic claude; content-hash deduplication; worker agent architecture; macroscope collaboratory; librarian ui; batch reprocessing}
}

Permanent URL: https://canemah.org/archive/document.php?id=CNL-SP-2026-050