Macroscope Collaboratory Librarian: Document Ingestion Pipeline Specification
Abstract
The Macroscope Collaboratory Librarian is the fourth panel of the I3 shell (Instruments, Investigations, Intelligence, Librarian) and serves as the research surface for a unified scientific document archive. This specification describes the worker-agent pipeline that ingests PDFs from a raw filesystem into the `macroscope.documents` table and the unified archive layout, transforming approximately 5,300 legacy PDFs (post-deduplication from an initial corpus of 8,515 files) into a searchable, metadata-rich, human-AI collaborative resource. The pipeline comprises three stages: a filesystem preprocessing layer that flattens, deduplicates on SHA-256, and rename-normalizes incoming PDFs into a stable Input directory; a two-worker ingestion stage in which `read_document_basic` converts PDF to Markdown and extracts figures using PyMuPDF and pdfplumber while `classify_document` runs a three-tier escalation ladder to generate bibliographic metadata; and a persistence stage that writes normalized records to the documents table with content-hash deduplication and archive-layout file placement. The classification ladder escalates from deterministic regex (Tier 1, zero cost) through Ollama gemma4:31b-cloud (Tier 2, local-daemon cloud inference) to Anthropic Claude Haiku with optional Sonnet escalation (Tier 3, paid API) before flagging unresolved documents for manual review (Tier 4). A dedicated `ingestion_batch_jobs` table tracks per-document state to provide crash resumability and a progress surface for the Librarian UI. This specification defines worker contracts, the v2.0 metadata schema, database extensions, escalation policy, cost envelope, and the downstream interface to the Librarian web panel.
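The escalation policy summarized above can be illustrated with a minimal sketch. It is not the specification's `classify_document` implementation: the names `Classification`, `tier1_regex`, `classify_with_ladder`, and `unresolved_stub` are hypothetical, the DOI rule is only an assumed example of a deterministic Tier 1 check, and the Ollama and Anthropic calls are replaced by stubs so that only the control flow is shown. Each tier either returns metadata or returns `None` to escalate; a document that no tier resolves is flagged for manual review.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# A tier receives the document's Markdown text and returns a metadata dict,
# or None when it cannot classify the document and the ladder should escalate.
Tier = Callable[[str], Optional[dict]]


@dataclass
class Classification:
    tier: int                      # tier that resolved the document (or the review tier)
    metadata: Optional[dict]       # None when no automated tier succeeded
    needs_manual_review: bool = False


# Assumed example of a deterministic rule: look for a DOI-like string.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def tier1_regex(text: str) -> Optional[dict]:
    """Hypothetical Tier 1 check: succeed only if a DOI-like string is present."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trim trailing punctuation that often follows a DOI in running text.
    return {"doi": match.group(0).rstrip(".,;)")}


def unresolved_stub(text: str) -> Optional[dict]:
    """Stand-in for a model-backed tier (Tier 2 or Tier 3); always escalates."""
    return None


def classify_with_ladder(text: str, tiers: Sequence[Tier]) -> Classification:
    """Try each tier in order and stop at the first one that returns metadata."""
    for number, tier in enumerate(tiers, start=1):
        metadata = tier(text)
        if metadata is not None:
            return Classification(tier=number, metadata=metadata)
    # No automated tier resolved the document: flag it for manual review.
    return Classification(tier=len(tiers) + 1, metadata=None, needs_manual_review=True)


ladder = [tier1_regex, unresolved_stub, unresolved_stub]
resolved = classify_with_ladder("See doi:10.1234/abcd.5678 for details.", ladder)
flagged = classify_with_ladder("A scanned page with no identifiers.", ladder)
```

With three automated tiers in the list, `resolved.tier` is 1 and `flagged.tier` is 4 with `needs_manual_review` set, mirroring the Tier 4 manual-review outcome. Ordering the tiers from cheapest to most expensive is what keeps paid API calls limited to documents the earlier tiers could not resolve.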
---
**Keywords:** document ingestion pipeline; pdf text extraction; llm classification; metadata generation; three-tier escalation; ollama gemma; anthropic claude; content-hash deduplication; worker agent architecture; macroscope collaboratory; librarian ui; batch reprocessing
---
AI Collaboration Disclosure
This specification was developed with assistance from Anthropic Claude (Sonnet 4.6) in a Cowork session on April 11, 2026. The AI contributed to architecture formalization, worker contract documentation, schema enumeration, and manuscript drafting. The author takes full responsibility for the content, accuracy, and conclusions.
Human review: full
Cite This Document
Permanent URL: https://canemah.org/archive/document.php?id=CNL-SP-2026-050