CNL-SP-2026-050 Specification

Macroscope Collaboratory Librarian: Unified Catalog and Document Ingestion Pipeline Specification

Michael P. Hamilton , Ph.D.

Published: April 11, 2026 Version: 2

Macroscope Collaboratory Librarian: Unified Catalog and Document Ingestion Pipeline Specification

Document ID: CNL-SP-2026-050
Version: 0.2
Date: April 11, 2026
Author: Michael P. Hamilton, Ph.D.
Status: Draft for review
Visibility: Public

AI Assistance Disclosure: This specification was developed with assistance from Anthropic Claude (Opus 4.6) in Cowork sessions on April 11, 2026. The AI contributed to architecture formalization, worker contract documentation, schema design, CSV column mapping, and manuscript drafting. The author takes full responsibility for the content, accuracy, and conclusions.

Abstract

The Macroscope Collaboratory Librarian is the fourth panel of the I3 shell (Instruments, Investigations, Intelligence, Librarian) and serves as the research surface for a unified personal holdings catalog and scientific document archive. This specification describes an extensible catalog model that supports four item types — documents, books, videos, and TV shows — unified under a single catalog_items supertable with per-type extension tables and a shared asset layer, plus the worker-agent pipeline that ingests PDFs from a raw filesystem into the catalog. The initial corpus comprises approximately 5,300 legacy PDFs (post-deduplication from 8,515 files), 1,311 books cataloged in BookBuddy, and 652 movies and TV shows cataloged in MovieBuddy, totaling approximately 7,264 items at launch. The document ingestion pipeline comprises three stages: a filesystem preprocessing layer that flattens, deduplicates on SHA-256, and rename-normalizes incoming PDFs; a two-worker ingestion stage in which read_document_basic converts PDF to Markdown using PyMuPDF and pdfplumber while classify_document runs a three-tier escalation ladder to generate bibliographic metadata; and a persistence stage that writes normalized records to the catalog with content-hash deduplication and archive-layout file placement. The classification ladder escalates from deterministic regex (Tier 1, zero cost) through Ollama gemma4:31b-cloud (Tier 2, local-daemon cloud inference) to Anthropic Claude Haiku with optional Sonnet escalation (Tier 3, paid API) before flagging unresolved documents for manual review (Tier 4). A dedicated ingestion_batch_jobs table tracks per-document state to provide crash resumability and a progress surface for the Librarian UI. Book and video records are imported from BookBuddy and MovieBuddy CSV exports via a parallel CSV import pipeline. The catalog model is designed to accommodate future item types (audio recordings, photographs, field specimens) without schema changes to the core supertable. This specification defines the catalog data model, worker contracts, the v2.0 metadata schema, database schema, escalation policy, cost envelope, CSV import pipeline, and the downstream interface to the Librarian web panel.

Keywords: unified catalog; document ingestion pipeline; extensible schema; pdf text extraction; llm classification; metadata generation; three-tier escalation; ollama gemma; anthropic claude; content-hash deduplication; worker agent architecture; macroscope collaboratory; librarian ui; csv import; bookbuddy; moviebuddy

1. Introduction

1.1 Context

The Macroscope Collaboratory is a four-panel research application hosted inside the Claude Desktop app via a dedicated Model Context Protocol (MCP) server. The application is organized around the I3 shell: Instruments (live sensor dashboards across EARTH, LIFE, HOME, SELF domains), Investigations (structured human-AI research workflows), Intelligence (the STRATA micro-agent layer and narrative synthesis), and Librarian (the unified holdings catalog and research-grade reading surface).

The Librarian panel is the newest of the four and subsumes both a document corpus accumulated over approximately two decades across research projects at the James San Jacinto Mountains Reserve, Blue Oak Ranch Reserve, the Center for Embedded Networked Sensing (CENS), the Canemah Nature Laboratory, and personal research activity; a personal book library cataloged in BookBuddy; and a movie and TV show collection cataloged in MovieBuddy. The document corpus includes peer-reviewed journal articles, conference papers, arXiv preprints, government reports, agency technical documents, field manuals, and internal working notes. As of April 2026 the document archive contains 5,299 unique PDFs totaling 32.27 GB after dedup, the book catalog contains 1,311 titles across 1,000 unique authors, and the movie/TV catalog contains 652 titles with rich metadata including TMDB and IMDB identifiers.

The Librarian builds on a prior project, "The Librarian" (Hypatia), developed circa 2024–2025 within the MacroNexus/MEO ecosystem. Hypatia was a personal-use book catalog and AI-assisted search interface using Ollama gemma3:4b and a "Hypatia" librarian AI persona. Key design precedents from that project — the ARC-XXX-NNNNNN canonical identifier scheme, the pdf_documents table schema, three FULLTEXT indices, natural-language intent-to-query translation, per-item chat with session-based memory, and the REGEXP/LIKE fallback search layer — are inherited and formalized in this specification.

1.2 Objectives

This specification defines the extensible catalog model and the end-to-end document ingestion pipeline that transforms raw PDFs, BookBuddy CSVs, and MovieBuddy CSVs into structured, searchable records in librarian_db and the unified archive filesystem layout. Specifically, the system must:

Provide a single unified catalog (catalog_items) that accommodates documents, books, videos, TV shows, and future item types without schema changes to the core table.
Preserve per-type specialized metadata in extension tables (items_document, items_book, items_video, items_tv_show) while sharing universal ownership and acquisition fields.
Track file assets (PDFs, ebooks, video files, cover images) in a shared item_assets table linked to catalog items.
Normalize document filenames to ASCII-safe identifiers while preserving original names as recoverable data.
Deduplicate document content byte-for-byte via SHA-256 before any expensive processing.
Convert each PDF to a Markdown representation with figure extraction.
Generate bibliographic metadata using the cheapest reliable method available per document via a three-tier escalation ladder.
Persist all document artifacts under a stable archive path keyed by canonical identifier.
Import book records from BookBuddy CSV exports with full field mapping.
Import movie and TV show records from MovieBuddy CSV exports with full field mapping including structured cast JSON.
Track batch-reprocessing state in a database table that supports resumability after crashes and provides a real-time progress view to the Librarian UI.
Reuse the same worker contracts for single-document streaming ingestion as the archive grows post-migration.

1.3 Scope

This specification covers the extensible catalog data model (Section 8), the preprocessing tools (Section 3), the worker-agent ingestion stage (Sections 4–6), the metadata schema (Section 7), the archive filesystem layout (Section 9), the batch reingest workflow (Section 10), the Librarian UI integration (Section 11), the CSV import pipeline (Section 14), and open questions (Section 12). It does not cover the Librarian UI implementation itself, the STRATA Intelligence integration, or the Claude Desktop MCP server glue — each of which is documented in separate notes referenced in Section 13.

1.4 Non-Goals

The following are explicitly out of scope for this specification:

OCR of scanned or image-only PDFs. The current pipeline assumes extractable text. A dedicated tier for OCR-required documents is deferred to a future extension.
Multi-language support. The pipeline targets English-language scientific documents. Book and movie records in other languages are stored as-is but not processed for full-text search.
Full-text search backend selection beyond MySQL FULLTEXT. The three-index strategy (narrow metadata, full body, combined) established by the original Librarian is adopted; external search engines are a downstream decision.
Access control beyond the existing Collaboratory three-tier auth (anonymous, user, admin).
Streaming video playback or audio playback infrastructure.
Automated metadata enrichment from external APIs (Google Books, TMDB) beyond initial CSV import. Manual enrichment is supported via the Librarian UI.

2. System Architecture

2.1 High-Level Pipeline

 Legacy archive / CSV exports     Preprocessing / Import          Worker ingestion             Persistence              Librarian UI
 (nested PDFs in Archive/,   →    (flatten, dedup, rename   →     (read_document_basic,  →     (librarian_db,     →    (search,
  BookBuddy CSV,                   for PDFs; CSV parse            classify_document              Archive/                 filters,
  MovieBuddy CSV)                  for books/movies)              for documents only)            documents/)              viewer,
                                                                                                                         Hypatia chat)

Each stage produces durable artifacts that survive crashes in the next stage: preprocessing writes JSONL manifests, CSV import writes per-batch import logs, worker ingestion writes per-document Markdown and metadata files and database rows, and persistence writes to the unified archive layout. The pipeline is reentrant at every boundary.

2.2 Component Inventory

Component	Location	Role
`flatten_pdfs.py`	`collectors/batch_reingest/`	Walk nested source tree, flatten PDFs into Input/, write flatten manifest
`dedup_pdfs.py`	`collectors/batch_reingest/`	SHA-256 exact-duplicate pass, write dedup manifest
`rename_pdfs.py`	`collectors/batch_reingest/`	Assign sequential ASCII-safe filenames, write rename manifest
`read_document_basic`	`workers/read_document_basic/`	PDF-to-Markdown conversion, figure extraction
`classify_document`	`workers/classify_document/`	Three-tier LLM classification and metadata generation
`batch_reingest.py`	`collectors/batch_reingest/`	Orchestrate the two workers across the full document corpus
`import_books.py`	`collectors/batch_reingest/`	BookBuddy CSV import to catalog_items + items_book
`import_movies.py`	`collectors/batch_reingest/`	MovieBuddy CSV import to catalog_items + items_video + items_tv_show
`librarian_db`	MySQL	Unified catalog database
`Archive/documents/`	Filesystem	Unified per-document artifact layout
Librarian UI	`Collaboratory/librarian/`	Search, filter, pipeline dashboard, Hypatia chat, manual review queue

2.3 Worker Framework Context

The two ingestion workers execute inside the Macroscope worker framework, a filesystem job queue at ~/Macroscope/Projects/Live/workers/ managed by launchd. Jobs are dispatched by an MCP tool (dispatch_worker_job) that writes a JSON payload to the worker inbox, and the worker writes its result as {job}.result.json on completion. The queue provides isolation between the MCP layer and long-running ingestion work, allows streaming jobs to be triggered from Claude Desktop conversations without blocking the UI, and gives launchd a uniform supervision target.

3. Preprocessing Pipeline

The preprocessing tools are filesystem-only utilities that prepare a clean, stable Input directory before any worker-based ingestion begins. Each tool is invoked manually, is dry-run by default, accepts --execute to commit changes, and writes a JSONL manifest that the downstream batch reingest tool consumes. All operations are reversible; nothing is deleted.

3.1 flatten_pdfs.py

Walks ~/Macroscope/Archive/Documents/PDF/ recursively, moves every .pdf file (case-insensitive) to ~/Macroscope/Archive/Documents/Input/ as a flat directory, and writes flatten_manifest.jsonl recording every move. Filename collisions are resolved by appending _1, _2, etc. to the stem. Non-PDF files are left in place. The original nested directory is preserved until the operator manually removes it.

Empirical result on the April 11, 2026 run: 8,515 PDFs flattened, 788 name collisions auto-suffixed, 50.97 GB total.

3.2 dedup_pdfs.py

Hashes every PDF in Input/ with SHA-256 (1 MB streaming chunks), groups files by hash, selects a canonical member per group (shortest filename with lexicographic tiebreak), and moves non-canonical aliases to ~/Macroscope/Archive/Documents/Input_duplicates/. Writes dedup_manifest.jsonl in Input/ with one record per unique document containing the canonical path, file size, alias count, and full list of alias filenames. Only catches byte-for-byte duplicates; near-duplicate detection via normalized-text SHA-256 is performed later during worker ingestion (see Section 6.7).

Empirical result: 5,298 unique documents, 3,217 aliases moved, 18.70 GB reclaimed.

3.3 rename_pdfs.py

Assigns sequential ASCII-safe filenames of the form input_NNNNN.pdf (five-digit zero-padded, lowercase) to every PDF in Input/ in alphabetic order and writes rename_manifest.jsonl in Input/ with one record per file containing sequence, new_name, original_name, and size_bytes. Uses ensure_ascii=False to preserve original filenames with Unicode punctuation, emoji, embedded newlines, or leading whitespace. Refuses to run if the manifest already exists or if any existing filename matches the target pattern, enforcing single-shot idempotency.

Empirical result: 5,299 files renamed (includes one file manually restored from Input_duplicates).

3.4 Manifest Contract

The three manifests together form a complete audit trail for every byte in the Input directory. The batch reingest tool reads all three at startup and constructs the mapping from sequential input filename to original filename, alias list, and pre-flatten source path. The original filename becomes the value of catalog_items.original_filename and is preserved verbatim regardless of what characters it contained.

4. Worker Agent Architecture

4.1 Worker Framework Overview

Each worker is a Python module at workers/<worker_name>/worker.py with a standard entry point that takes a single argument: the path to a JSON job file in the worker inbox. The worker reads the job, performs its task, and writes a sibling .result.json file. On success the job file is moved to done/; on failure it is moved to failed/ with a .error.json sibling containing stderr and traceback. The worker framework supports concurrent execution via launchd agent cloning but is run single-threaded for ingestion to avoid rate-limit contention on external APIs.

4.2 Job Contract

Every worker job carries the following minimum envelope:

{
  "job_id": "uuid-v4",
  "job_type": "read_document_basic",
  "payload": { "...": "..." },
  "meta": {
    "caller": "batch_reingest",
    "batch_job_id": 4271,
    "created_at": "2026-04-11T18:00:00Z"
  }
}

The payload structure is worker-specific. The meta.batch_job_id references the row in ingestion_batch_jobs so the orchestrator can update per-document state on job completion.

4.3 Two-Worker Split Rationale

Document extraction (PDF to Markdown, figure extraction, page counting) and document classification (bibliographic metadata generation) are split into separate workers for three reasons:

Cost isolation. Extraction is local-CPU work with no external API cost. Classification may involve Anthropic API calls charged per token. Splitting the workers lets the orchestrator retry classification without repaying extraction.
Resumability. If classification fails midway through the batch, the orchestrator can re-run classify_document against the already-extracted Markdown without re-reading the source PDF.
Model substitution. A future worker variant (e.g., read_document_ocr for image-only PDFs) can be slotted in upstream of classification without changes to the classifier.

The read_document_olmocr worker, which provided heavier LLM-based extraction for scanned and image-only PDFs, is retired in this specification and its source is preserved under Projects/Reference/ for historical consultation.

5. Document Extraction (`read_document_basic`)

5.1 Extraction Engine

The worker wraps EnhancedPDFExtractor located at ~/Macroscope/Projects/Live/collectors/scripts/enhanced_pdf_extractor.py with the use_llm=False flag. The extractor uses PyMuPDF (fitz) as the primary engine for text and layout, falling back to pdfplumber for table detection when PyMuPDF's layout heuristics indicate tabular content. No LLM is invoked at this stage; all classification is delegated to the downstream classify_document worker.

5.2 Payload Contract

{
  "pdf_path": "/Users/.../Input/input_00042.pdf",
  "output_markdown_path": "/tmp/.../input_00042.md",
  "output_metadata_path": "/tmp/.../input_00042.meta.json",
  "output_figures_dir": "/tmp/.../input_00042.figures/",
  "workspace_dir": "/tmp/.../workspace/"
}

5.3 Output Artifacts

Markdown file — the extracted text, with headings inferred from font-size hierarchy, embedded image placeholders (![figure_N](figure_N.png)), and table blocks in pipe syntax where detected.
Metadata JSON — shallow extraction-stage metadata: num_pages, num_figures, num_tables, page_size, producer, pdf_version, has_toc, has_embedded_text, plus any bibliographic hints discovered in the PDF's own /Info dictionary (title, author, subject, keywords).
Figures directory — one file per extracted figure, named figure_NN.png (or original format where preserved).

5.4 Limitations

The basic extractor does not perform OCR, will return empty Markdown for image-only PDFs, and does not attempt to reconstruct column order in complex multi-column layouts. Documents flagged as "no extractable text" by PyMuPDF are written to ingestion_batch_jobs with state = 'extraction_failed' and classification_status = 'manual_review', bypassing the classifier entirely.

6. Three-Tier Classification Ladder (`classify_document`)

6.1 Escalation Policy

The classify_document worker attempts bibliographic metadata generation through a sequence of methods ordered by cost and reliability, advancing to the next tier only when the current tier returns insufficient confidence or fails entirely. The policy is deterministic: given the same input Markdown, a document takes the same path through the ladder on every run.

Confidence is expressed as an enumeration (low, medium, high) and the worker advances if the current tier returns low. Each tier has the opportunity to supplement fields already filled by earlier tiers rather than replacing them wholesale (see Section 6.6).

6.2 Tier 1: Deterministic Regex

A pure-Python module inspects the first 4 kilobytes of extracted Markdown and the original filename for deterministic bibliographic patterns: arXiv identifiers (arXiv:YYMM.NNNNN), DOI prefixes (10.NNNN/...), standard journal-article headers (title on line N, authors on line N+1, affiliations indented), USGS and NOAA report numbers, and a small set of common preprint server URLs. When a pattern is found, the module resolves the bibliographic record by local lookup (no network call) and returns a metadata dict with extraction_confidence = 'high' and extraction_method = 'regex'.

Approximately 40–60% of well-formatted post-2005 journal articles resolve at Tier 1 with zero LLM cost. The tier is always attempted first; it is free and fast.

6.3 Tier 2: Ollama gemma4:31b-cloud

When Tier 1 returns low confidence, the worker sends the first 8,000 characters of extracted Markdown (plus the filename and any Tier 1 partial metadata) to a local Ollama daemon running on localhost:11434/api/chat with model = "gemma4:31b-cloud". The model is served by Ollama Turbo as a cloud-hosted inference endpoint; authentication is handled transparently by the local daemon using a device key obtained via ollama signin, so no Bearer token appears in the Python code path.

Request parameters: format = "json", num_ctx = 8192, num_predict = 1024, keep_alive = "30m". Transport is urllib HTTPS POST with a 60-second timeout. The system prompt instructs the model to return a strict JSON object matching the v2.0 metadata schema, with empty strings for fields it cannot determine.

Measured latency (April 11, 2026 validation): approximately 0.7 seconds for trivial inputs, 1–3 seconds for typical journal articles. Cost is zero at the point of call, absorbed by the Ollama Turbo subscription already paid for at the infrastructure level.

The returned metadata is merged with Tier 1 output, the extraction method is recorded as llm with llm_model = "gemma4:31b-cloud", and confidence is set from the model's own self-assessment clamped to the valid enumeration range.

6.4 Tier 3: Anthropic Claude Haiku / Sonnet

When Tier 2 returns low confidence or produces malformed JSON, the worker invokes the Anthropic API with Claude Haiku 4.5 as the default model. The same first 8,000 characters of Markdown are sent with a strict system prompt, and the response is parsed as JSON. If Haiku also returns low confidence or the document exceeds an empirically-determined complexity threshold (length, number of authors, presence of non-English fragments), the worker escalates a second time to Claude Sonnet 4.6 for a final attempt.

API credentials are loaded from an environment variable at worker start; the key never appears in job payloads or result files. Token usage is captured from the response and recorded in ingestion_batch_jobs for cost accounting.

Cost envelope (estimate, April 2026 pricing): Haiku at approximately $0.001 per document, Sonnet at approximately $0.01 per document. For a 5,300-document backfill with an expected 10% Tier 3 rate and 20% of those escalating to Sonnet, total Tier 3 cost is on the order of $10–$15 for the full batch.

6.5 Tier 4: Manual Review Flag

If Tier 3 Sonnet also fails to return adequate metadata, the worker sets classification_status = 'manual_review' on the catalog_items row and writes a human-readable explanation to classification_notes describing why each tier was insufficient. The document is still written to the archive with partial metadata (whatever Tier 1 and Tier 2 produced), and the Librarian UI surfaces it in an admin queue for human curation.

6.6 Metadata Merge Semantics

Because each tier may run even after a partial Tier 1 resolution, the worker uses additive merge semantics: later-tier values supplement but do not overwrite earlier-tier values unless the earlier value was an empty string. The exception is extraction_confidence, which uses a max-rank semantics (low < medium < high) so that any high-confidence field elevates the document's overall confidence even if other fields remain uncertain. The merge logic is implemented in classify_document/worker.py::merge_metadata().

6.7 Normalized-Text Deduplication

After extraction but before tier dispatch, the worker computes sha256_text, a SHA-256 over the normalized Markdown (lowercase, whitespace collapsed, headings stripped). If this hash matches an existing row in items_document, the worker short-circuits: no classification is performed, and the new ingestion is recorded as an alias of the existing document in document_aliases. This catches near-duplicates that differ only in PDF rendering but carry identical text content — a common case with preprint-to-published transitions and multiple journal reprints.

7. Metadata Schema v2.0

The v2.0 metadata schema is the canonical format for the per-document metadata.json file stored alongside each PDF in the archive. It is structured as six top-level sections, each populated by a specific pipeline stage.

7.1 source

Populated by the preprocessing and extraction stages. Fields: original_filename, source_type (one of legacy_archive, streaming_upload, api_import), source_path (pre-flatten filesystem path where the file was first discovered), source_url (empty for legacy archive, set for API-imported documents), ingested_by (operator identifier), ingested_at (ISO timestamp), archive_path (target path under Archive/documents/).

7.2 bibliography

Populated by the classification stage. Fields: title (string), authors (ordered array of strings), doi (nullable string), journal (nullable string), publication_date (ISO date, nullable), abstract (string), keywords (array of strings), document_type (enum: journal_article, conference_paper, preprint, book, book_chapter, report, thesis, working_paper, field_note, other).

7.3 classification

Populated by the classification stage. Fields: extraction_method (regex | llm | hybrid | manual), extraction_confidence (low | medium | high), llm_model (nullable string naming the model that produced the final metadata), classification_status (pending | complete | manual_review), classification_notes (free text), tier1_attempted, tier2_attempted, tier3_attempted (booleans), token_usage_input, token_usage_output (integers, for Tier 3 only).

7.4 extraction

Populated by read_document_basic. Fields: num_pages, num_figures, num_tables, num_references, page_size, producer, pdf_version, has_toc, has_embedded_text, extraction_duration_ms.

7.5 visuals

Populated by read_document_basic. Fields: figures_path (relative path to figures directory under archive), visuals_sidecar_path (path to visuals.json sidecar listing each figure with its filename, page number, and placeholder caption). Detailed per-figure metadata (captions, bounding boxes, OCR text) is deferred to a future pipeline extension.

7.6 content_hashes

Populated by read_document_basic and stored alongside the catalog item row. Fields: sha256_pdf (hash of source bytes, used for exact-duplicate detection), sha256_text (hash of normalized Markdown, used for near-duplicate detection), size_bytes.

8. Extensible Catalog Model and Database Schema

8.1 Design Rationale

The original Librarian (Hypatia, circa 2024–2025) maintained separate tables for books (My_Books, 23 columns) and PDF documents (pdf_documents, 20 columns) with no shared identity layer. The new Librarian unifies all holdings under a single supertable (catalog_items) that carries the universal fields shared by every item type, plus per-type extension tables that carry specialized metadata, plus a shared asset table for files attached to items. This design supports the four launch item types (document, book, video, tv_show) and accommodates future types (audio recordings, photographs, field specimens) by adding a new extension table and a new item_type enum value without altering the core supertable.

8.2 Database: `librarian_db`

All catalog tables reside in a new MySQL database named librarian_db, created on both Data and Galatea. This database supersedes both the old Archive_DB (Hypatia) and the macroscope.documents legacy table. Existing data in those tables is migrated by the CSV import pipeline (Section 14) and the batch reingest workflow (Section 10) respectively; the legacy tables are not dropped but are no longer written to.

8.3 Canonical Identifiers

Every catalog item receives a canonical identifier derived from its auto-increment id and item_type:

Item Type	Prefix	Format	Example
document	ARC-DOC	ARC-DOC-NNNNNN	ARC-DOC-000042
book	ARC-BOK	ARC-BOK-NNNNNN	ARC-BOK-001311
video	ARC-VID	ARC-VID-NNNNNN	ARC-VID-000614
tv_show	ARC-TVS	ARC-TVS-NNNNNN	ARC-TVS-000028

The prefix convention inherits from Hypatia's ARC-PDF-NNNNNN scheme (renamed to ARC-DOC to be format-neutral, since future documents may be EPUB, HTML, or plain text). Hypatia's other prefixes (ARC-TAG, ARC-DIG, ARC-QOU) are available for reuse if the tag and quotes subsystems are reintroduced.

The canonical_id is computed as CONCAT(prefix, '-', LPAD(id, 6, '0')) and stored on the row. It is the stable external reference used in archive paths, UI URLs, and cross-references.

8.4 `catalog_items` Table (Supertable)

CREATE TABLE catalog_items (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    canonical_id    VARCHAR(20) NOT NULL,
    item_type       ENUM('document','book','video','tv_show') NOT NULL,
    title           VARCHAR(512) NOT NULL,
    subtitle        VARCHAR(512) NULL,
    creators_json   JSON NULL,            -- [{name, role}] — authors, directors, editors, etc.
    year            SMALLINT NULL,
    genre           VARCHAR(256) NULL,     -- comma-separated, matching Buddy app convention
    summary         TEXT NULL,
    keywords        TEXT NULL,
    language        VARCHAR(64) NULL,
    original_language VARCHAR(64) NULL,
    cover_image_path TEXT NULL,
    physical_location VARCHAR(256) NULL,
    status          VARCHAR(32) NOT NULL DEFAULT 'cataloged',
                    -- cataloged, on_shelf, loaned, lost, wishlist, digital_only,
                    -- watched, unread, read, reading, not_watched
    rating          DECIMAL(3,1) NULL,     -- 0.0–10.0
    acquired_at     DATE NULL,
    cost            DECIMAL(10,2) NULL,
    notes           TEXT NULL,
    tags            TEXT NULL,             -- comma-separated
    original_filename TEXT NULL,           -- preserved verbatim from source
    import_source   VARCHAR(64) NULL,      -- 'bookbuddy', 'moviebuddy', 'ingestion_pipeline', 'manual'
    import_batch_id VARCHAR(64) NULL,      -- links back to the import run
    classification_status ENUM('pending','complete','manual_review') NOT NULL DEFAULT 'pending',
    classification_notes TEXT NULL,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    edited_by       VARCHAR(64) NULL,

    UNIQUE KEY uk_canonical_id (canonical_id),
    KEY idx_item_type (item_type),
    KEY idx_year (year),
    KEY idx_classification_status (classification_status),
    FULLTEXT KEY ft_catalog_search (title, keywords, summary, genre, notes)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

8.5 `items_document` Extension Table

CREATE TABLE items_document (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    catalog_item_id BIGINT NOT NULL,       -- FK to catalog_items.id
    doi             VARCHAR(256) NULL,
    journal_name    VARCHAR(256) NULL,
    volume          VARCHAR(32) NULL,
    issue           VARCHAR(32) NULL,
    pages           VARCHAR(64) NULL,
    publication_date DATE NULL,
    citation_text   TEXT NULL,
    source_url      TEXT NULL,
    source_type     VARCHAR(32) NULL,       -- legacy_archive, streaming_upload, api_import
    source_path     TEXT NULL,              -- pre-flatten filesystem path
    document_type   ENUM('journal_article','conference_paper','preprint','book',
                        'book_chapter','report','thesis','working_paper',
                        'field_note','other') NOT NULL DEFAULT 'other',
    extracted_text  LONGTEXT NULL,
    disciplines_json JSON NULL,
    extraction_method VARCHAR(64) NULL,     -- regex, llm, hybrid, manual
    extraction_confidence ENUM('low','medium','high') NULL,
    llm_model       VARCHAR(64) NULL,
    tier_reached    TINYINT NULL,           -- 1..4
    sha256_pdf      CHAR(64) NULL,
    sha256_text     CHAR(64) NULL,
    size_bytes      BIGINT NULL,
    num_pages       INT NULL,
    num_figures     INT NULL,
    num_tables      INT NULL,
    archive_path    TEXT NULL,              -- relative path under Archive/documents/
    metadata_path   TEXT NULL,
    schema_version  SMALLINT NOT NULL DEFAULT 2,

    FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
    UNIQUE KEY uk_catalog_item (catalog_item_id),
    UNIQUE KEY uk_sha256_pdf (sha256_pdf),
    KEY idx_sha256_text (sha256_text),
    KEY idx_document_type (document_type),
    KEY idx_doi (doi),
    FULLTEXT KEY ft_extracted_text (extracted_text),
    FULLTEXT KEY ft_doc_metadata (citation_text, disciplines_json)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

8.6 `items_book` Extension Table

CREATE TABLE items_book (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    catalog_item_id BIGINT NOT NULL,       -- FK to catalog_items.id
    isbn            VARCHAR(20) NULL,
    issn            VARCHAR(16) NULL,
    publisher       VARCHAR(256) NULL,
    place_of_publication VARCHAR(256) NULL,
    publication_date DATE NULL,
    original_publication_date DATE NULL,
    page_count      INT NULL,
    word_count      INT NULL,
    series          VARCHAR(256) NULL,
    volume          VARCHAR(32) NULL,
    edition         VARCHAR(64) NULL,
    book_type       ENUM('hardcover','paperback','ebook','audiobook','pdf') NOT NULL DEFAULT 'paperback',
    ddc             VARCHAR(32) NULL,       -- Dewey Decimal
    lcc             VARCHAR(32) NULL,       -- Library of Congress
    lccn            VARCHAR(32) NULL,       -- LC Control Number
    oclc            VARCHAR(32) NULL,       -- OCLC number
    google_volume_id VARCHAR(32) NULL,
    narrator        VARCHAR(256) NULL,      -- for audiobooks
    illustrator     VARCHAR(256) NULL,
    translator      VARCHAR(256) NULL,
    audio_runtime   VARCHAR(32) NULL,       -- for audiobooks
    dimensions      VARCHAR(64) NULL,
    weight          VARCHAR(32) NULL,
    list_price      DECIMAL(10,2) NULL,
    reading_status  ENUM('unread','reading','read','abandoned') NULL,
    date_started    DATE NULL,
    date_finished   DATE NULL,
    current_page    INT NULL,
    recommended_by  VARCHAR(256) NULL,
    activities_json JSON NULL,             -- BookBuddy reading activity log

    FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
    UNIQUE KEY uk_catalog_item (catalog_item_id),
    KEY idx_isbn (isbn),
    KEY idx_publisher (publisher),
    KEY idx_book_type (book_type),
    KEY idx_series (series)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

8.7 `items_video` Extension Table

CREATE TABLE items_video (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    catalog_item_id BIGINT NOT NULL,       -- FK to catalog_items.id
    content_type    ENUM('movie','movie_collection') NOT NULL DEFAULT 'movie',
    runtime_minutes INT NULL,
    release_date    DATE NULL,
    original_release_year SMALLINT NULL,
    directors       VARCHAR(512) NULL,     -- denormalized for indexing
    writers         TEXT NULL,
    composers       VARCHAR(512) NULL,
    cast_json       JSON NULL,             -- [{actor, character}]
    production_companies TEXT NULL,
    production_countries VARCHAR(256) NULL,
    producers       TEXT NULL,
    cinematographers TEXT NULL,
    editors         TEXT NULL,
    media_format    VARCHAR(128) NULL,     -- comma-separated: DVD, Blu-ray, Apple m4v, etc.
    film_rating     VARCHAR(8) NULL,       -- PG-13, R, NR, G, PG, NC-17
    sound           VARCHAR(64) NULL,
    aspect_ratio    VARCHAR(32) NULL,
    number_of_discs TINYINT NULL,
    color           VARCHAR(16) NULL,
    packaging       VARCHAR(64) NULL,
    budget          VARCHAR(64) NULL,
    edition         VARCHAR(128) NULL,
    home_page       TEXT NULL,
    tmdb_id         INT NULL,
    imdb_id         VARCHAR(16) NULL,
    upc_ean13       VARCHAR(20) NULL,
    trailer_youtube_id VARCHAR(16) NULL,
    collection_type VARCHAR(64) NULL,       -- for movie collections
    watch_status    ENUM('watched','not_watched') NULL,
    date_finished   DATE NULL,
    activities_json JSON NULL,             -- MovieBuddy watch activity log

    FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
    UNIQUE KEY uk_catalog_item (catalog_item_id),
    KEY idx_tmdb_id (tmdb_id),
    KEY idx_imdb_id (imdb_id),
    KEY idx_content_type (content_type),
    KEY idx_film_rating (film_rating),
    KEY idx_directors (directors)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

8.8 `items_tv_show` Extension Table

CREATE TABLE items_tv_show (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    catalog_item_id BIGINT NOT NULL,       -- FK to catalog_items.id
    number_of_seasons INT NULL,
    first_air_year  SMALLINT NULL,
    first_air_date  DATE NULL,
    last_air_year   SMALLINT NULL,
    last_air_date   DATE NULL,
    tv_creators     TEXT NULL,
    tv_networks     VARCHAR(256) NULL,
    tv_episodes_json JSON NULL,            -- full episode structure from MovieBuddy
    tv_season       VARCHAR(64) NULL,
    directors       VARCHAR(512) NULL,
    writers         TEXT NULL,
    composers       VARCHAR(512) NULL,
    cast_json       JSON NULL,             -- [{actor, character}]
    production_companies TEXT NULL,
    production_countries VARCHAR(256) NULL,
    media_format    VARCHAR(128) NULL,
    film_rating     VARCHAR(8) NULL,
    runtime_minutes INT NULL,              -- per-episode average
    tmdb_id         INT NULL,
    imdb_id         VARCHAR(16) NULL,
    upc_ean13       VARCHAR(20) NULL,
    trailer_youtube_id VARCHAR(16) NULL,
    home_page       TEXT NULL,
    watch_status    ENUM('watched','not_watched') NULL,
    date_finished   DATE NULL,
    activities_json JSON NULL,

    FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
    UNIQUE KEY uk_catalog_item (catalog_item_id),
    KEY idx_tmdb_id (tmdb_id),
    KEY idx_imdb_id (imdb_id),
    KEY idx_first_air_year (first_air_year)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

8.9 `item_assets` Table

CREATE TABLE item_assets (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    catalog_item_id BIGINT NOT NULL,       -- FK to catalog_items.id
    asset_type      ENUM('pdf','epub','mp4','mkv','m4v','avi',
                        'cover','thumbnail','poster','figure') NOT NULL,
    file_path       TEXT NOT NULL,
    file_name       VARCHAR(512) NOT NULL,
    file_size       BIGINT NULL,
    sha256          CHAR(64) NULL,
    mime_type       VARCHAR(128) NULL,
    extracted       TINYINT NOT NULL DEFAULT 0, -- 1 if text extraction has been run
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
    KEY idx_catalog_item (catalog_item_id),
    KEY idx_asset_type (asset_type),
    KEY idx_sha256 (sha256)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

8.10 `document_aliases` Table

CREATE TABLE document_aliases (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    catalog_item_id BIGINT NOT NULL,       -- FK to catalog_items.id (must be item_type='document')
    alias_filename  TEXT NOT NULL,
    alias_source_path TEXT NULL,
    discovered_at   DATETIME NOT NULL,
    discovery_method VARCHAR(32) NOT NULL,  -- 'sha256_pdf' or 'sha256_text'

    FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
    KEY idx_catalog_item (catalog_item_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

8.11 `ingestion_batch_jobs` Table

CREATE TABLE ingestion_batch_jobs (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    input_filename  VARCHAR(32) NOT NULL,   -- input_NNNNN.pdf
    original_filename TEXT NOT NULL,
    sha256_pdf      CHAR(64) NOT NULL,
    size_bytes      BIGINT NOT NULL,
    state           ENUM('pending','extracting','extracted',
                        'classifying','classified','written',
                        'failed','skipped_duplicate') NOT NULL,
    catalog_item_id BIGINT NULL,            -- set when state='written'
    attempt_count   INT NOT NULL DEFAULT 0,
    last_error      TEXT NULL,
    tier_reached    TINYINT NULL,           -- 1..4
    extract_started_at DATETIME NULL,
    extract_ended_at   DATETIME NULL,
    classify_started_at DATETIME NULL,
    classify_ended_at  DATETIME NULL,
    written_at      DATETIME NULL,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

    KEY idx_state (state),
    KEY idx_sha256_pdf (sha256_pdf)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

8.12 `catalog_tags` Table

CREATE TABLE catalog_tags (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    tag             VARCHAR(128) NOT NULL,
    catalog_item_id BIGINT NOT NULL,

    FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
    UNIQUE KEY uk_tag_item (tag, catalog_item_id),
    KEY idx_tag (tag)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

8.13 FULLTEXT Index Strategy

The FULLTEXT index strategy inherits from the original Librarian's three-index approach and extends it for the unified catalog:

ft_catalog_search on catalog_items(title, keywords, summary, genre, notes) — the narrow metadata index shared by all item types. This is the primary search target for the Librarian's main search box.
ft_extracted_text on items_document(extracted_text) — the deep body index for document content only. Used when the user explicitly requests full-text search.
ft_doc_metadata on items_document(citation_text, disciplines_json) — supplemental metadata index for scholarly search.

Short-keyword and quoted-phrase search uses the REGEXP/LIKE fallback layer from the original Librarian's functions.php, bypassing FULLTEXT's innodb_ft_min_token_size minimum (default 3 characters). This layer is implemented in the Librarian UI's PHP search module.

9. Archive Filesystem Layout

Per-document artifacts are stored under a unified archive layout keyed by canonical_id:

~/Macroscope/Archive/documents/
    <year>/                              # derived from publication_date, or 'undated/'
        ARC-DOC-NNNNNN/
            source.pdf                   # original bytes, post-rename
            extracted.md                 # output of read_document_basic
            metadata.json                # v2.0 metadata (six sections)
            figures/
                figure_01.png
                figure_02.png
                ...
            visuals.json                 # sidecar with per-figure metadata
            thumbnail.png                # first-page thumbnail (optional)

Book and video items do not require archive directories unless they have associated file assets (ebooks, ripped video files). When file assets exist, they are stored at paths recorded in item_assets.file_path. Cover images downloaded during CSV import are stored at the path in catalog_items.cover_image_path.

Sibling buckets under ~/Macroscope/Archive/ are reserved for future expansion:

~/Macroscope/Archive/
    documents/                           # PDF documents (active, this spec)
    books/                               # ebook files (future, when ebook ingestion is built)
    video/                               # ripped video files (future)
    audio/                               # audio recordings (future)
    images/                              # photographs (future)

10. Batch Reingest Workflow

The batch reingest tool batch_reingest.py orchestrates the full document corpus ingestion. On startup it reads the three preprocessing manifests, populates ingestion_batch_jobs with one row per input_NNNNN.pdf file in state pending, and then enters a dispatch loop.

For each pending row, the orchestrator:

Transitions state to extracting, dispatches a read_document_basic job, waits for the result.
Checks sha256_text against existing items_document rows. If match, transitions state to skipped_duplicate, records the alias in document_aliases, and continues.
Transitions state to classifying, dispatches a classify_document job, waits for the result.
Assembles the full v2.0 metadata record by merging extraction output and classification output.
Inserts a catalog_items row with item_type = 'document', derives canonical_id = 'ARC-DOC-' + LPAD(id, 6, '0').
Inserts the corresponding items_document row with all classification and extraction fields.
Moves the source PDF, extracted.md, metadata.json, figures, and visuals sidecar into the archive layout.
Moves the source PDF in Input/ to ~/Macroscope/Archive/Documents/Input_ingested/.
Transitions state to written, sets catalog_item_id, and continues to the next pending row.

On crash or interrupt, the orchestrator can be restarted cleanly: any row in extracting or classifying is reset to pending on the next run, and all intermediate artifacts in the workspace directory are discarded. Rows in written are untouched. This gives full resumability with no risk of double-inserting a document.

The orchestrator logs every tier decision, every cost-bearing API call, and every state transition to ~/Macroscope/Projects/Live/collectors/batch_reingest/logs/batch_reingest.log.

11. Librarian UI Integration

The Librarian UI is a web panel rendered under ~/Macroscope/Projects/Workbench/Collaboratory/librarian/ and served at http://localhost/Projects/Workbench/Collaboratory/librarian/. It reads from librarian_db tables and from the Archive/documents/ filesystem layout. This specification defines only the read contract and UX patterns; the UI implementation is documented separately.

11.1 Unified Browse and Search

The default view lists all catalog_items rows, with tab-based or sidebar filtering by item_type. Within each type, filters include year range, genre, keyword, and type-specific fields (document_type for documents, book_type for books, film_rating for videos, network for TV shows). Full-text search is performed against the ft_catalog_search FULLTEXT index, with the REGEXP/LIKE fallback layer for short keywords and quoted phrases.

A single search box at the top of the page accepts natural language queries via the "Ask the Librarian" (Hypatia) interface described in Section 11.5.

11.2 Item Detail View

Each item has a detail page showing:

All types: Title, creators, year, genre, summary, keywords, cover image, physical location, status, rating, acquisition info, tags, alias list (for documents).
Documents: Rendered Markdown content, inline thumbnail, figures gallery, classification provenance (tier, model, confidence), download link to original PDF.
Books: Publisher, ISBN, series/volume, page count, edition, DDC/LCC, reading progress (current page, dates started/finished), BookBuddy activity log.
Videos: Director, cast, runtime, film rating, format (DVD/Blu-ray/digital), TMDB/IMDB links, trailer embed (YouTube), watch status, MovieBuddy activity log.
TV Shows: Creators, networks, season/episode structure, air date range, cast, TMDB/IMDB links.

11.3 Pipeline Progress Dashboard

An admin-only view reads ingestion_batch_jobs and shows the real-time state of the batch reprocessing run: total count, count per state, tier distribution, cumulative Tier 3 token cost, average latency per tier, and a live-updating progress bar. Rows in failed state are clickable and surface the error detail for debugging.

11.4 Manual Review Queue

A second admin-only view lists all catalog items with classification_status = 'manual_review'. Each row is editable: the curator can fill in missing fields directly, promote the item to classification_status = 'complete', or delete the row if unsuitable for the catalog.

11.5 Hypatia: AI Librarian Persona

The Librarian UI inherits and extends the "Hypatia" AI persona from the original Librarian project. Hypatia serves two functions:

Natural-language intent-to-query translation. A query like "find me anything about mycorrhizal networks published after 2015" is sent to a small LLM (Ollama gemma4:31b-cloud) with a collection-statistics preamble, and the model returns structured search parameters (keywords, field:value filters, year range) that drive the PHP search engine. This is strictly cheaper and more predictable than letting the LLM perform retrieval itself.
Per-item chat. On any item's detail page, the user can open a chat session with Hypatia contextualized to that specific item. For documents, this includes the extracted text; for books, the summary and metadata; for movies, the cast and plot summary. Conversation memory is session-scoped with a rolling 3-exchange window (inherited from the original ask_llm.php pattern). The LLM model is selectable via a dropdown (gemma4:31b-cloud default, Claude Haiku or Sonnet for deeper analysis).

11.6 Recommendation Engine

The item detail page includes a "Related Items" section using a cascade recommendation strategy inherited from the original Librarian's getBookRecommendations():

FULLTEXT MATCH on shared keywords (cross-type: a book about sensor networks may recommend a related journal article).
Same genre within the same item type.
Same creator within the same item type.
Same series (books) or same collection (movies).
Same ingest batch (documents).

The cascade stops when five recommendations are found.

11.7 Statistics Dashboard

A public-facing statistics page shows aggregate counts and distributions: total items by type, unique creators, genre distributions, year histograms, format breakdowns (book_type, media_format), reading/watch status summaries, and collection growth over time. Inherited from the original Librarian's statistics.php and expanded to cover all item types.

12. Open Questions

The following decisions are deferred to a subsequent specification revision and are not blocking the initial batch import runs:

Visuals storage depth. Whether to create a dedicated document_visuals table for per-figure metadata or defer to a visuals.json sidecar.
Thumbnail generation tool. Candidates include macOS sips, ImageMagick, and the Python pdf2image library. Selection depends on available dependencies on Data and Galatea.
Pre-existing PDFs/ versus preprocessed/ directory handling. Legacy archive trees may contain already-preprocessed Markdown and metadata files from earlier ingestion attempts; the pipeline must decide whether to trust or reprocess them.
Non-PDF content in the legacy archive. EPUB, HTML, and plain-text documents currently sit alongside PDFs in the archive source tree. A separate reader worker family will be needed.
OCR tier reintroduction. Whether and when to add an OCR-based extraction path for image-only PDFs.
Cover image sourcing. Whether to download cover images from Google Books (books) and TMDB (movies/TV) during CSV import, or to defer to manual upload. Storage path convention for covers needs definition.
Ebook-as-book ingestion. An ebook (item_type='book', book_type='ebook') that has an associated PDF or EPUB file asset should route through the document extraction pipeline for full-text indexing. The trigger mechanism (manual versus automatic on asset upload) is undecided.
Hypatia model selection. Whether the per-item chat should default to gemma4:31b-cloud or to a smaller model for cost efficiency on high-volume usage.
Cross-machine sync for librarian_db. Whether to replicate librarian_db from Galatea to Data (like macroscope), from Data to Galatea, or maintain independent instances. The catalog is personal-use only and will likely be managed from Data.

These are tracked as open items and will be resolved in v0.3 of this specification.

13. References

[1] Hamilton, M. P. (2026). "Macroscope Collaboratory: MCP-Hosted Research Application." Canemah Nature Laboratory Technical Note CNL-TN-2026-047.

[2] Hamilton, M. P. (2026). "Canemah Nature Laboratory Technical Note Style Guide." CNL-SG-2025-002 v1.1.

[3] Artifex Software (2024). "PyMuPDF: Python bindings for MuPDF." https://pymupdf.readthedocs.io (accessed April 11, 2026).

[4] Singer-Vine, J. et al. (2024). "pdfplumber: Plumb a PDF for detailed information." https://github.com/jsvine/pdfplumber (accessed April 11, 2026).

[5] Ollama (2025). "Ollama: Run Large Language Models Locally." https://ollama.ai (accessed April 11, 2026).

[6] Gemma Team (2025). "Gemma 3 Technical Report." Google DeepMind. arXiv:2503.19786.

[7] Anthropic (2026). "Claude API Reference." https://docs.claude.com (accessed April 11, 2026).

[8] Hamilton, M. P. (2026). "STRATA 2.0 Intelligence Architecture." Canemah Nature Laboratory Technical Note CNL-TN-2026-043.

[9] Hamilton, M. P. (2024–2025). "The Librarian (Hypatia)." Unpublished internal project. ~/Macroscope/Projects/Reference/MacroNexus/Active/Tools/The-Librarian/.

14. CSV Import Pipeline

14.1 Overview

Book and movie/TV records are imported from CSV exports produced by the BookBuddy and MovieBuddy iOS applications. Both apps belong to the same family and share a common export structure: comma-delimited with a header row, UTF-8 encoding, JSON-encoded arrays for structured fields (cast, episodes, activities), and comma-separated strings for multi-value fields (genres, formats, languages).

The import scripts are standalone Python utilities in collectors/batch_reingest/ that read one CSV file, map columns to the catalog schema, and insert rows into librarian_db. Each script is idempotent: re-running against the same CSV skips rows that already exist (matched on title + creator + year composite key for books, TMDB ID or title + year for videos).

14.2 `import_books.py` — BookBuddy CSV Import

Source: BookBuddy CSV export (80 columns, 1,311 rows in the April 11, 2026 export).

Column mapping to catalog_items:

BookBuddy Column	Target Column	Notes
Title	title
Subtitle	subtitle
Author	creators_json	`[{name: Author, role: "author"}]`; Illustrator, Narrator, Translator, Photographer, Editor added with respective roles when present
Year Published	year	Five rows have garbled value "100"; cleaned to NULL
Genre	genre	Verbatim
Summary	summary
Language	language
Original Language	original_language
Rating	rating	Scale 0.0–5.0, multiplied by 2 for 0.0–10.0 normalization
Physical Location	physical_location	Empty in current export
Status	status	Mapped: Read→read, Reading→reading, Unread→unread
Date Added	created_at
Tags	tags
Purchase Date	acquired_at
Purchase Price	cost
Notes	notes
Category	(used for items_book.book_type)	Mapped: hardcover→hardcover, paperback→paperback, ebook→ebook
Uploaded Image URL	cover_image_path	Stored as-is; future: download and localize

Column mapping to items_book:

BookBuddy Column	Target Column
ISBN	isbn
ISSN	issn
Publisher	publisher
Place of Publication	place_of_publication
Date Published	publication_date
Original Date Published	original_publication_date
Number of Pages	page_count
Word Count	word_count
Series	series
Volume	volume
Edition	edition
Category	book_type
DDC	ddc
LCC	lcc
LCCN	lccn
OCLC	oclc
Google VolumeID	google_volume_id
Narrator	narrator
Illustrator	illustrator
Translator	translator
Audio Runtime	audio_runtime
Dimensions	dimensions
Weight	weight
List Price	list_price
Date Started	date_started
Date Finished	date_finished
Current Page	current_page
Recommended By	recommended_by
Activities	activities_json

Data cleaning:

Year Published values of "100" or other non-four-digit values are set to NULL.
Status mapping: BookBuddy uses "Read"/"Reading"/"Unread"; mapped to lowercase enums.
Category "ebook" maps to both catalog_items.status = 'digital_only' and items_book.book_type = 'ebook'.
Author (Last, First) column is ignored; the plain Author column is used for display.
canonical_id is assigned as ARC-BOK- + zero-padded catalog_items.id after insert.
import_source set to 'bookbuddy'.

14.3 `import_movies.py` — MovieBuddy CSV Import

Source: MovieBuddy CSV export (81 columns, 652 rows in the April 11, 2026 export).

The import script reads the CSV once and routes each row to one of three extension tables based on the Content Type column:

Content Type	item_type	Extension Table
Movie	video	items_video
Movie Collection	video	items_video (content_type = 'movie_collection')
TV Show	tv_show	items_tv_show

Column mapping to catalog_items:

MovieBuddy Column	Target Column	Notes
Title	title
Original Title	subtitle	Used as subtitle when different from Title
Directors / TV Creators	creators_json	`[{name, role: "director"}]` or `[{name, role: "creator"}]`
Release Year / First Air Year	year
Genres	genre	Verbatim comma-separated
Summary	summary
Languages	language	First language in comma-separated list
Rating	rating	Already 0.0–10.0 scale
Physical Location	physical_location
Status	status	Mapped: Watched→watched, Not Watched→not_watched
Date Added	created_at
Tags	tags
Purchase Date	acquired_at
Purchase Price	cost
Notes	notes
Uploaded Image URL	cover_image_path

Column mapping to items_video (movies):

MovieBuddy Column	Target Column
Content Type	content_type
Runtime	runtime_minutes
Release Date	release_date
Original Release Year	original_release_year
Directors	directors
Writers	writers
Composers	composers
Cast	cast_json (parsed from MovieBuddy JSON)
Production Companies	production_companies
Production Countries	production_countries
Producers	producers
Cinematographers	cinematographers
Editors	editors
Format	media_format
Film Rating	film_rating
Sound	sound
Aspect Ratio	aspect_ratio
Number Of Discs	number_of_discs
Color	color
Packaging	packaging
Budget	budget
Edition	edition
Home Page	home_page
TMDB ID	tmdb_id
IMDB ID	imdb_id
UPC-EAN13	upc_ean13
Trailer YouTube ID	trailer_youtube_id
Collection Type	collection_type
Date Finished	date_finished
Activities	activities_json

Column mapping to items_tv_show (TV shows):

Shares most video columns plus: Number Of Seasons → number_of_seasons, First Air Year/Date → first_air_year/first_air_date, Last Air Year/Date → last_air_year/last_air_date, TV Creators → tv_creators, TV Networks → tv_networks, TV Episodes → tv_episodes_json, TV Season → tv_season.

Data cleaning:

Cast JSON is already well-formed [{actor, character}] from MovieBuddy; parsed directly.
Format field contains comma-separated values like "Blu-ray Format,DVD Format,Apple m4v Format"; the " Format" suffix is stripped during import.
TMDB ID is the primary dedup key for idempotent re-import.
canonical_id assigned as ARC-VID- or ARC-TVS- + zero-padded id after insert.
import_source set to 'moviebuddy'.

14.4 Import Execution Order

Create librarian_db and all tables (Section 8).
Run import_books.py against BookBuddy CSV. Result: ~1,311 catalog_items + items_book rows.
Run import_movies.py against MovieBuddy CSV. Result: ~624 catalog_items + items_video rows, ~28 catalog_items + items_tv_show rows.
Run batch_reingest.py against the document corpus. Result: ~5,299 catalog_items + items_document rows (minus near-duplicates).
Verify counts, spot-check records, resolve any manual_review items.

Document History

Version	Date	Changes
0.1	2026-04-11	Initial draft: document ingestion pipeline only
0.2	2026-04-11	Extensible catalog model: unified supertable with four item types (document, book, video, tv_show), item_assets table, catalog_tags table. New database librarian_db supersedes Archive_DB and macroscope.documents. CSV import pipeline for BookBuddy (1,311 books) and MovieBuddy (652 movies/TV). Hypatia AI persona and recommendation engine formalized. FULLTEXT strategy expanded to three indices. Nine new open questions. Updated abstract, objectives, scope, component inventory, archive layout, and UI integration sections.

End of Specification

Cite This Document

Michael P. Hamilton, Ph.D. (2026). "Macroscope Collaboratory Librarian: Unified Catalog and Document Ingestion Pipeline Specification." Canemah Nature Laboratory Specification CNL-SP-2026-050. https://canemah.org/archive/CNL-SP-2026-050

BibTeX

@manual{hamilton2026macroscope, author = {Hamilton, Michael P., Ph.D.}, title = {Macroscope Collaboratory Librarian: Unified Catalog and Document Ingestion Pipeline Specification}, institution = {Canemah Nature Laboratory}, year = {2026}, number = {CNL-SP-2026-050}, month = {april}, url = {https://canemah.org/archive/document.php?id=CNL-SP-2026-050}, abstract = {The Macroscope Collaboratory Librarian is the fourth panel of the I3 shell (Instruments, Investigations, Intelligence, Librarian) and serves as the research surface for a unified personal holdings catalog and scientific document archive. This specification describes an extensible catalog model that supports four item types — documents, books, videos, and TV shows — unified under a single `catalog\_items` supertable with per-type extension tables and a shared asset layer, plus the worker-agent pipeline that ingests PDFs from a raw filesystem into the catalog. The initial corpus comprises approximately 5,300 legacy PDFs (post-deduplication from 8,515 files), 1,311 books cataloged in BookBuddy, and 652 movies and TV shows cataloged in MovieBuddy, totaling approximately 7,264 items at launch. The document ingestion pipeline comprises three stages: a filesystem preprocessing layer that flattens, deduplicates on SHA-256, and rename-normalizes incoming PDFs; a two-worker ingestion stage in which `read\_document\_basic` converts PDF to Markdown using PyMuPDF and pdfplumber while `classify\_document` runs a three-tier escalation ladder to generate bibliographic metadata; and a persistence stage that writes normalized records to the catalog with content-hash deduplication and archive-layout file placement. The classification ladder escalates from deterministic regex (Tier 1, zero cost) through Ollama gemma4:31b-cloud (Tier 2, local-daemon cloud inference) to Anthropic Claude Haiku with optional Sonnet escalation (Tier 3, paid API) before flagging unresolved documents for manual review (Tier 4). A dedicated `ingestion\_batch\_jobs` table tracks per-document state to provide crash resumability and a progress surface for the Librarian UI. Book and video records are imported from BookBuddy and MovieBuddy CSV exports via a parallel CSV import pipeline. The catalog model is designed to accommodate future item types (audio recordings, photographs, field specimens) without schema changes to the core supertable. This specification defines the catalog data model, worker contracts, the v2.0 metadata schema, database schema, escalation policy, cost envelope, CSV import pipeline, and the downstream interface to the Librarian web panel. --- **Keywords:** unified catalog; document ingestion pipeline; extensible schema; pdf text extraction; llm classification; metadata generation; three-tier escalation; ollama gemma; anthropic claude; content-hash deduplication; worker agent architecture; macroscope collaboratory; librarian ui; csv import; bookbuddy; moviebuddy} }

Permanent URL: https://canemah.org/archive/document.php?id=CNL-SP-2026-050

Macroscope Collaboratory Librarian: Unified Catalog and Document Ingestion Pipeline Specification

Abstract

1. Introduction

1.1 Context

1.2 Objectives

1.3 Scope

1.4 Non-Goals

2. System Architecture

2.1 High-Level Pipeline

2.2 Component Inventory

2.3 Worker Framework Context

3. Preprocessing Pipeline

3.1 flatten_pdfs.py

3.2 dedup_pdfs.py

3.3 rename_pdfs.py

3.4 Manifest Contract

4. Worker Agent Architecture

4.1 Worker Framework Overview

4.2 Job Contract

4.3 Two-Worker Split Rationale

5. Document Extraction (read_document_basic)

5.1 Extraction Engine

5.2 Payload Contract

5.3 Output Artifacts

5.4 Limitations

6. Three-Tier Classification Ladder (classify_document)

6.1 Escalation Policy

6.2 Tier 1: Deterministic Regex

6.3 Tier 2: Ollama gemma4:31b-cloud

6.4 Tier 3: Anthropic Claude Haiku / Sonnet

6.5 Tier 4: Manual Review Flag

6.6 Metadata Merge Semantics

6.7 Normalized-Text Deduplication

7. Metadata Schema v2.0

7.1 source

7.2 bibliography

7.3 classification

7.4 extraction

7.5 visuals

7.6 content_hashes

8. Extensible Catalog Model and Database Schema

8.1 Design Rationale

8.2 Database: librarian_db

8.3 Canonical Identifiers

8.4 catalog_items Table (Supertable)

8.5 items_document Extension Table

8.6 items_book Extension Table

8.7 items_video Extension Table

8.8 items_tv_show Extension Table

8.9 item_assets Table

8.10 document_aliases Table

8.11 ingestion_batch_jobs Table

8.12 catalog_tags Table

8.13 FULLTEXT Index Strategy

9. Archive Filesystem Layout

10. Batch Reingest Workflow

11. Librarian UI Integration

11.1 Unified Browse and Search

11.2 Item Detail View

11.3 Pipeline Progress Dashboard

11.4 Manual Review Queue

11.5 Hypatia: AI Librarian Persona

11.6 Recommendation Engine

11.7 Statistics Dashboard

12. Open Questions

13. References

14. CSV Import Pipeline

14.1 Overview

14.2 import_books.py — BookBuddy CSV Import

14.3 import_movies.py — MovieBuddy CSV Import

14.4 Import Execution Order

Document History

Cite This Document

BibTeX

5. Document Extraction (`read_document_basic`)

6. Three-Tier Classification Ladder (`classify_document`)

8.2 Database: `librarian_db`

8.4 `catalog_items` Table (Supertable)

8.5 `items_document` Extension Table

8.6 `items_book` Extension Table

8.7 `items_video` Extension Table

8.8 `items_tv_show` Extension Table

8.9 `item_assets` Table

8.10 `document_aliases` Table

8.11 `ingestion_batch_jobs` Table

8.12 `catalog_tags` Table

14.2 `import_books.py` — BookBuddy CSV Import

14.3 `import_movies.py` — MovieBuddy CSV Import