Macroscope Collaboratory Librarian: Unified Catalog and Document Ingestion Pipeline Specification
Macroscope Collaboratory Librarian: Unified Catalog and Document Ingestion Pipeline Specification
Document ID: CNL-SP-2026-050
Version: 0.2
Date: April 11, 2026
Author: Michael P. Hamilton, Ph.D.
Status: Draft for review
Visibility: Public
AI Assistance Disclosure: This specification was developed with assistance from Anthropic Claude (Opus 4.6) in Cowork sessions on April 11, 2026. The AI contributed to architecture formalization, worker contract documentation, schema design, CSV column mapping, and manuscript drafting. The author takes full responsibility for the content, accuracy, and conclusions.
Abstract
The Macroscope Collaboratory Librarian is the fourth panel of the I3 shell (Instruments, Investigations, Intelligence, Librarian) and serves as the research surface for a unified personal holdings catalog and scientific document archive. This specification describes an extensible catalog model that supports four item types — documents, books, videos, and TV shows — unified under a single catalog_items supertable with per-type extension tables and a shared asset layer, plus the worker-agent pipeline that ingests PDFs from a raw filesystem into the catalog. The initial corpus comprises approximately 5,300 legacy PDFs (post-deduplication from 8,515 files), 1,311 books cataloged in BookBuddy, and 652 movies and TV shows cataloged in MovieBuddy, totaling approximately 7,264 items at launch. The document ingestion pipeline comprises three stages: a filesystem preprocessing layer that flattens, deduplicates on SHA-256, and rename-normalizes incoming PDFs; a two-worker ingestion stage in which read_document_basic converts PDF to Markdown using PyMuPDF and pdfplumber while classify_document runs a three-tier escalation ladder to generate bibliographic metadata; and a persistence stage that writes normalized records to the catalog with content-hash deduplication and archive-layout file placement. The classification ladder escalates from deterministic regex (Tier 1, zero cost) through Ollama gemma4:31b-cloud (Tier 2, local-daemon cloud inference) to Anthropic Claude Haiku with optional Sonnet escalation (Tier 3, paid API) before flagging unresolved documents for manual review (Tier 4). A dedicated ingestion_batch_jobs table tracks per-document state to provide crash resumability and a progress surface for the Librarian UI. Book and video records are imported from BookBuddy and MovieBuddy CSV exports via a parallel CSV import pipeline. The catalog model is designed to accommodate future item types (audio recordings, photographs, field specimens) without schema changes to the core supertable. This specification defines the catalog data model, worker contracts, the v2.0 metadata schema, database schema, escalation policy, cost envelope, CSV import pipeline, and the downstream interface to the Librarian web panel.
Keywords: unified catalog; document ingestion pipeline; extensible schema; pdf text extraction; llm classification; metadata generation; three-tier escalation; ollama gemma; anthropic claude; content-hash deduplication; worker agent architecture; macroscope collaboratory; librarian ui; csv import; bookbuddy; moviebuddy
1. Introduction
1.1 Context
The Macroscope Collaboratory is a four-panel research application hosted inside the Claude Desktop app via a dedicated Model Context Protocol (MCP) server. The application is organized around the I3 shell: Instruments (live sensor dashboards across EARTH, LIFE, HOME, SELF domains), Investigations (structured human-AI research workflows), Intelligence (the STRATA micro-agent layer and narrative synthesis), and Librarian (the unified holdings catalog and research-grade reading surface).
The Librarian panel is the newest of the four and subsumes both a document corpus accumulated over approximately two decades across research projects at the James San Jacinto Mountains Reserve, Blue Oak Ranch Reserve, the Center for Embedded Networked Sensing (CENS), the Canemah Nature Laboratory, and personal research activity; a personal book library cataloged in BookBuddy; and a movie and TV show collection cataloged in MovieBuddy. The document corpus includes peer-reviewed journal articles, conference papers, arXiv preprints, government reports, agency technical documents, field manuals, and internal working notes. As of April 2026 the document archive contains 5,299 unique PDFs totaling 32.27 GB after dedup, the book catalog contains 1,311 titles across 1,000 unique authors, and the movie/TV catalog contains 652 titles with rich metadata including TMDB and IMDB identifiers.
The Librarian builds on a prior project, "The Librarian" (Hypatia), developed circa 2024–2025 within the MacroNexus/MEO ecosystem. Hypatia was a personal-use book catalog and AI-assisted search interface using Ollama gemma3:4b and a "Hypatia" librarian AI persona. Key design precedents from that project — the ARC-XXX-NNNNNN canonical identifier scheme, the pdf_documents table schema, three FULLTEXT indices, natural-language intent-to-query translation, per-item chat with session-based memory, and the REGEXP/LIKE fallback search layer — are inherited and formalized in this specification.
1.2 Objectives
This specification defines the extensible catalog model and the end-to-end document ingestion pipeline that transforms raw PDFs, BookBuddy CSVs, and MovieBuddy CSVs into structured, searchable records in librarian_db and the unified archive filesystem layout. Specifically, the system must:
- Provide a single unified catalog (
catalog_items) that accommodates documents, books, videos, TV shows, and future item types without schema changes to the core table. - Preserve per-type specialized metadata in extension tables (
items_document,items_book,items_video,items_tv_show) while sharing universal ownership and acquisition fields. - Track file assets (PDFs, ebooks, video files, cover images) in a shared
item_assetstable linked to catalog items. - Normalize document filenames to ASCII-safe identifiers while preserving original names as recoverable data.
- Deduplicate document content byte-for-byte via SHA-256 before any expensive processing.
- Convert each PDF to a Markdown representation with figure extraction.
- Generate bibliographic metadata using the cheapest reliable method available per document via a three-tier escalation ladder.
- Persist all document artifacts under a stable archive path keyed by canonical identifier.
- Import book records from BookBuddy CSV exports with full field mapping.
- Import movie and TV show records from MovieBuddy CSV exports with full field mapping including structured cast JSON.
- Track batch-reprocessing state in a database table that supports resumability after crashes and provides a real-time progress view to the Librarian UI.
- Reuse the same worker contracts for single-document streaming ingestion as the archive grows post-migration.
1.3 Scope
This specification covers the extensible catalog data model (Section 8), the preprocessing tools (Section 3), the worker-agent ingestion stage (Sections 4–6), the metadata schema (Section 7), the archive filesystem layout (Section 9), the batch reingest workflow (Section 10), the Librarian UI integration (Section 11), the CSV import pipeline (Section 14), and open questions (Section 12). It does not cover the Librarian UI implementation itself, the STRATA Intelligence integration, or the Claude Desktop MCP server glue — each of which is documented in separate notes referenced in Section 13.
1.4 Non-Goals
The following are explicitly out of scope for this specification:
- OCR of scanned or image-only PDFs. The current pipeline assumes extractable text. A dedicated tier for OCR-required documents is deferred to a future extension.
- Multi-language support. The pipeline targets English-language scientific documents. Book and movie records in other languages are stored as-is but not processed for full-text search.
- Full-text search backend selection beyond MySQL FULLTEXT. The three-index strategy (narrow metadata, full body, combined) established by the original Librarian is adopted; external search engines are a downstream decision.
- Access control beyond the existing Collaboratory three-tier auth (anonymous, user, admin).
- Streaming video playback or audio playback infrastructure.
- Automated metadata enrichment from external APIs (Google Books, TMDB) beyond initial CSV import. Manual enrichment is supported via the Librarian UI.
2. System Architecture
2.1 High-Level Pipeline
Legacy archive / CSV exports Preprocessing / Import Worker ingestion Persistence Librarian UI
(nested PDFs in Archive/, → (flatten, dedup, rename → (read_document_basic, → (librarian_db, → (search,
BookBuddy CSV, for PDFs; CSV parse classify_document Archive/ filters,
MovieBuddy CSV) for books/movies) for documents only) documents/) viewer,
Hypatia chat)
Each stage produces durable artifacts that survive crashes in the next stage: preprocessing writes JSONL manifests, CSV import writes per-batch import logs, worker ingestion writes per-document Markdown and metadata files and database rows, and persistence writes to the unified archive layout. The pipeline is reentrant at every boundary.
2.2 Component Inventory
| Component | Location | Role |
|---|---|---|
flatten_pdfs.py |
collectors/batch_reingest/ |
Walk nested source tree, flatten PDFs into Input/, write flatten manifest |
dedup_pdfs.py |
collectors/batch_reingest/ |
SHA-256 exact-duplicate pass, write dedup manifest |
rename_pdfs.py |
collectors/batch_reingest/ |
Assign sequential ASCII-safe filenames, write rename manifest |
read_document_basic |
workers/read_document_basic/ |
PDF-to-Markdown conversion, figure extraction |
classify_document |
workers/classify_document/ |
Three-tier LLM classification and metadata generation |
batch_reingest.py |
collectors/batch_reingest/ |
Orchestrate the two workers across the full document corpus |
import_books.py |
collectors/batch_reingest/ |
BookBuddy CSV import to catalog_items + items_book |
import_movies.py |
collectors/batch_reingest/ |
MovieBuddy CSV import to catalog_items + items_video + items_tv_show |
librarian_db |
MySQL | Unified catalog database |
Archive/documents/ |
Filesystem | Unified per-document artifact layout |
| Librarian UI | Collaboratory/librarian/ |
Search, filter, pipeline dashboard, Hypatia chat, manual review queue |
2.3 Worker Framework Context
The two ingestion workers execute inside the Macroscope worker framework, a filesystem job queue at ~/Macroscope/Projects/Live/workers/ managed by launchd. Jobs are dispatched by an MCP tool (dispatch_worker_job) that writes a JSON payload to the worker inbox, and the worker writes its result as {job}.result.json on completion. The queue provides isolation between the MCP layer and long-running ingestion work, allows streaming jobs to be triggered from Claude Desktop conversations without blocking the UI, and gives launchd a uniform supervision target.
3. Preprocessing Pipeline
The preprocessing tools are filesystem-only utilities that prepare a clean, stable Input directory before any worker-based ingestion begins. Each tool is invoked manually, is dry-run by default, accepts --execute to commit changes, and writes a JSONL manifest that the downstream batch reingest tool consumes. All operations are reversible; nothing is deleted.
3.1 flatten_pdfs.py
Walks ~/Macroscope/Archive/Documents/PDF/ recursively, moves every .pdf file (case-insensitive) to ~/Macroscope/Archive/Documents/Input/ as a flat directory, and writes flatten_manifest.jsonl recording every move. Filename collisions are resolved by appending _1, _2, etc. to the stem. Non-PDF files are left in place. The original nested directory is preserved until the operator manually removes it.
Empirical result on the April 11, 2026 run: 8,515 PDFs flattened, 788 name collisions auto-suffixed, 50.97 GB total.
3.2 dedup_pdfs.py
Hashes every PDF in Input/ with SHA-256 (1 MB streaming chunks), groups files by hash, selects a canonical member per group (shortest filename with lexicographic tiebreak), and moves non-canonical aliases to ~/Macroscope/Archive/Documents/Input_duplicates/. Writes dedup_manifest.jsonl in Input/ with one record per unique document containing the canonical path, file size, alias count, and full list of alias filenames. Only catches byte-for-byte duplicates; near-duplicate detection via normalized-text SHA-256 is performed later during worker ingestion (see Section 6.7).
Empirical result: 5,298 unique documents, 3,217 aliases moved, 18.70 GB reclaimed.
3.3 rename_pdfs.py
Assigns sequential ASCII-safe filenames of the form input_NNNNN.pdf (five-digit zero-padded, lowercase) to every PDF in Input/ in alphabetic order and writes rename_manifest.jsonl in Input/ with one record per file containing sequence, new_name, original_name, and size_bytes. Uses ensure_ascii=False to preserve original filenames with Unicode punctuation, emoji, embedded newlines, or leading whitespace. Refuses to run if the manifest already exists or if any existing filename matches the target pattern, enforcing single-shot idempotency.
Empirical result: 5,299 files renamed (includes one file manually restored from Input_duplicates).
3.4 Manifest Contract
The three manifests together form a complete audit trail for every byte in the Input directory. The batch reingest tool reads all three at startup and constructs the mapping from sequential input filename to original filename, alias list, and pre-flatten source path. The original filename becomes the value of catalog_items.original_filename and is preserved verbatim regardless of what characters it contained.
4. Worker Agent Architecture
4.1 Worker Framework Overview
Each worker is a Python module at workers/<worker_name>/worker.py with a standard entry point that takes a single argument: the path to a JSON job file in the worker inbox. The worker reads the job, performs its task, and writes a sibling .result.json file. On success the job file is moved to done/; on failure it is moved to failed/ with a .error.json sibling containing stderr and traceback. The worker framework supports concurrent execution via launchd agent cloning but is run single-threaded for ingestion to avoid rate-limit contention on external APIs.
4.2 Job Contract
Every worker job carries the following minimum envelope:
{
"job_id": "uuid-v4",
"job_type": "read_document_basic",
"payload": { "...": "..." },
"meta": {
"caller": "batch_reingest",
"batch_job_id": 4271,
"created_at": "2026-04-11T18:00:00Z"
}
}
The payload structure is worker-specific. The meta.batch_job_id references the row in ingestion_batch_jobs so the orchestrator can update per-document state on job completion.
4.3 Two-Worker Split Rationale
Document extraction (PDF to Markdown, figure extraction, page counting) and document classification (bibliographic metadata generation) are split into separate workers for three reasons:
- Cost isolation. Extraction is local-CPU work with no external API cost. Classification may involve Anthropic API calls charged per token. Splitting the workers lets the orchestrator retry classification without repaying extraction.
- Resumability. If classification fails midway through the batch, the orchestrator can re-run
classify_documentagainst the already-extracted Markdown without re-reading the source PDF. - Model substitution. A future worker variant (e.g.,
read_document_ocrfor image-only PDFs) can be slotted in upstream of classification without changes to the classifier.
The read_document_olmocr worker, which provided heavier LLM-based extraction for scanned and image-only PDFs, is retired in this specification and its source is preserved under Projects/Reference/ for historical consultation.
5. Document Extraction (read_document_basic)
5.1 Extraction Engine
The worker wraps EnhancedPDFExtractor located at ~/Macroscope/Projects/Live/collectors/scripts/enhanced_pdf_extractor.py with the use_llm=False flag. The extractor uses PyMuPDF (fitz) as the primary engine for text and layout, falling back to pdfplumber for table detection when PyMuPDF's layout heuristics indicate tabular content. No LLM is invoked at this stage; all classification is delegated to the downstream classify_document worker.
5.2 Payload Contract
{
"pdf_path": "/Users/.../Input/input_00042.pdf",
"output_markdown_path": "/tmp/.../input_00042.md",
"output_metadata_path": "/tmp/.../input_00042.meta.json",
"output_figures_dir": "/tmp/.../input_00042.figures/",
"workspace_dir": "/tmp/.../workspace/"
}
5.3 Output Artifacts
- Markdown file — the extracted text, with headings inferred from font-size hierarchy, embedded image placeholders (
), and table blocks in pipe syntax where detected. - Metadata JSON — shallow extraction-stage metadata:
num_pages,num_figures,num_tables,page_size,producer,pdf_version,has_toc,has_embedded_text, plus any bibliographic hints discovered in the PDF's own/Infodictionary (title, author, subject, keywords). - Figures directory — one file per extracted figure, named
figure_NN.png(or original format where preserved).
5.4 Limitations
The basic extractor does not perform OCR, will return empty Markdown for image-only PDFs, and does not attempt to reconstruct column order in complex multi-column layouts. Documents flagged as "no extractable text" by PyMuPDF are written to ingestion_batch_jobs with state = 'extraction_failed' and classification_status = 'manual_review', bypassing the classifier entirely.
6. Three-Tier Classification Ladder (classify_document)
6.1 Escalation Policy
The classify_document worker attempts bibliographic metadata generation through a sequence of methods ordered by cost and reliability, advancing to the next tier only when the current tier returns insufficient confidence or fails entirely. The policy is deterministic: given the same input Markdown, a document takes the same path through the ladder on every run.
Confidence is expressed as an enumeration (low, medium, high) and the worker advances if the current tier returns low. Each tier has the opportunity to supplement fields already filled by earlier tiers rather than replacing them wholesale (see Section 6.6).
6.2 Tier 1: Deterministic Regex
A pure-Python module inspects the first 4 kilobytes of extracted Markdown and the original filename for deterministic bibliographic patterns: arXiv identifiers (arXiv:YYMM.NNNNN), DOI prefixes (10.NNNN/...), standard journal-article headers (title on line N, authors on line N+1, affiliations indented), USGS and NOAA report numbers, and a small set of common preprint server URLs. When a pattern is found, the module resolves the bibliographic record by local lookup (no network call) and returns a metadata dict with extraction_confidence = 'high' and extraction_method = 'regex'.
Approximately 40–60% of well-formatted post-2005 journal articles resolve at Tier 1 with zero LLM cost. The tier is always attempted first; it is free and fast.
6.3 Tier 2: Ollama gemma4:31b-cloud
When Tier 1 returns low confidence, the worker sends the first 8,000 characters of extracted Markdown (plus the filename and any Tier 1 partial metadata) to a local Ollama daemon running on localhost:11434/api/chat with model = "gemma4:31b-cloud". The model is served by Ollama Turbo as a cloud-hosted inference endpoint; authentication is handled transparently by the local daemon using a device key obtained via ollama signin, so no Bearer token appears in the Python code path.
Request parameters: format = "json", num_ctx = 8192, num_predict = 1024, keep_alive = "30m". Transport is urllib HTTPS POST with a 60-second timeout. The system prompt instructs the model to return a strict JSON object matching the v2.0 metadata schema, with empty strings for fields it cannot determine.
Measured latency (April 11, 2026 validation): approximately 0.7 seconds for trivial inputs, 1–3 seconds for typical journal articles. Cost is zero at the point of call, absorbed by the Ollama Turbo subscription already paid for at the infrastructure level.
The returned metadata is merged with Tier 1 output, the extraction method is recorded as llm with llm_model = "gemma4:31b-cloud", and confidence is set from the model's own self-assessment clamped to the valid enumeration range.
6.4 Tier 3: Anthropic Claude Haiku / Sonnet
When Tier 2 returns low confidence or produces malformed JSON, the worker invokes the Anthropic API with Claude Haiku 4.5 as the default model. The same first 8,000 characters of Markdown are sent with a strict system prompt, and the response is parsed as JSON. If Haiku also returns low confidence or the document exceeds an empirically-determined complexity threshold (length, number of authors, presence of non-English fragments), the worker escalates a second time to Claude Sonnet 4.6 for a final attempt.
API credentials are loaded from an environment variable at worker start; the key never appears in job payloads or result files. Token usage is captured from the response and recorded in ingestion_batch_jobs for cost accounting.
Cost envelope (estimate, April 2026 pricing): Haiku at approximately $0.001 per document, Sonnet at approximately $0.01 per document. For a 5,300-document backfill with an expected 10% Tier 3 rate and 20% of those escalating to Sonnet, total Tier 3 cost is on the order of $10–$15 for the full batch.
6.5 Tier 4: Manual Review Flag
If Tier 3 Sonnet also fails to return adequate metadata, the worker sets classification_status = 'manual_review' on the catalog_items row and writes a human-readable explanation to classification_notes describing why each tier was insufficient. The document is still written to the archive with partial metadata (whatever Tier 1 and Tier 2 produced), and the Librarian UI surfaces it in an admin queue for human curation.
6.6 Metadata Merge Semantics
Because each tier may run even after a partial Tier 1 resolution, the worker uses additive merge semantics: later-tier values supplement but do not overwrite earlier-tier values unless the earlier value was an empty string. The exception is extraction_confidence, which uses a max-rank semantics (low < medium < high) so that any high-confidence field elevates the document's overall confidence even if other fields remain uncertain. The merge logic is implemented in classify_document/worker.py::merge_metadata().
6.7 Normalized-Text Deduplication
After extraction but before tier dispatch, the worker computes sha256_text, a SHA-256 over the normalized Markdown (lowercase, whitespace collapsed, headings stripped). If this hash matches an existing row in items_document, the worker short-circuits: no classification is performed, and the new ingestion is recorded as an alias of the existing document in document_aliases. This catches near-duplicates that differ only in PDF rendering but carry identical text content — a common case with preprint-to-published transitions and multiple journal reprints.
7. Metadata Schema v2.0
The v2.0 metadata schema is the canonical format for the per-document metadata.json file stored alongside each PDF in the archive. It is structured as six top-level sections, each populated by a specific pipeline stage.
7.1 source
Populated by the preprocessing and extraction stages. Fields: original_filename, source_type (one of legacy_archive, streaming_upload, api_import), source_path (pre-flatten filesystem path where the file was first discovered), source_url (empty for legacy archive, set for API-imported documents), ingested_by (operator identifier), ingested_at (ISO timestamp), archive_path (target path under Archive/documents/).
7.2 bibliography
Populated by the classification stage. Fields: title (string), authors (ordered array of strings), doi (nullable string), journal (nullable string), publication_date (ISO date, nullable), abstract (string), keywords (array of strings), document_type (enum: journal_article, conference_paper, preprint, book, book_chapter, report, thesis, working_paper, field_note, other).
7.3 classification
Populated by the classification stage. Fields: extraction_method (regex | llm | hybrid | manual), extraction_confidence (low | medium | high), llm_model (nullable string naming the model that produced the final metadata), classification_status (pending | complete | manual_review), classification_notes (free text), tier1_attempted, tier2_attempted, tier3_attempted (booleans), token_usage_input, token_usage_output (integers, for Tier 3 only).
7.4 extraction
Populated by read_document_basic. Fields: num_pages, num_figures, num_tables, num_references, page_size, producer, pdf_version, has_toc, has_embedded_text, extraction_duration_ms.
7.5 visuals
Populated by read_document_basic. Fields: figures_path (relative path to figures directory under archive), visuals_sidecar_path (path to visuals.json sidecar listing each figure with its filename, page number, and placeholder caption). Detailed per-figure metadata (captions, bounding boxes, OCR text) is deferred to a future pipeline extension.
7.6 content_hashes
Populated by read_document_basic and stored alongside the catalog item row. Fields: sha256_pdf (hash of source bytes, used for exact-duplicate detection), sha256_text (hash of normalized Markdown, used for near-duplicate detection), size_bytes.
8. Extensible Catalog Model and Database Schema
8.1 Design Rationale
The original Librarian (Hypatia, circa 2024–2025) maintained separate tables for books (My_Books, 23 columns) and PDF documents (pdf_documents, 20 columns) with no shared identity layer. The new Librarian unifies all holdings under a single supertable (catalog_items) that carries the universal fields shared by every item type, plus per-type extension tables that carry specialized metadata, plus a shared asset table for files attached to items. This design supports the four launch item types (document, book, video, tv_show) and accommodates future types (audio recordings, photographs, field specimens) by adding a new extension table and a new item_type enum value without altering the core supertable.
8.2 Database: librarian_db
All catalog tables reside in a new MySQL database named librarian_db, created on both Data and Galatea. This database supersedes both the old Archive_DB (Hypatia) and the macroscope.documents legacy table. Existing data in those tables is migrated by the CSV import pipeline (Section 14) and the batch reingest workflow (Section 10) respectively; the legacy tables are not dropped but are no longer written to.
8.3 Canonical Identifiers
Every catalog item receives a canonical identifier derived from its auto-increment id and item_type:
| Item Type | Prefix | Format | Example |
|---|---|---|---|
| document | ARC-DOC | ARC-DOC-NNNNNN | ARC-DOC-000042 |
| book | ARC-BOK | ARC-BOK-NNNNNN | ARC-BOK-001311 |
| video | ARC-VID | ARC-VID-NNNNNN | ARC-VID-000614 |
| tv_show | ARC-TVS | ARC-TVS-NNNNNN | ARC-TVS-000028 |
The prefix convention inherits from Hypatia's ARC-PDF-NNNNNN scheme (renamed to ARC-DOC to be format-neutral, since future documents may be EPUB, HTML, or plain text). Hypatia's other prefixes (ARC-TAG, ARC-DIG, ARC-QOU) are available for reuse if the tag and quotes subsystems are reintroduced.
The canonical_id is computed as CONCAT(prefix, '-', LPAD(id, 6, '0')) and stored on the row. It is the stable external reference used in archive paths, UI URLs, and cross-references.
8.4 catalog_items Table (Supertable)
CREATE TABLE catalog_items (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
canonical_id VARCHAR(20) NOT NULL,
item_type ENUM('document','book','video','tv_show') NOT NULL,
title VARCHAR(512) NOT NULL,
subtitle VARCHAR(512) NULL,
creators_json JSON NULL, -- [{name, role}] — authors, directors, editors, etc.
year SMALLINT NULL,
genre VARCHAR(256) NULL, -- comma-separated, matching Buddy app convention
summary TEXT NULL,
keywords TEXT NULL,
language VARCHAR(64) NULL,
original_language VARCHAR(64) NULL,
cover_image_path TEXT NULL,
physical_location VARCHAR(256) NULL,
status VARCHAR(32) NOT NULL DEFAULT 'cataloged',
-- cataloged, on_shelf, loaned, lost, wishlist, digital_only,
-- watched, unread, read, reading, not_watched
rating DECIMAL(3,1) NULL, -- 0.0–10.0
acquired_at DATE NULL,
cost DECIMAL(10,2) NULL,
notes TEXT NULL,
tags TEXT NULL, -- comma-separated
original_filename TEXT NULL, -- preserved verbatim from source
import_source VARCHAR(64) NULL, -- 'bookbuddy', 'moviebuddy', 'ingestion_pipeline', 'manual'
import_batch_id VARCHAR(64) NULL, -- links back to the import run
classification_status ENUM('pending','complete','manual_review') NOT NULL DEFAULT 'pending',
classification_notes TEXT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
edited_by VARCHAR(64) NULL,
UNIQUE KEY uk_canonical_id (canonical_id),
KEY idx_item_type (item_type),
KEY idx_year (year),
KEY idx_classification_status (classification_status),
FULLTEXT KEY ft_catalog_search (title, keywords, summary, genre, notes)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
8.5 items_document Extension Table
CREATE TABLE items_document (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
catalog_item_id BIGINT NOT NULL, -- FK to catalog_items.id
doi VARCHAR(256) NULL,
journal_name VARCHAR(256) NULL,
volume VARCHAR(32) NULL,
issue VARCHAR(32) NULL,
pages VARCHAR(64) NULL,
publication_date DATE NULL,
citation_text TEXT NULL,
source_url TEXT NULL,
source_type VARCHAR(32) NULL, -- legacy_archive, streaming_upload, api_import
source_path TEXT NULL, -- pre-flatten filesystem path
document_type ENUM('journal_article','conference_paper','preprint','book',
'book_chapter','report','thesis','working_paper',
'field_note','other') NOT NULL DEFAULT 'other',
extracted_text LONGTEXT NULL,
disciplines_json JSON NULL,
extraction_method VARCHAR(64) NULL, -- regex, llm, hybrid, manual
extraction_confidence ENUM('low','medium','high') NULL,
llm_model VARCHAR(64) NULL,
tier_reached TINYINT NULL, -- 1..4
sha256_pdf CHAR(64) NULL,
sha256_text CHAR(64) NULL,
size_bytes BIGINT NULL,
num_pages INT NULL,
num_figures INT NULL,
num_tables INT NULL,
archive_path TEXT NULL, -- relative path under Archive/documents/
metadata_path TEXT NULL,
schema_version SMALLINT NOT NULL DEFAULT 2,
FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
UNIQUE KEY uk_catalog_item (catalog_item_id),
UNIQUE KEY uk_sha256_pdf (sha256_pdf),
KEY idx_sha256_text (sha256_text),
KEY idx_document_type (document_type),
KEY idx_doi (doi),
FULLTEXT KEY ft_extracted_text (extracted_text),
FULLTEXT KEY ft_doc_metadata (citation_text, disciplines_json)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
8.6 items_book Extension Table
CREATE TABLE items_book (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
catalog_item_id BIGINT NOT NULL, -- FK to catalog_items.id
isbn VARCHAR(20) NULL,
issn VARCHAR(16) NULL,
publisher VARCHAR(256) NULL,
place_of_publication VARCHAR(256) NULL,
publication_date DATE NULL,
original_publication_date DATE NULL,
page_count INT NULL,
word_count INT NULL,
series VARCHAR(256) NULL,
volume VARCHAR(32) NULL,
edition VARCHAR(64) NULL,
book_type ENUM('hardcover','paperback','ebook','audiobook','pdf') NOT NULL DEFAULT 'paperback',
ddc VARCHAR(32) NULL, -- Dewey Decimal
lcc VARCHAR(32) NULL, -- Library of Congress
lccn VARCHAR(32) NULL, -- LC Control Number
oclc VARCHAR(32) NULL, -- OCLC number
google_volume_id VARCHAR(32) NULL,
narrator VARCHAR(256) NULL, -- for audiobooks
illustrator VARCHAR(256) NULL,
translator VARCHAR(256) NULL,
audio_runtime VARCHAR(32) NULL, -- for audiobooks
dimensions VARCHAR(64) NULL,
weight VARCHAR(32) NULL,
list_price DECIMAL(10,2) NULL,
reading_status ENUM('unread','reading','read','abandoned') NULL,
date_started DATE NULL,
date_finished DATE NULL,
current_page INT NULL,
recommended_by VARCHAR(256) NULL,
activities_json JSON NULL, -- BookBuddy reading activity log
FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
UNIQUE KEY uk_catalog_item (catalog_item_id),
KEY idx_isbn (isbn),
KEY idx_publisher (publisher),
KEY idx_book_type (book_type),
KEY idx_series (series)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
8.7 items_video Extension Table
CREATE TABLE items_video (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
catalog_item_id BIGINT NOT NULL, -- FK to catalog_items.id
content_type ENUM('movie','movie_collection') NOT NULL DEFAULT 'movie',
runtime_minutes INT NULL,
release_date DATE NULL,
original_release_year SMALLINT NULL,
directors VARCHAR(512) NULL, -- denormalized for indexing
writers TEXT NULL,
composers VARCHAR(512) NULL,
cast_json JSON NULL, -- [{actor, character}]
production_companies TEXT NULL,
production_countries VARCHAR(256) NULL,
producers TEXT NULL,
cinematographers TEXT NULL,
editors TEXT NULL,
media_format VARCHAR(128) NULL, -- comma-separated: DVD, Blu-ray, Apple m4v, etc.
film_rating VARCHAR(8) NULL, -- PG-13, R, NR, G, PG, NC-17
sound VARCHAR(64) NULL,
aspect_ratio VARCHAR(32) NULL,
number_of_discs TINYINT NULL,
color VARCHAR(16) NULL,
packaging VARCHAR(64) NULL,
budget VARCHAR(64) NULL,
edition VARCHAR(128) NULL,
home_page TEXT NULL,
tmdb_id INT NULL,
imdb_id VARCHAR(16) NULL,
upc_ean13 VARCHAR(20) NULL,
trailer_youtube_id VARCHAR(16) NULL,
collection_type VARCHAR(64) NULL, -- for movie collections
watch_status ENUM('watched','not_watched') NULL,
date_finished DATE NULL,
activities_json JSON NULL, -- MovieBuddy watch activity log
FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
UNIQUE KEY uk_catalog_item (catalog_item_id),
KEY idx_tmdb_id (tmdb_id),
KEY idx_imdb_id (imdb_id),
KEY idx_content_type (content_type),
KEY idx_film_rating (film_rating),
KEY idx_directors (directors)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
8.8 items_tv_show Extension Table
CREATE TABLE items_tv_show (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
catalog_item_id BIGINT NOT NULL, -- FK to catalog_items.id
number_of_seasons INT NULL,
first_air_year SMALLINT NULL,
first_air_date DATE NULL,
last_air_year SMALLINT NULL,
last_air_date DATE NULL,
tv_creators TEXT NULL,
tv_networks VARCHAR(256) NULL,
tv_episodes_json JSON NULL, -- full episode structure from MovieBuddy
tv_season VARCHAR(64) NULL,
directors VARCHAR(512) NULL,
writers TEXT NULL,
composers VARCHAR(512) NULL,
cast_json JSON NULL, -- [{actor, character}]
production_companies TEXT NULL,
production_countries VARCHAR(256) NULL,
media_format VARCHAR(128) NULL,
film_rating VARCHAR(8) NULL,
runtime_minutes INT NULL, -- per-episode average
tmdb_id INT NULL,
imdb_id VARCHAR(16) NULL,
upc_ean13 VARCHAR(20) NULL,
trailer_youtube_id VARCHAR(16) NULL,
home_page TEXT NULL,
watch_status ENUM('watched','not_watched') NULL,
date_finished DATE NULL,
activities_json JSON NULL,
FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
UNIQUE KEY uk_catalog_item (catalog_item_id),
KEY idx_tmdb_id (tmdb_id),
KEY idx_imdb_id (imdb_id),
KEY idx_first_air_year (first_air_year)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
8.9 item_assets Table
CREATE TABLE item_assets (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
catalog_item_id BIGINT NOT NULL, -- FK to catalog_items.id
asset_type ENUM('pdf','epub','mp4','mkv','m4v','avi',
'cover','thumbnail','poster','figure') NOT NULL,
file_path TEXT NOT NULL,
file_name VARCHAR(512) NOT NULL,
file_size BIGINT NULL,
sha256 CHAR(64) NULL,
mime_type VARCHAR(128) NULL,
extracted TINYINT NOT NULL DEFAULT 0, -- 1 if text extraction has been run
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
KEY idx_catalog_item (catalog_item_id),
KEY idx_asset_type (asset_type),
KEY idx_sha256 (sha256)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
8.10 document_aliases Table
CREATE TABLE document_aliases (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
catalog_item_id BIGINT NOT NULL, -- FK to catalog_items.id (must be item_type='document')
alias_filename TEXT NOT NULL,
alias_source_path TEXT NULL,
discovered_at DATETIME NOT NULL,
discovery_method VARCHAR(32) NOT NULL, -- 'sha256_pdf' or 'sha256_text'
FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
KEY idx_catalog_item (catalog_item_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
8.11 ingestion_batch_jobs Table
CREATE TABLE ingestion_batch_jobs (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
input_filename VARCHAR(32) NOT NULL, -- input_NNNNN.pdf
original_filename TEXT NOT NULL,
sha256_pdf CHAR(64) NOT NULL,
size_bytes BIGINT NOT NULL,
state ENUM('pending','extracting','extracted',
'classifying','classified','written',
'failed','skipped_duplicate') NOT NULL,
catalog_item_id BIGINT NULL, -- set when state='written'
attempt_count INT NOT NULL DEFAULT 0,
last_error TEXT NULL,
tier_reached TINYINT NULL, -- 1..4
extract_started_at DATETIME NULL,
extract_ended_at DATETIME NULL,
classify_started_at DATETIME NULL,
classify_ended_at DATETIME NULL,
written_at DATETIME NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
KEY idx_state (state),
KEY idx_sha256_pdf (sha256_pdf)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
8.12 catalog_tags Table
CREATE TABLE catalog_tags (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
tag VARCHAR(128) NOT NULL,
catalog_item_id BIGINT NOT NULL,
FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
UNIQUE KEY uk_tag_item (tag, catalog_item_id),
KEY idx_tag (tag)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
8.13 FULLTEXT Index Strategy
The FULLTEXT index strategy inherits from the original Librarian's three-index approach and extends it for the unified catalog:
ft_catalog_searchoncatalog_items(title, keywords, summary, genre, notes)— the narrow metadata index shared by all item types. This is the primary search target for the Librarian's main search box.ft_extracted_textonitems_document(extracted_text)— the deep body index for document content only. Used when the user explicitly requests full-text search.ft_doc_metadataonitems_document(citation_text, disciplines_json)— supplemental metadata index for scholarly search.
Short-keyword and quoted-phrase search uses the REGEXP/LIKE fallback layer from the original Librarian's functions.php, bypassing FULLTEXT's innodb_ft_min_token_size minimum (default 3 characters). This layer is implemented in the Librarian UI's PHP search module.
9. Archive Filesystem Layout
Per-document artifacts are stored under a unified archive layout keyed by canonical_id:
~/Macroscope/Archive/documents/
<year>/ # derived from publication_date, or 'undated/'
ARC-DOC-NNNNNN/
source.pdf # original bytes, post-rename
extracted.md # output of read_document_basic
metadata.json # v2.0 metadata (six sections)
figures/
figure_01.png
figure_02.png
...
visuals.json # sidecar with per-figure metadata
thumbnail.png # first-page thumbnail (optional)
Book and video items do not require archive directories unless they have associated file assets (ebooks, ripped video files). When file assets exist, they are stored at paths recorded in item_assets.file_path. Cover images downloaded during CSV import are stored at the path in catalog_items.cover_image_path.
Sibling buckets under ~/Macroscope/Archive/ are reserved for future expansion:
~/Macroscope/Archive/
documents/ # PDF documents (active, this spec)
books/ # ebook files (future, when ebook ingestion is built)
video/ # ripped video files (future)
audio/ # audio recordings (future)
images/ # photographs (future)
10. Batch Reingest Workflow
The batch reingest tool batch_reingest.py orchestrates the full document corpus ingestion. On startup it reads the three preprocessing manifests, populates ingestion_batch_jobs with one row per input_NNNNN.pdf file in state pending, and then enters a dispatch loop.
For each pending row, the orchestrator:
- Transitions state to
extracting, dispatches aread_document_basicjob, waits for the result. - Checks
sha256_textagainst existingitems_documentrows. If match, transitions state toskipped_duplicate, records the alias indocument_aliases, and continues. - Transitions state to
classifying, dispatches aclassify_documentjob, waits for the result. - Assembles the full v2.0 metadata record by merging extraction output and classification output.
- Inserts a
catalog_itemsrow withitem_type = 'document', derivescanonical_id = 'ARC-DOC-' + LPAD(id, 6, '0'). - Inserts the corresponding
items_documentrow with all classification and extraction fields. - Moves the source PDF, extracted.md, metadata.json, figures, and visuals sidecar into the archive layout.
- Moves the source PDF in Input/ to
~/Macroscope/Archive/Documents/Input_ingested/. - Transitions state to
written, setscatalog_item_id, and continues to the next pending row.
On crash or interrupt, the orchestrator can be restarted cleanly: any row in extracting or classifying is reset to pending on the next run, and all intermediate artifacts in the workspace directory are discarded. Rows in written are untouched. This gives full resumability with no risk of double-inserting a document.
The orchestrator logs every tier decision, every cost-bearing API call, and every state transition to ~/Macroscope/Projects/Live/collectors/batch_reingest/logs/batch_reingest.log.
11. Librarian UI Integration
The Librarian UI is a web panel rendered under ~/Macroscope/Projects/Workbench/Collaboratory/librarian/ and served at http://localhost/Projects/Workbench/Collaboratory/librarian/. It reads from librarian_db tables and from the Archive/documents/ filesystem layout. This specification defines only the read contract and UX patterns; the UI implementation is documented separately.
11.1 Unified Browse and Search
The default view lists all catalog_items rows, with tab-based or sidebar filtering by item_type. Within each type, filters include year range, genre, keyword, and type-specific fields (document_type for documents, book_type for books, film_rating for videos, network for TV shows). Full-text search is performed against the ft_catalog_search FULLTEXT index, with the REGEXP/LIKE fallback layer for short keywords and quoted phrases.
A single search box at the top of the page accepts natural language queries via the "Ask the Librarian" (Hypatia) interface described in Section 11.5.
11.2 Item Detail View
Each item has a detail page showing:
- All types: Title, creators, year, genre, summary, keywords, cover image, physical location, status, rating, acquisition info, tags, alias list (for documents).
- Documents: Rendered Markdown content, inline thumbnail, figures gallery, classification provenance (tier, model, confidence), download link to original PDF.
- Books: Publisher, ISBN, series/volume, page count, edition, DDC/LCC, reading progress (current page, dates started/finished), BookBuddy activity log.
- Videos: Director, cast, runtime, film rating, format (DVD/Blu-ray/digital), TMDB/IMDB links, trailer embed (YouTube), watch status, MovieBuddy activity log.
- TV Shows: Creators, networks, season/episode structure, air date range, cast, TMDB/IMDB links.
11.3 Pipeline Progress Dashboard
An admin-only view reads ingestion_batch_jobs and shows the real-time state of the batch reprocessing run: total count, count per state, tier distribution, cumulative Tier 3 token cost, average latency per tier, and a live-updating progress bar. Rows in failed state are clickable and surface the error detail for debugging.
11.4 Manual Review Queue
A second admin-only view lists all catalog items with classification_status = 'manual_review'. Each row is editable: the curator can fill in missing fields directly, promote the item to classification_status = 'complete', or delete the row if unsuitable for the catalog.
11.5 Hypatia: AI Librarian Persona
The Librarian UI inherits and extends the "Hypatia" AI persona from the original Librarian project. Hypatia serves two functions:
-
Natural-language intent-to-query translation. A query like "find me anything about mycorrhizal networks published after 2015" is sent to a small LLM (Ollama gemma4:31b-cloud) with a collection-statistics preamble, and the model returns structured search parameters (keywords, field:value filters, year range) that drive the PHP search engine. This is strictly cheaper and more predictable than letting the LLM perform retrieval itself.
-
Per-item chat. On any item's detail page, the user can open a chat session with Hypatia contextualized to that specific item. For documents, this includes the extracted text; for books, the summary and metadata; for movies, the cast and plot summary. Conversation memory is session-scoped with a rolling 3-exchange window (inherited from the original
ask_llm.phppattern). The LLM model is selectable via a dropdown (gemma4:31b-cloud default, Claude Haiku or Sonnet for deeper analysis).
11.6 Recommendation Engine
The item detail page includes a "Related Items" section using a cascade recommendation strategy inherited from the original Librarian's getBookRecommendations():
- FULLTEXT MATCH on shared keywords (cross-type: a book about sensor networks may recommend a related journal article).
- Same genre within the same item type.
- Same creator within the same item type.
- Same series (books) or same collection (movies).
- Same ingest batch (documents).
The cascade stops when five recommendations are found.
11.7 Statistics Dashboard
A public-facing statistics page shows aggregate counts and distributions: total items by type, unique creators, genre distributions, year histograms, format breakdowns (book_type, media_format), reading/watch status summaries, and collection growth over time. Inherited from the original Librarian's statistics.php and expanded to cover all item types.
12. Open Questions
The following decisions are deferred to a subsequent specification revision and are not blocking the initial batch import runs:
- Visuals storage depth. Whether to create a dedicated
document_visualstable for per-figure metadata or defer to avisuals.jsonsidecar. - Thumbnail generation tool. Candidates include macOS
sips, ImageMagick, and the Pythonpdf2imagelibrary. Selection depends on available dependencies on Data and Galatea. - Pre-existing
PDFs/versuspreprocessed/directory handling. Legacy archive trees may contain already-preprocessed Markdown and metadata files from earlier ingestion attempts; the pipeline must decide whether to trust or reprocess them. - Non-PDF content in the legacy archive. EPUB, HTML, and plain-text documents currently sit alongside PDFs in the archive source tree. A separate reader worker family will be needed.
- OCR tier reintroduction. Whether and when to add an OCR-based extraction path for image-only PDFs.
- Cover image sourcing. Whether to download cover images from Google Books (books) and TMDB (movies/TV) during CSV import, or to defer to manual upload. Storage path convention for covers needs definition.
- Ebook-as-book ingestion. An ebook (item_type='book', book_type='ebook') that has an associated PDF or EPUB file asset should route through the document extraction pipeline for full-text indexing. The trigger mechanism (manual versus automatic on asset upload) is undecided.
- Hypatia model selection. Whether the per-item chat should default to gemma4:31b-cloud or to a smaller model for cost efficiency on high-volume usage.
- Cross-machine sync for librarian_db. Whether to replicate librarian_db from Galatea to Data (like macroscope), from Data to Galatea, or maintain independent instances. The catalog is personal-use only and will likely be managed from Data.
These are tracked as open items and will be resolved in v0.3 of this specification.
13. References
[1] Hamilton, M. P. (2026). "Macroscope Collaboratory: MCP-Hosted Research Application." Canemah Nature Laboratory Technical Note CNL-TN-2026-047.
[2] Hamilton, M. P. (2026). "Canemah Nature Laboratory Technical Note Style Guide." CNL-SG-2025-002 v1.1.
[3] Artifex Software (2024). "PyMuPDF: Python bindings for MuPDF." https://pymupdf.readthedocs.io (accessed April 11, 2026).
[4] Singer-Vine, J. et al. (2024). "pdfplumber: Plumb a PDF for detailed information." https://github.com/jsvine/pdfplumber (accessed April 11, 2026).
[5] Ollama (2025). "Ollama: Run Large Language Models Locally." https://ollama.ai (accessed April 11, 2026).
[6] Gemma Team (2025). "Gemma 3 Technical Report." Google DeepMind. arXiv:2503.19786.
[7] Anthropic (2026). "Claude API Reference." https://docs.claude.com (accessed April 11, 2026).
[8] Hamilton, M. P. (2026). "STRATA 2.0 Intelligence Architecture." Canemah Nature Laboratory Technical Note CNL-TN-2026-043.
[9] Hamilton, M. P. (2024–2025). "The Librarian (Hypatia)." Unpublished internal project. ~/Macroscope/Projects/Reference/MacroNexus/Active/Tools/The-Librarian/.
14. CSV Import Pipeline
14.1 Overview
Book and movie/TV records are imported from CSV exports produced by the BookBuddy and MovieBuddy iOS applications. Both apps belong to the same family and share a common export structure: comma-delimited with a header row, UTF-8 encoding, JSON-encoded arrays for structured fields (cast, episodes, activities), and comma-separated strings for multi-value fields (genres, formats, languages).
The import scripts are standalone Python utilities in collectors/batch_reingest/ that read one CSV file, map columns to the catalog schema, and insert rows into librarian_db. Each script is idempotent: re-running against the same CSV skips rows that already exist (matched on title + creator + year composite key for books, TMDB ID or title + year for videos).
14.2 import_books.py — BookBuddy CSV Import
Source: BookBuddy CSV export (80 columns, 1,311 rows in the April 11, 2026 export).
Column mapping to catalog_items:
| BookBuddy Column | Target Column | Notes |
|---|---|---|
| Title | title | |
| Subtitle | subtitle | |
| Author | creators_json | [{name: Author, role: "author"}]; Illustrator, Narrator, Translator, Photographer, Editor added with respective roles when present |
| Year Published | year | Five rows have garbled value "100"; cleaned to NULL |
| Genre | genre | Verbatim |
| Summary | summary | |
| Language | language | |
| Original Language | original_language | |
| Rating | rating | Scale 0.0–5.0, multiplied by 2 for 0.0–10.0 normalization |
| Physical Location | physical_location | Empty in current export |
| Status | status | Mapped: Read→read, Reading→reading, Unread→unread |
| Date Added | created_at | |
| Tags | tags | |
| Purchase Date | acquired_at | |
| Purchase Price | cost | |
| Notes | notes | |
| Category | (used for items_book.book_type) | Mapped: hardcover→hardcover, paperback→paperback, ebook→ebook |
| Uploaded Image URL | cover_image_path | Stored as-is; future: download and localize |
Column mapping to items_book:
| BookBuddy Column | Target Column |
|---|---|
| ISBN | isbn |
| ISSN | issn |
| Publisher | publisher |
| Place of Publication | place_of_publication |
| Date Published | publication_date |
| Original Date Published | original_publication_date |
| Number of Pages | page_count |
| Word Count | word_count |
| Series | series |
| Volume | volume |
| Edition | edition |
| Category | book_type |
| DDC | ddc |
| LCC | lcc |
| LCCN | lccn |
| OCLC | oclc |
| Google VolumeID | google_volume_id |
| Narrator | narrator |
| Illustrator | illustrator |
| Translator | translator |
| Audio Runtime | audio_runtime |
| Dimensions | dimensions |
| Weight | weight |
| List Price | list_price |
| Date Started | date_started |
| Date Finished | date_finished |
| Current Page | current_page |
| Recommended By | recommended_by |
| Activities | activities_json |
Data cleaning:
- Year Published values of "100" or other non-four-digit values are set to NULL.
- Status mapping: BookBuddy uses "Read"/"Reading"/"Unread"; mapped to lowercase enums.
- Category "ebook" maps to both
catalog_items.status = 'digital_only'anditems_book.book_type = 'ebook'. - Author (Last, First) column is ignored; the plain Author column is used for display.
canonical_idis assigned asARC-BOK-+ zero-paddedcatalog_items.idafter insert.import_sourceset to'bookbuddy'.
14.3 import_movies.py — MovieBuddy CSV Import
Source: MovieBuddy CSV export (81 columns, 652 rows in the April 11, 2026 export).
The import script reads the CSV once and routes each row to one of three extension tables based on the Content Type column:
| Content Type | item_type | Extension Table |
|---|---|---|
| Movie | video | items_video |
| Movie Collection | video | items_video (content_type = 'movie_collection') |
| TV Show | tv_show | items_tv_show |
Column mapping to catalog_items:
| MovieBuddy Column | Target Column | Notes |
|---|---|---|
| Title | title | |
| Original Title | subtitle | Used as subtitle when different from Title |
| Directors / TV Creators | creators_json | [{name, role: "director"}] or [{name, role: "creator"}] |
| Release Year / First Air Year | year | |
| Genres | genre | Verbatim comma-separated |
| Summary | summary | |
| Languages | language | First language in comma-separated list |
| Rating | rating | Already 0.0–10.0 scale |
| Physical Location | physical_location | |
| Status | status | Mapped: Watched→watched, Not Watched→not_watched |
| Date Added | created_at | |
| Tags | tags | |
| Purchase Date | acquired_at | |
| Purchase Price | cost | |
| Notes | notes | |
| Uploaded Image URL | cover_image_path |
Column mapping to items_video (movies):
| MovieBuddy Column | Target Column |
|---|---|
| Content Type | content_type |
| Runtime | runtime_minutes |
| Release Date | release_date |
| Original Release Year | original_release_year |
| Directors | directors |
| Writers | writers |
| Composers | composers |
| Cast | cast_json (parsed from MovieBuddy JSON) |
| Production Companies | production_companies |
| Production Countries | production_countries |
| Producers | producers |
| Cinematographers | cinematographers |
| Editors | editors |
| Format | media_format |
| Film Rating | film_rating |
| Sound | sound |
| Aspect Ratio | aspect_ratio |
| Number Of Discs | number_of_discs |
| Color | color |
| Packaging | packaging |
| Budget | budget |
| Edition | edition |
| Home Page | home_page |
| TMDB ID | tmdb_id |
| IMDB ID | imdb_id |
| UPC-EAN13 | upc_ean13 |
| Trailer YouTube ID | trailer_youtube_id |
| Collection Type | collection_type |
| Date Finished | date_finished |
| Activities | activities_json |
Column mapping to items_tv_show (TV shows):
Shares most video columns plus: Number Of Seasons → number_of_seasons, First Air Year/Date → first_air_year/first_air_date, Last Air Year/Date → last_air_year/last_air_date, TV Creators → tv_creators, TV Networks → tv_networks, TV Episodes → tv_episodes_json, TV Season → tv_season.
Data cleaning:
- Cast JSON is already well-formed
[{actor, character}]from MovieBuddy; parsed directly. - Format field contains comma-separated values like "Blu-ray Format,DVD Format,Apple m4v Format"; the " Format" suffix is stripped during import.
- TMDB ID is the primary dedup key for idempotent re-import.
canonical_idassigned asARC-VID-orARC-TVS-+ zero-padded id after insert.import_sourceset to'moviebuddy'.
14.4 Import Execution Order
- Create
librarian_dband all tables (Section 8). - Run
import_books.pyagainst BookBuddy CSV. Result: ~1,311 catalog_items + items_book rows. - Run
import_movies.pyagainst MovieBuddy CSV. Result: ~624 catalog_items + items_video rows, ~28 catalog_items + items_tv_show rows. - Run
batch_reingest.pyagainst the document corpus. Result: ~5,299 catalog_items + items_document rows (minus near-duplicates). - Verify counts, spot-check records, resolve any manual_review items.
Document History
| Version | Date | Changes |
|---|---|---|
| 0.1 | 2026-04-11 | Initial draft: document ingestion pipeline only |
| 0.2 | 2026-04-11 | Extensible catalog model: unified supertable with four item types (document, book, video, tv_show), item_assets table, catalog_tags table. New database librarian_db supersedes Archive_DB and macroscope.documents. CSV import pipeline for BookBuddy (1,311 books) and MovieBuddy (652 movies/TV). Hypatia AI persona and recommendation engine formalized. FULLTEXT strategy expanded to three indices. Nine new open questions. Updated abstract, objectives, scope, component inventory, archive layout, and UI integration sections. |
End of Specification
Cite This Document
BibTeX
Permanent URL: https://canemah.org/archive/document.php?id=CNL-SP-2026-050