The Collaboratory Librarian
Unified Catalog Architecture, Tag Classification Pipeline, and the Semantic Explorer as Research Instrument
Canemah Nature Laboratory Technical Note Series
Document ID: CNL-TN-2026-051
Version: 0.1
Date: April 13, 2026
Author: Michael P. Hamilton, Ph.D.
Affiliation: Canemah Nature Laboratory, Oregon City, Oregon
AI Assistance Disclosure: This technical note was developed collaboratively with Claude (Anthropic, claude-opus-4-6) via Cowork. Claude contributed to architectural analysis, schema design, pipeline specification, and document drafting. The author takes full responsibility for the content, accuracy, and conclusions.
Abstract
The Macroscope Collaboratory Librarian was instantiated in early April 2026 as a catalog system for Dr. Hamilton's personal library, initially handling 1,311 books imported from BookBuddy. The system uses a polymorphic supertable design (catalog_items with type-specific extension tables) that was always intended to accommodate multiple media types: books, academic documents, videos, TV shows. A parallel system, the Quotes Explorer (quotes_db), had been built separately in late 2025 with a more mature tag architecture — normalized tag definitions, a many-to-many junction table with provenance tracking, AI-assisted bulk and conversational tagging, and a Three.js semantic explorer that visualizes tag co-occurrence as a navigable 3D topology.
This technical note documents the architectural decision to consolidate the Quotes system into the Librarian as a first-class catalog item type, to adopt the Quotes tag architecture as the universal classification backbone for all item types, to build a tiered tag enrichment pipeline modeled on the Collaboratory's existing classify_document worker, and to generalize the semantic explorer from a quotes-only visualization into a catalog-wide research instrument with scope filtering and external source integration.
The core insight is that a personal knowledge collection — books, papers, quotes, films — is not a set of independent inventories but a single intellectual topology. A tag like "mycorrhizal networks" should connect a book on the shelf, a quote from Suzanne Simard, an academic paper in the archive, and a Semantic Scholar result from the broader literature. The unified catalog makes this possible. The tag pipeline makes it accurate. The semantic explorer makes it navigable.
1. Current State and Motivation
1.1 The Librarian Today
The Collaboratory Librarian (Projects/Workbench/Collaboratory/librarian/) stores catalog items in a polymorphic schema: a shared catalog_items supertable (27 columns including title, creators, year, genre, summary, tags, rating, classification status) with type-specific extension tables (items_book, items_document, items_video, items_tv_show). The database is librarian_db.
As of this writing, the catalog holds 1,311 books imported from a BookBuddy CSV export, with an additional 1,285 records enriched via a bulk merge from Archive_DB.My_Books — a legacy database from the original Librarian tool (Projects/Reference/MacroNexus/Active/Tools/The-Librarian/) that had been enriched last fall using Google Books API lookups and Ollama AI keyword generation.
The classification problem is acute. BookBuddy's tags are inaccurate — they appear to be publisher-supplied marketing keywords rather than subject classifications. The summaries are inconsistent, ranging from useful descriptions to promotional copy. The keywords field in catalog_items is empty and unused. The tags field contains the BookBuddy data, stored as comma-separated text with a denormalized catalog_tags index table. There is no tag definition table, no provenance tracking, and no controlled vocabulary.
1.2 The Quotes System
The Quotes Explorer (Projects/Live/Galatea/CNL/projects/Quotes/) is a standalone LAMP application running on its own database (quotes_db). It has a significantly more mature classification architecture:
A normalized tags table stores tag definitions with id, tag_name, slug, and description. A quote_tags junction table links quotes to tags with an assigned_by field tracking provenance: manual (human curator), llm (AI-suggested and approved), or import (from bulk import). This provenance model is essential for understanding the quality and authority of any given tag assignment.
The tagging pipeline has three modes. A bulk AI tagger (admin/bulk_tagger.php) processes multiple quotes efficiently with a terse JSON-only prompt, suitable for fast screening passes. A conversational AI tagger (admin/ai_tagger.php) provides an interactive chat interface for individual quotes, supporting discussion of meaning and connections before tagging. Manual tag management allows direct assignment, creation, and editing through the admin interface.
The semantic explorer (explorer.php + six modular JavaScript files) visualizes tag co-occurrence as a 3D force-directed topology using Three.js. Tags are positioned as nodes; edges represent co-occurrence frequency (how often two tags appear together on the same quote). A multi-level zoom system (Cosmic, Regional, Local, Ground, Focus) adjusts visibility thresholds as the user navigates. The explorer does not impose a hierarchy — clusters emerge naturally from the co-occurrence data, reflecting the actual thematic structure of the collection.
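The co-occurrence computation at the heart of that topology is simple to sketch. The following is illustrative Python (the production code lives in the explorer's PHP backend); it counts how often each pair of tags appears together on the same item, which becomes the edge-weight list the explorer renders:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(item_tag_sets):
    """Count tag-pair co-occurrences across a collection.

    item_tag_sets: an iterable of tag-name sets, one per quote (or,
    in the unified catalog, per catalog item). Returns a Counter
    keyed by sorted tag pairs.
    """
    edges = Counter()
    for tags in item_tag_sets:
        for a, b in combinations(sorted(tags), 2):
            edges[(a, b)] += 1
    return edges

# Two items sharing "ecology" and "fungi" yield an edge of weight 2.
edges = cooccurrence([
    {"ecology", "fungi"},
    {"ecology", "fungi", "soil"},
])
```

Clusters emerge because heavily co-tagged themes accumulate high-weight edges; no hierarchy is ever declared.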
1.3 The Classification Gap Across Item Types
The two systems diverge on classification in ways that prevent unified search and discovery:
Books have flat comma-separated tags (inaccurate, from BookBuddy), DDC/LCC library classification codes (identifiers, not searchable terms), and no discipline field.
Documents have disciplines_json (a structured academic discipline field, FULLTEXT indexed) but use the same flat tags as books.
Videos and TV shows have only the shared flat tags. No discipline field, no classification codes.
Quotes have the richest classification — normalized tags with definitions, provenance, and AI-assisted assignment — but in a completely separate database.
A search for "ecology" would find documents via disciplines_json, might find some books if their tags happened to include the word, and would miss quotes entirely because they are in a different database. This fragmentation defeats the purpose of a personal knowledge catalog.
2. Architectural Decisions
2.1 Quotes as a Catalog Item Type
Quotes will be integrated into the Librarian as item_type = 'quote' with an items_quote extension table. The extension table captures quote-specific fields: the quote text itself, attribution (person, work, page/location), source context, and the conversational tagger's chat history. The existing quotes_db data will be migrated into catalog_items + items_quote via a one-time migration script.
The quotes_db application will remain functional during and after migration, but will transition from the authoritative source to a legacy reference. New quotes will be entered through the Librarian. The existing Quotes Explorer URL (canemah.org/projects/Quotes/explorer.php) will redirect to the catalog-wide semantic explorer with scope=quote pre-set, preserving the existing experience.
2.2 Unified Tag Architecture
The Quotes system's tag architecture becomes the catalog standard, replacing the flat catalog_items.tags field and the simple catalog_tags index table. The new structure:
tags definition table. Each tag has an id, tag_name, slug, description, and created_at. The description field allows curators to define the scope of a tag — distinguishing, for example, "mycorrhizal networks" (the biological phenomenon) from "common mycorrhizal networks" (the specific research concept). The slug enables URL-friendly filtering.
catalog_item_tags junction table. Links any catalog item to any tag with provenance tracking via assigned_by: manual (human curator), llm (AI-suggested and approved), api (from Google Books, Open Library, or Semantic Scholar), or import (from CSV or database migration). A confidence field (0.0–1.0) allows the pipeline to express uncertainty. A created_at timestamp enables temporal analysis of when tags were assigned.
Deprecation of flat fields. The catalog_items.tags text field and the catalog_tags denormalized index table will be retained temporarily for backward compatibility during migration but will no longer be the authoritative source. The catalog_items.keywords field, which was never populated, will be repurposed or dropped.
2.3 Tag Taxonomy: Hybrid Approach
The tag vocabulary will be managed as a hybrid between free-form generation and post-hoc normalization, rather than a rigid controlled vocabulary. The pipeline generates tags from multiple sources (API lookups, LLM analysis, human curation), and a normalization pass merges obvious synonyms and maps to preferred forms.
The tags definition table supports this workflow: when two tags are identified as synonymous (e.g., "AI" and "artificial intelligence"), the less preferred tag's entries are remapped to the preferred tag's ID, and the deprecated tag is either deleted or retained as a soft alias. The description field on each tag serves as scope documentation, helping future tagging decisions stay consistent.
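The remap operation can be sketched in Python. This is a hypothetical in-memory version of the synonym-merge pass; the production version would be a pair of UPDATE/DELETE statements against catalog_item_tags, with the duplicate-drop behavior enforced by the uk_item_tag unique key:

```python
def remap_tag(junction_rows, deprecated_id, preferred_id):
    """Fold a deprecated tag's junction rows into the preferred tag.

    junction_rows: dicts with catalog_item_id and tag_id keys. Rows
    that would duplicate an existing (item, tag) pair are dropped,
    mirroring the uk_item_tag unique constraint.
    """
    existing = {(r["catalog_item_id"], r["tag_id"]) for r in junction_rows}
    merged = []
    for row in junction_rows:
        if row["tag_id"] == deprecated_id:
            key = (row["catalog_item_id"], preferred_id)
            if key in existing:
                continue  # item already carries the preferred tag
            row = {**row, "tag_id": preferred_id}
            existing.add(key)
        merged.append(row)
    return merged
```

An item tagged with both "AI" and "artificial intelligence" ends up with a single junction row after the merge.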
Tags should be specific rather than generic. "Embedded sensor networks" is useful; "technology" is not. "Baseline shift syndrome" is useful; "ecology" is not. Multi-word descriptive phrases are preferred over single-word categories. The target granularity is comparable to Library of Congress Subject Headings or Semantic Scholar's fields of study — specific enough to be meaningful for search and discovery, broad enough to create useful co-occurrence clusters in the explorer.
3. Tag Enrichment Pipeline
3.1 Design Principles
The enrichment pipeline follows the same design principles as the classify_document worker: cost-optimized tier escalation, confidence-gated progression, additive field filling (never overwrite), and full provenance tracking. Each tier's output is recorded in catalog_items.classification_notes as JSON, enabling audit and debugging.
3.2 Four-Tier Escalation
Tier 1: External API lookup (free/cheap, structured).
For books: Google Books API (ISBN-first, with title+author fallback) returns categories (BISAC-based subject headings) and description. Open Library API (openlibrary.org/isbn/{isbn}.json, then follow the works key) returns community-curated subjects arrays, dewey_decimal_class, and lc_classifications. These structured sources provide the broad subject anchors.
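The two-step Open Library lookup can be sketched as follows. The URL construction and field names (works, subjects, dewey_decimal_class, lc_classifications) follow the Open Library API; the extraction function is shown against parsed JSON so the HTTP layer is out of scope:

```python
from urllib.parse import quote

OL_BASE = "https://openlibrary.org"

def edition_url(isbn):
    """URL for the edition record: openlibrary.org/isbn/{isbn}.json."""
    return f"{OL_BASE}/isbn/{quote(isbn)}.json"

def work_url(edition):
    """Follow the edition's works key to the work record, where the
    community-curated subjects live."""
    works = edition.get("works") or []
    return f"{OL_BASE}{works[0]['key']}.json" if works else None

def extract_classification(edition, work):
    """Pull the fields tier 1 cares about from the two JSON records."""
    return {
        "subjects": [s.lower() for s in work.get("subjects", [])],
        "dewey": edition.get("dewey_decimal_class", []),
        "lcc": edition.get("lc_classifications", []),
    }
```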
For documents: Semantic Scholar API returns fieldsOfStudy and tldr summaries. CrossRef returns subject classifications via DOI lookup.
For videos/TV shows: TMDB or OMDB APIs return genre classifications and keywords.
For quotes: No API tier — quotes already have tags from the existing system, or they proceed directly to tier 2.
The API results are mapped to tags using a normalization function that converts BISAC categories ("Science / Earth Sciences / Geology") into tag-friendly forms ("earth sciences," "geology"), splits compound subjects, and lowercases. Tags created by this tier are marked assigned_by = 'api'.
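A minimal sketch of that normalization function, in Python. The GENERIC stoplist is an assumption for illustration; the real list would be tuned during the batch enrichment review:

```python
# Generic top-level terms to drop -- illustrative, not the final list.
GENERIC = {"general", "science", "nature", "fiction", "nonfiction", "technology"}

def normalize_subject(category):
    """Split a BISAC-style path on '/', lowercase each part, and drop
    generic terms: 'Science / Earth Sciences / Geology' becomes
    ['earth sciences', 'geology']."""
    parts = [p.strip().lower() for p in category.split("/")]
    return [p for p in parts if p and p not in GENERIC]
```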
Tier 2: Ollama gemma4:31b-cloud (flat-rate, generative).
Feed the item's available metadata (title, creators, summary/description, genre, existing tags from tier 1) to gemma4:31b-cloud with a type-aware prompt. The prompt varies by item_type:
- Non-fiction books: "Given this book's title, author, and description, generate 5–10 subject tags. Focus on disciplines, methodologies, key concepts, and geographic or temporal specificity. Each tag should be 1–3 words. Do not include generic terms like 'science' or 'nature.' Return as a JSON array."
- Fiction books: "Given this novel's title, author, and description, generate 5–10 tags covering themes, settings, narrative techniques, and literary movements. Each tag should be 1–3 words. Return as a JSON array."
- Documents: "Given this paper's title, authors, abstract, and keywords, generate 5–10 subject tags at the granularity of Library of Congress Subject Headings. Return as a JSON array."
- Videos/TV shows: "Given this title, description, and genre, generate 5–10 tags covering themes, subjects, and notable aspects. Return as a JSON array."
Tags from this tier are marked assigned_by = 'llm' with the model name in classification_notes. Confidence is set based on the number of sources that agree: a tag generated by the LLM that also appeared in the API results gets higher confidence than one the LLM produced alone.
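The agreement-based confidence rule might look like this; the two score values are illustrative defaults, not specified by the pipeline:

```python
def llm_tag_confidence(llm_tags, api_tags, corroborated=0.90, solo=0.60):
    """Score each LLM-generated tag: higher when tier 1's API results
    independently produced the same tag, lower when the LLM is alone.
    The 0.90/0.60 values are placeholders for tuning."""
    api = {t.lower() for t in api_tags}
    return {t: (corroborated if t.lower() in api else solo) for t in llm_tags}
```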
Tier 3: Anthropic Claude Haiku 4.5 (paid, fallback).
Only invoked when tier 1 + tier 2 produce fewer than 3 tags, or when the item has no summary/description for tier 2 to work from. Same prompt structure as tier 2 but with Claude's stronger reasoning for ambiguous cases. Expected to fire on less than 5% of items.
Tier 4: Manual review flag.
If all automated tiers produce fewer than 2 tags or the item is flagged as ambiguous, classification_status is set to manual_review. The admin interface highlights these items for human attention.
3.3 Worker Implementation
The pipeline is implemented as a classify_tags worker in the Macroscope worker framework (Projects/Live/workers/workers/classify_tags/worker.py). The worker accepts a payload specifying:
- catalog_item_id — the item to enrich
- item_type — book, document, video, tv_show, quote
- force — boolean, whether to overwrite existing tags (default false)
- tiers — optional array to limit which tiers run (e.g., ["api", "llm"])
The worker reads the item's current metadata from librarian_db, runs the tier escalation, writes new tags to the tags and catalog_item_tags tables, updates classification_status and classification_notes, and returns a result summary. The worker is idempotent: running it twice on the same item with force=false skips tags that already exist.
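The idempotency guarantee can be demonstrated against an in-memory SQLite stand-in for the MySQL junction table, where INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE against the uk_item_tag unique key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE catalog_item_tags (
  catalog_item_id INTEGER NOT NULL,
  tag_id INTEGER NOT NULL,
  assigned_by TEXT NOT NULL DEFAULT 'manual',
  UNIQUE (catalog_item_id, tag_id)
);
""")

def write_tags(conn, item_id, tag_ids, assigned_by, force=False):
    """Write tag assignments; re-running with force=False is a no-op
    for rows that already exist."""
    if force:
        conn.execute(
            "DELETE FROM catalog_item_tags WHERE catalog_item_id = ?",
            (item_id,),
        )
    for tid in tag_ids:
        conn.execute(
            "INSERT OR IGNORE INTO catalog_item_tags "
            "(catalog_item_id, tag_id, assigned_by) VALUES (?, ?, ?)",
            (item_id, tid, assigned_by),
        )

write_tags(conn, 1, [10, 11], "llm")
write_tags(conn, 1, [10, 11], "llm")  # second run skips existing rows
count = conn.execute("SELECT COUNT(*) FROM catalog_item_tags").fetchone()[0]
```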
A batch orchestration script (PHP or Python) iterates over catalog items matching a filter (e.g., all books with classification_status = 'pending') and enqueues one classify_tags job per item. The worker framework's single-concurrency dispatcher processes them sequentially, respecting API rate limits naturally.
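A sketch of that orchestration loop, assuming a hypothetical enqueue callable standing in for the worker framework's job-submission API:

```python
def enqueue_pending(rows, enqueue):
    """Enqueue one classify_tags job per pending catalog item.

    rows: (catalog_item_id, item_type) pairs, e.g. from a query on
    classification_status = 'pending'.
    enqueue: hypothetical job-submission callable of the worker
    framework, taking (worker_name, payload).
    """
    jobs = 0
    for item_id, item_type in rows:
        enqueue("classify_tags", {
            "catalog_item_id": item_id,
            "item_type": item_type,
            "force": False,
        })
        jobs += 1
    return jobs
```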
3.4 Summary Normalization
Alongside tag enrichment, the pipeline includes an optional summary normalization pass. Items whose summaries are identified as marketing copy (heuristics: exclamation marks, superlatives, second-person address) are flagged for AI rewrite. The LLM prompt requests a 3–5 sentence catalog-tone summary: neutral, informative, describing the work's content and significance without promotional language. This follows the pattern established by the original Librarian's AI summary normalization.
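The flagging heuristics might be sketched as follows; the superlative list and the threshold of 2 are illustrative assumptions, not the final tuning:

```python
import re

# Illustrative superlative list -- the real one would grow during review.
SUPERLATIVES = re.compile(
    r"\b(best[- ]?selling|groundbreaking|must[- ]read|unforgettable|stunning)\b",
    re.IGNORECASE,
)
SECOND_PERSON = re.compile(r"\byou(r|'ll|'re)?\b", re.IGNORECASE)

def looks_like_marketing(summary):
    """Flag summaries that read like promotional copy rather than a
    neutral catalog description."""
    score = summary.count("!")
    score += len(SUPERLATIVES.findall(summary))
    score += len(SECOND_PERSON.findall(summary))
    return score >= 2  # threshold is a placeholder
```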
4. The Semantic Explorer as Research Instrument
4.1 Decoupling from Quotes
The semantic explorer currently lives at Projects/Live/Galatea/CNL/projects/Quotes/ and queries quotes_db exclusively. To serve the full catalog, the explorer's backend (api/graph.php) and frontend (six JavaScript modules) will be relocated to a shared location within the Collaboratory (Projects/Workbench/Collaboratory/librarian/explorer/) and rewritten to query the unified tag architecture in librarian_db.
The core algorithm is item-type agnostic. It builds a topology from two inputs: a set of tags with item counts, and a set of edges representing co-occurrence (how often two tags appear together on the same item). The graph endpoint computes these from the catalog_item_tags junction table. The JavaScript visualization code — force-directed layout, zoom levels, tendril system, camera behavior — requires no changes to the rendering logic.
4.2 Scope Filtering
The graph endpoint accepts a scope parameter that filters the co-occurrence computation by item type:
- scope=all — Entire catalog topology. Tags from every item type, co-occurrences computed across the full collection. This is the most revealing view because it exposes cross-type connections: "land ethic" linking Leopold's books, quotes from his writing, and academic papers about his influence.
- scope=book — Books only. The thematic structure of the personal library.
- scope=quote — Equivalent to the existing Quotes Explorer experience.
- scope=document — Academic papers and their subject relationships.
- scope=video or scope=tv_show — Media topology.
- Comma-separated combinations (e.g., scope=book,document) allow custom cross-type views.
The filter is a WHERE clause on the junction table join:
SELECT t1.tag_name, t2.tag_name, COUNT(*) as weight
FROM catalog_item_tags ct1
JOIN catalog_item_tags ct2
ON ct1.catalog_item_id = ct2.catalog_item_id
AND ct1.tag_id < ct2.tag_id
JOIN tags t1 ON ct1.tag_id = t1.id
JOIN tags t2 ON ct2.tag_id = t2.id
JOIN catalog_items ci ON ct1.catalog_item_id = ci.id
WHERE ci.item_type IN (/* scope filter */)
GROUP BY ct1.tag_id, ct2.tag_id
The UI presents scope as a row of toggle buttons in the explorer's control panel, alongside the existing zoom level controls and search.
4.3 The Explorer as Research Instrument
The most significant extension is integrating external search results into the explorer's topology. The external_papers table already exists in the librarian_db schema, designed to hold references that are not in the physical collection but are part of a research context. This table becomes the bridge between the personal catalog and the broader literature.
The research workflow operates as follows:
1. Pose a question. The investigator enters a research question or topic into the explorer's search interface.
2. Search local catalog. The system searches catalog_items via FULLTEXT on titles, tags, and summaries, returning items from the personal collection that are relevant.
3. Search external sources. Simultaneously, a search_semantic_scholar worker (or the existing web_search worker extended with S2 API support) queries Semantic Scholar for relevant papers. S2 returns structured metadata including title, authors, year, abstract, fieldsOfStudy, citationCount, tldr, and externalIds.
4. Tag and insert. External results are inserted into external_papers and run through the tag enrichment pipeline (tier 1 uses S2's own fieldsOfStudy; tier 2 generates additional tags from the abstract via gemma4:31b-cloud). Tags are written to the same tags and catalog_item_tags tables, with assigned_by = 'api' for S2 fields and assigned_by = 'llm' for generated tags.
5. Render merged topology. The explorer renders local and external items in the same graph, distinguished by visual treatment (color, shape, or opacity). The investigator can see where their existing knowledge is dense, where it is thin, and where the external literature fills gaps or opens new directions.
6. Promote to catalog. If an external paper is sufficiently interesting, the investigator can promote it to a full catalog item (downloading the PDF, running it through the document extraction pipeline, creating a proper catalog_items + items_document record). This is a one-click operation that moves the reference from external_papers into the permanent collection.
This workflow connects directly to the SWC investigation framework. The "Priors" phase of an investigation (step 3 in the ten-step model from CNL-TN-2026-047 v0.3) is precisely this operation: surveying existing knowledge and identifying gaps. The explorer becomes the visual interface for that phase, with the investigation context (site, time window, domain) serving as implicit scope filters.
5. Schema Changes
5.1 New Tables in librarian_db
-- Normalized tag definitions (replaces flat catalog_tags)
CREATE TABLE tags (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
tag_name VARCHAR(128) NOT NULL,
slug VARCHAR(128) NOT NULL,
description TEXT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
UNIQUE KEY uk_tag_name (tag_name),
UNIQUE KEY uk_slug (slug)
) ENGINE=InnoDB;
-- Universal junction table with provenance
CREATE TABLE catalog_item_tags (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
catalog_item_id BIGINT UNSIGNED NOT NULL,
tag_id BIGINT UNSIGNED NOT NULL,
assigned_by ENUM('manual','llm','api','import') NOT NULL DEFAULT 'manual',
confidence DECIMAL(3,2) NULL COMMENT '0.00-1.00, NULL for manual',
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
UNIQUE KEY uk_item_tag (catalog_item_id, tag_id),
KEY idx_tag_id (tag_id),
KEY idx_assigned_by (assigned_by),
FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
FOREIGN KEY (tag_id) REFERENCES tags(id) ON DELETE CASCADE
) ENGINE=InnoDB;
-- Quote extension table
CREATE TABLE items_quote (
catalog_item_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
quote_text TEXT NOT NULL,
attribution VARCHAR(512) NULL COMMENT 'Person to whom the quote is attributed',
source_work VARCHAR(512) NULL COMMENT 'Book, speech, article, etc.',
source_detail VARCHAR(512) NULL COMMENT 'Page, chapter, timestamp, URL',
context TEXT NULL COMMENT 'Surrounding context or circumstances',
language VARCHAR(16) NULL DEFAULT 'en',
verified BOOLEAN NOT NULL DEFAULT FALSE COMMENT 'Quote verified against source',
chat_history_json LONGTEXT NULL COMMENT 'AI tagger conversation history',
FOREIGN KEY (catalog_item_id) REFERENCES catalog_items(id) ON DELETE CASCADE,
FULLTEXT KEY ft_quote (quote_text, attribution, source_work)
) ENGINE=InnoDB;
5.2 Migration Path
The migration from quotes_db to librarian_db proceeds in four steps:
Step 1: Tag migration. Copy all rows from quotes_db.tags into librarian_db.tags. Preserve tag_name, slug, description. Record the old-to-new ID mapping for junction table migration.
Step 2: Quote migration. For each quote in quotes_db.quotes, create a catalog_items row (item_type = 'quote', mapping title from attribution + first words, creators from attribution, year from source date if available) and an items_quote row (quote_text, attribution, source_work, source_detail, context, verified status). Record the old quote ID to new catalog_item_id mapping.
Step 3: Junction table migration. For each row in quotes_db.quote_tags, create a corresponding catalog_item_tags row using the ID mappings from steps 1 and 2. Preserve the assigned_by provenance.
Step 4: Existing book tag migration. Parse the comma-separated catalog_items.tags field for all existing books. For each unique tag string, find or create a row in librarian_db.tags. Create catalog_item_tags junction rows with assigned_by = 'import'. This preserves the BookBuddy tags (which will later be replaced by the enrichment pipeline) while establishing the new architecture.
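The flat-field parsing in step 4 can be sketched as a small normalization helper; lowercasing and deduplication are assumptions about the desired canonical form:

```python
def parse_flat_tags(tags_field):
    """Split a comma-separated tags field (BookBuddy style) into clean,
    deduplicated tag names, preserving first-seen order."""
    seen, out = set(), []
    for raw in (tags_field or "").split(","):
        name = raw.strip().lower()
        if name and name not in seen:
            seen.add(name)
            out.append(name)
    return out
```

Each returned name is then looked up (or created) in librarian_db.tags before the junction row is written.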
5.3 Tables Retained
The items_document.disciplines_json field is retained as a supplementary classification layer specific to academic documents. It serves a different purpose than tags — it captures the formal disciplinary classification of a paper (e.g., "Computer Science > Networking") rather than topical keywords. The enrichment pipeline will generate tags from disciplines_json content where it exists, bridging the two systems.
The items_book fields ddc and lcc are retained as formal library classification identifiers. The enrichment pipeline's tier 1 (Open Library API) will attempt to fill these where they are empty.
6. Implementation Sequence
The work is sequenced into phases, each a testable increment, following the pattern established in CNL-TN-2026-047 v0.3.
Phase 1 — Schema deployment. Execute the SQL from Section 5.1 in phpMyAdmin on Data. Add items_quote extension table. Create tags and catalog_item_tags tables. Verify foreign keys and indexes.
Phase 2 — Tag migration. Run migration script to move existing book tags from the flat catalog_items.tags field into the new tags + catalog_item_tags architecture. Run quotes migration from quotes_db. Verify counts and provenance marking.
Phase 3 — classify_tags worker. Build the worker at Projects/Live/workers/workers/classify_tags/worker.py with the four-tier escalation. Start with tier 1 (Google Books + Open Library for books) and tier 2 (gemma4:31b-cloud). Test on a sample of 20 books before batch processing.
Phase 4 — Batch enrichment. Run classify_tags across all 1,311 books. Review results. Identify normalization opportunities (synonym merging, preferred forms). Run manual review pass on flagged items.
Phase 5 — Admin UI updates. Extend the Librarian admin edit form to use the new tag architecture: tag autocomplete from the tags table, provenance display, bulk tag management. Add the tabbed edit form for books (Basic Info, Details, Library Info, Notes & Media) to expose all items_book fields.
Phase 6 — Explorer relocation. Move the semantic explorer from CNL/projects/Quotes/ to Collaboratory/librarian/explorer/. Rewrite api/graph.php to query librarian_db with scope filtering. Set up redirect from the old Quotes Explorer URL.
Phase 7 — External source integration. Extend the explorer with Semantic Scholar search. Build or extend a worker for S2 API queries. Implement the merged topology view with local/external visual distinction. Connect to the SWC investigation "Priors" workflow.
7. Relationship to Other Documents
| Document | Relationship |
|---|---|
| CNL-TN-2026-042 | STRATA/MNG Convergence Plan — the Librarian is a convergence consumer; its tag taxonomy may feed MNG's category model |
| CNL-TN-2026-043 | STRATA 2.0 Architecture — the Librarian implements the Lab Bench catalog layer |
| CNL-TN-2026-047 v0.3 | Collaboratory Architecture — the Librarian uses the same worker framework and MCP substrate |
| CNL-FN-2025-015 | Quotes Explorer Conceptual Framework — the "House of Mind" metaphor extends to the full catalog |
Table 1. Related documents in the CNL technical note series.
8. Risks and Open Questions
Tag explosion. A catalog of 1,300+ books, hundreds of quotes, and growing document and video collections could generate thousands of distinct tags. Without normalization discipline, the tag space becomes noisy and the explorer topology becomes an undifferentiated mesh. Mitigation: regular synonym-merging passes, minimum co-occurrence thresholds in the explorer, and editorial review of LLM-generated tags before acceptance.
API rate limits. Google Books and Open Library have rate limits that will constrain batch enrichment speed. The worker framework's sequential processing provides natural throttling, but a full 1,311-book enrichment pass may take several hours. This is acceptable for a one-time operation.
Quote identity in the catalog. Quotes are fundamentally different from other catalog items: they are short, they are fragments of other works, and their "title" is often the first few words of the quote itself. The catalog_items supertable assumes every item has a meaningful title and creator. The migration will need to construct these fields from quote metadata in a way that is useful for search and display without being misleading.
Explorer performance at scale. The Three.js explorer was built for the Quotes collection (hundreds of items, dozens of tags). A full-catalog topology with thousands of items and potentially hundreds of tags will require performance optimization: lazy edge computation, level-of-detail rendering, and possibly server-side pre-computation of the graph layout for large scopes.
Semantic Scholar integration scope. S2's API returns papers, not books or media. The external search integration is initially limited to the academic literature. Extending it to book discovery (via Google Books or Open Library search) and media discovery (via TMDB) would require additional worker types and normalization logic. This is deferred to a future phase.
Document History
| Version | Date | Changes |
|---|---|---|
| 0.1 | 2026-04-13 | Initial draft. Unified catalog architecture, tag classification pipeline, semantic explorer generalization, quotes migration path, schema changes, implementation sequence. |
Cite This Document
Permanent URL: https://canemah.org/archive/document.php?id=CNL-TN-2026-051