CNL-TN-2026-032 Technical Note

YEA Place Catalog: Batch Import and AI Enrichment Infrastructure

Document ID: CNL-TN-2026-032
Version: 1.0
Date: March 6, 2026
Author: Michael P. Hamilton, Ph.D.

AI Assistance Disclosure: This technical note was developed through working dialogue with Claude (Anthropic, Opus 4). Claude contributed to the design and implementation of the scripts documented here and drafted this note from session transcripts. The author takes full responsibility for the content.


Abstract

The Your Ecological Address (YEA) platform at yea.earth maintains a curated catalog of ecologically significant places worldwide — biological field stations, nature reserves, bird observatories, wildlife refuges, and similar sites. As of March 2026, the catalog contains 1,142 published places across six continents, affiliated with 23 organizations. This note documents the batch infrastructure that builds and enriches this catalog: the import pipeline that ingests structured data from research network registries, and the AI enrichment pipeline that produces research-grade narrative descriptions for each place using a tiered model architecture. Together these scripts run on a two-machine LAN configuration (Data for computation, Galatea for production MySQL) and have processed the entire catalog at near-zero marginal cost.


1. System Architecture

1.1 Hardware

Machine | Role                                              | Hardware                      | Location
--------|---------------------------------------------------|-------------------------------|------------------------
Data    | Development, batch processing, local AI inference | MacBook Pro M4 Max, 128GB RAM | Development workstation
Galatea | Production server, MySQL database, Apache         | Mac Mini M4 Pro, 1Gb fiber    | galatea.local on LAN

All batch scripts run on Data and write to Galatea's MySQL over LAN. This keeps heavy AI inference off the production server while maintaining direct database access.

1.2 Database

  • Server: MySQL 8.4 on Galatea
  • Binary path: /usr/local/mysql-8.4.5-macos15-arm64/bin/mysql
  • Database: ecological_address
  • Remote access: Granted to root@192.168.6.% for LAN connections from Data
  • Primary tables:
    • yea_places — Core place records (name, slug, coordinates, category, all enrichment fields)
    • yea_organizations — Research networks and managing bodies
    • yea_place_organizations — Many-to-many linking places to organizations
    • yea_place_types — Taxonomy of place types (33 types defined)
    • yea_place_type_links — Many-to-many linking places to types
    • yea_place_sources — Research source URLs discovered during enrichment

1.3 Script Location

All batch scripts live in ~/scripts/ on Data:

Script                        | Purpose
------------------------------|-----------------------------------------
batch-enrich-local.php        | Ollama-based AI enrichment (Tier 1)
import-bird-observatories.php | Bird observatory import
import-audubon-centers.php    | National Audubon Centers import
geocode-observatories.php     | Coordinate extraction via Ollama
bird_observatories.csv        | Source data for bird observatory import

Server-side enrichment scripts on Galatea:

Script           | Path                                   | Purpose
-----------------|----------------------------------------|-------------------------------------------------
enrich.php       | /projects/yea3d/api/places/enrich.php  | REST API for single-place enrichment
place-enrich.php | /projects/yea3d/admin/place-enrich.php | Admin UI for manual enrichment
batch-enrich.php | /projects/yea3d/admin/batch-enrich.php | Server-side CLI batch enrichment (Anthropic API)

1.4 Credentials

  • MySQL: Configured in each script ($DB_HOST, $DB_USER, $DB_PASS, $DB_NAME)
  • Ollama API key: Hardcoded in batch-enrich-local.php for web search and fetch
  • Anthropic API key: Stored in /Library/WebServer/secure/credentials/ecoaddress/ai-config.php on Galatea
  • Ollama API key (server): Also in ai-config.php as OLLAMA_API_KEY

2. Import Pipeline

2.1 Pattern

Every network import follows a consistent two-phase pattern:

Phase 1: Data Acquisition. A scraper script or manual data collection produces a structured dataset (CSV or PHP array) with place names, coordinates, state/province, country, website URLs, and any available metadata. Sources include network registry websites (OBFS, ILTER/DEIMS-SDR), federal APIs (PAD-US, WDPA), or curated lists.

Phase 2: Import Script. A PHP CLI script reads the dataset and for each place: generates a URL slug, checks for duplicate slugs, checks for geographic proximity to existing places (Haversine distance, typically 3-5km threshold), creates the organization if it doesn't exist, inserts the place record, and links it to the organization via yea_place_organizations.

2.2 Import Script Structure

All import scripts share this skeleton:

php import-[network].php              # dry run — shows what would be inserted
php import-[network].php --commit     # writes to database

Key sections:

  1. Config — Database credentials, commit flag
  2. Data array — Hardcoded place records (name, state, country, lat, lon, url)
  3. Organization creation — Insert or find the managing organization
  4. Slug helper — makeSlug() generates URL-safe identifiers
  5. Import loop — For each place: check slug uniqueness, check proximity, insert, link to org
  6. Summary — Report inserted/skipped counts and total catalog size
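
The slug helper can be sketched as follows. This is an illustrative Python version — the production makeSlug() is PHP and its exact normalization rules may differ:

```python
import re

def make_slug(name: str) -> str:
    """Illustrative sketch of the makeSlug() helper: lowercase the name
    and collapse every run of non-alphanumeric characters into a hyphen."""
    slug = name.strip().lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Trim hyphens left by punctuation at the edges
    return slug.strip("-")

print(make_slug("Hastings Natural History Reservation"))
# hastings-natural-history-reservation
```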

2.3 Duplicate Detection

Two mechanisms prevent duplicates:

  • Slug check: If a place with the same slug already exists, skip
  • Proximity check: Haversine distance against all existing places. If another place is within the threshold radius, log it. If name similarity exceeds 60% (via similar_text()), skip as likely duplicate. Otherwise insert as a distinct nearby site.
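
The proximity check can be sketched as follows. This is an illustrative Python version: the production scripts are PHP and use similar_text() for name similarity, for which difflib.SequenceMatcher is the closest standard-library analogue here.

```python
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_likely_duplicate(new, existing, radius_km=5.0, name_threshold=0.60):
    """Duplicate when within the radius AND name similarity exceeds 60%;
    a nearby place with a dissimilar name is treated as a distinct site."""
    for place in existing:
        if haversine_km(new["lat"], new["lon"], place["lat"], place["lon"]) <= radius_km:
            sim = SequenceMatcher(None, new["name"].lower(), place["name"].lower()).ratio()
            if sim > name_threshold:
                return True
    return False
```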

2.4 Completed Imports

Network            | Script                        | Places | Organization
-------------------|-------------------------------|--------|--------------------------------------------
OBFS               | import-obfs.php               | 194    | Organization of Biological Field Stations
UC NRS             | import-ucnrs.php              | 43     | UC Natural Reserve System
LTER               | manual                        | 28     | NSF Long Term Ecological Research
NEON               | manual                        | 58     | National Ecological Observatory Network
TNC                | manual                        | 86     | The Nature Conservancy
ILTER              | import-ilter.php              | 673    | International Long Term Ecological Research
Bird Observatories | import-bird-observatories.php | 52     | Bird Observatories of North America
Audubon Centers    | import-audubon-centers.php    | 46     | National Audubon Society

2.5 Geocoding

For networks without coordinates in their registry data (e.g., bird observatories from Merry's web directory), a geocoding script uses Ollama's web search and fetch APIs to extract coordinates:

php geocode-observatories.php              # process all
php geocode-observatories.php --limit 5    # test first 5
php geocode-observatories.php --skip-existing

For each observatory, the geocoder runs a web search for the name plus location, fetches the observatory's website if available, sends the collected context to gpt-oss:20b with instructions to extract latitude and longitude, validates the coordinates (range check, hemisphere check), checks for proximity duplicates against existing places, and writes the results to a CSV for human review before import.
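
The validation step can be sketched as follows (illustrative Python; the production geocoder is PHP, and the hemisphere expectations shown here are assumptions appropriate for North American observatories):

```python
def validate_coords(lat, lon, expect_north=True, expect_west=True):
    """Sanity-check model-extracted coordinates before human review.
    Range check: coordinates must be physically possible.
    Hemisphere check: for North American sites we expect positive
    latitude and negative longitude; a violation usually means the
    model dropped a sign or swapped lat/lon."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return False
    if expect_north and lat <= 0:
        return False
    if expect_west and lon >= 0:
        return False
    return True
```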

A Mapbox-based HTML verification tool (verify_observatories.html) provides a split-panel interface with the place list on the left and a satellite map on the right. Clicking any place flies the map to its coordinates for visual verification.


3. AI Enrichment Pipeline

3.1 Tiered Model Architecture

The enrichment system uses three tiers, each producing progressively richer content:

Tier   | Model                      | Cost         | Use Case                                   | Places Enriched
-------|----------------------------|--------------|--------------------------------------------|----------------
Tier 1 | Ollama gpt-oss:20b (local) | $0.00        | Baseline enrichment for all places         | 1,142
Tier 2 | Claude Haiku + web search  | ~$0.71/place | Deeper research with peer-reviewed sources | ~360
Tier 3 | Claude Sonnet/Opus         | ~$2-5/place  | Ultra-enrichment for high-priority sites   | On demand

The tiered approach means every place has at minimum a solid ecological profile, and curators can selectively upgrade sites that warrant deeper research.

3.2 Enrichment Fields

The enrichment system populates these fields on yea_places:

Field                   | Type    | Description
------------------------|---------|------------------------------------------------------------
site_abstract           | text    | 150-250 word encyclopedic summary
place_description       | text    | 80-150 word physical landscape description
stewardship_description | text    | Conservation programs, restoration work
history_description     | text    | Indigenous presence, ownership lineage, protection timeline
facilities_description  | text    | Buildings, labs, housing, trails
access_description      | text    | Directions, seasonal access, fees, permits
established_date        | date    | ISO format YYYY-MM-DD
area_acres              | float   | Total area in acres
elevation_range         | varchar | e.g., "1,200-3,800 ft"
wikipedia_url           | varchar | Most specific Wikipedia article
website                 | varchar | Official website (verified/corrected)

Research sources are saved to yea_place_sources with URL, domain, source type, and attribution (added_by = 'ollama-batch', 'haiku-batch', etc.).

3.3 Local Batch Enrichment (batch-enrich-local.php)

This is the primary enrichment tool. It runs on Data, using Ollama locally for inference and Ollama's cloud API for web search.

Usage:

php batch-enrich-local.php                    # dry run
php batch-enrich-local.php --commit           # process all unenriched
php batch-enrich-local.php --commit --limit 5 # test batch
php batch-enrich-local.php --commit --id 411  # single place
php batch-enrich-local.php --status           # show queue stats

Queue logic: Selects all published places where site_abstract IS NULL OR site_abstract = ''. The --id flag overrides this to process a specific place regardless of enrichment status (useful for re-enrichment).
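
Expressed as SQL, the queue selection looks roughly like this (a sketch: the text above states only the site_abstract predicate, so the published-status column name here is a hypothetical stand-in):

```sql
-- Sketch of the enrichment queue query. The 'status' column is a
-- hypothetical stand-in for however published places are flagged.
SELECT id, name
FROM yea_places
WHERE status = 'published'
  AND (site_abstract IS NULL OR site_abstract = '')
ORDER BY id;
```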

Processing loop for each place:

  1. Build context — Assemble known metadata: name, state, country, coordinates, category, website, Wikipedia URL, affiliated organizations, place types
  2. Web search — Execute 3-4 targeted queries via Ollama web search API:
    • Place name alone
    • Place name + organization + "ecology"
    • Place name + "history established conservation"
    • Place name + website hostname (if available)
  3. Website fetch — Fetch the place's own website via Ollama web fetch API (truncated to 8,000 characters)
  4. Context truncation — If total search context exceeds 12,000 characters, truncate to keep the prompt within the model's effective attention window
  5. JSON enforcement — Append explicit instructions to respond only with JSON
  6. Model call — Send system prompt + user prompt + search context to local gpt-oss:20b (32K context, temperature 0.3)
  7. Retry on failure — If the model returns nothing, retry once after 3 seconds
  8. JSON parsing — Strip markdown fences, find balanced JSON object, decode
  9. Citation cleanup — stripCitations() removes reference markup that models sometimes insert despite instructions
  10. Database save — Build dynamic UPDATE statement for non-null fields, execute with prepared statement
  11. Source save — Insert research source URLs into yea_place_sources with deduplication
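
Steps 8-9's fence stripping and balanced-object extraction can be sketched as follows (illustrative Python; the production script is PHP, and this naive brace counter does not account for braces inside string values):

```python
import json

def extract_json(raw: str):
    """Strip markdown code fences, then pull the first balanced {...}
    object out of model output and decode it. Returns None on failure."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop an opening fence like ``` or ```json, and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else text
        text = text.rsplit("```", 1)[0]
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None
```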

Error handling:

  • MySQL connection reconnect helper (ensureConnection()) using SELECT 1 test — required because LAN connections drop during long model calls
  • Per-record error isolation with try-catch — one failed record doesn't kill the batch
  • Error logging to batch-enrich-local-errors.log with timestamp, place ID, and first 500 characters of model output
  • Parse errors and model failures are counted separately in summary stats

System prompt: The system prompt defines 11 required output fields with word count guidelines, tone specifications (encyclopedic, third person, no marketing language), and the exact JSON schema expected. It explicitly prohibits citation markup in prose fields.
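
For reference, a response covering the 11 fields from Section 3.2 takes roughly this shape (values are illustrative placeholders only; the system prompt remains the authoritative schema and may also request source URLs):

```json
{
  "site_abstract": "150-250 word encyclopedic summary...",
  "place_description": "80-150 word physical landscape description...",
  "stewardship_description": "Conservation programs, restoration work...",
  "history_description": "Indigenous presence, ownership lineage...",
  "facilities_description": "Buildings, labs, housing, trails...",
  "access_description": "Directions, seasonal access, fees, permits...",
  "established_date": "1965-07-01",
  "area_acres": 2500.0,
  "elevation_range": "1,200-3,800 ft",
  "wikipedia_url": "https://en.wikipedia.org/wiki/Example_Reserve",
  "website": "https://example.org"
}
```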

3.4 Server-Side Enrichment (enrich.php)

The REST API at /projects/yea3d/api/places/enrich.php supports both Anthropic models and Ollama. It's called by the admin UI (place-enrich.php) for individual place enrichment with model selection.

Anthropic path: Sends the place context as a user message with web_search tool enabled. Claude autonomously searches the web, synthesizes results, and returns structured JSON. This produces the richest results but incurs API costs ($0.71/place average with Haiku due to web search fees).

Ollama path: Same web search + model flow as the batch script but triggered via HTTP API. Used when enriching from the admin UI with the Ollama model selector.

Model selector options: Haiku, Sonnet, Opus, Ollama (gpt-oss:20b)

3.5 Server-Side Batch (batch-enrich.php)

CLI script on Galatea for Anthropic API batch enrichment:

php batch-enrich.php --model haiku --commit --limit 10
php batch-enrich.php --model haiku --commit --id 411
php batch-enrich.php --status

Same pattern as the local script but uses Anthropic's API with web search tool instead of Ollama. Significantly more expensive due to web search per-request fees (~$0.07/search, 7-12 searches per international station).

3.6 Performance Characteristics

Ollama local (Tier 1):

  • Speed: 24-90 seconds per place (average ~35s on M4 Max)
  • Cost: $0.00 (local inference + $20/month Ollama Pro for web search quota)
  • Quality: Solid ecological profiles with 2-5 sources per place
  • Failure rate: ~1.5% (model failures, retryable)

Haiku + web search (Tier 2):

  • Speed: ~45 seconds per place
  • Cost: ~$0.71/place ($0.07/search × 7-12 searches + token costs)
  • Quality: Research-grade with peer-reviewed sources, specific measurements, detailed facilities
  • Failure rate: <1%

3.7 Ollama Configuration

  • Model: gpt-oss:20b (14GB VRAM, runs 100% GPU on M4 Max)
  • Context: 32,768 tokens (reduced from default to improve focus)
  • Temperature: 0.3 (low for factual consistency)
  • Ollama Pro subscription: $20/month, provides sufficient web search quota for batch runs
  • Web search API: https://ollama.com/api/web_search (requires Bearer token)
  • Web fetch API: https://ollama.com/api/web_fetch (separate quota from search)
  • Known issue: Other Ollama models (e.g., gemma3:12b) may auto-load and consume GPU memory. Use ollama ps to check and ollama stop [model] to unload before batch runs.

4. Place Taxonomy

4.1 Categories

The category column on yea_places stores a single flat classification:

Category         | Count | Description
-----------------|-------|------------------------------------------------------
field_station    | 953   | Biological field stations, research sites
nature_preserve  | 82    | Protected natural areas, TNC preserves
bird_observatory | 52    | Banding stations, hawk watches, migration monitoring
nature_sanctuary | 46    | Audubon centers, wildlife sanctuaries

4.2 Place Types (Many-to-Many)

The yea_place_types and yea_place_type_links tables support multiple type tags per place. 33 types are defined; the most populated are:

Type                               | ID | Places Tagged
-----------------------------------|----|--------------
Long-Term Ecological Research Site | 3  | 700
Biological Field Station           | 1  | 194
Nature Preserve                    | 13 | 87
NEON Field Site                    | 33 | 58
Bird Observatory                   | 2  | 52
Wildlife Sanctuary                 | 17 | 46
Natural Reserve                    | 14 | 40

4.3 Organizations

Organization                        | ID | Abbreviation | Places
------------------------------------|----|--------------|-------
ILTER                               |    | ILTER        | 673
OBFS                                |    | OBFS         | 194
The Nature Conservancy              |    | TNC          | 86
NSF NEON                            |    | NEON         | 58
Bird Observatories of North America | 22 | BONA         | 52
National Audubon Society            | 23 | Audubon      | 46
UC Natural Reserve System           |    | UC NRS       | 43
NSF LTER                            |    | LTER         | 28

5. Admin Interface

5.1 Dashboard

The admin dashboard at /projects/yea3d/admin/index.php displays:

  • System Overview: Cache entries, total queries, unique locations, cached narratives, average response time
  • Curated Places: Total places, published, enriched, with narrative, organizations, network affiliations, research sources, places needing enrichment
  • Research Networks: Places per organization with bar charts
  • Coverage by Country: Top 10 countries/regions
  • Daily Query Volume: 14-day chart
  • Cache Health: Row counts, oldest/newest entries per source
  • Source Reliability: Hit/cached/error rates per data source
  • Cache Management: Purge controls for weather, narrative, source-specific, and full cache

5.2 Place Enrichment UI

The admin place enrichment page (place-enrich.php) provides:

  • Model selector dropdown (Haiku, Sonnet, Opus, Ollama)
  • Per-station researcher guidance injection
  • Side-by-side comparison of existing vs proposed enrichment
  • Field-by-field accept/reject controls
  • Source URL management

5.3 Map Pin Management

The admin UI includes a map manager for refining place coordinates. Curators can search Mapbox for locations, drag map pins to exact positions, and adjust the radius for each place. This is the preferred method for coordinate refinement after batch imports.


6. Operational Procedures

6.1 Adding a New Network

  1. Identify the data source (registry, API, or curated list)
  2. Build or acquire a structured dataset with names, coordinates, and metadata
  3. Create an import script following the established pattern (see Section 2.2)
  4. Dry run to verify no unexpected duplicates
  5. Commit the import
  6. Run batch-enrich-local.php --commit to enrich all new places
  7. Update the admin dashboard if new metrics are needed

6.2 Running a Batch Enrichment

  1. Verify Ollama is running only gpt-oss:20b: ollama ps
  2. Unload other models if present: ollama stop [model]
  3. Check the queue: php batch-enrich-local.php --status
  4. Test with one record: php batch-enrich-local.php --commit --id [id]
  5. Launch the batch: php batch-enrich-local.php --commit
  6. Monitor progress in terminal (prints per-record status)
  7. If connection drops, the script resumes from where it left off (only processes places with empty site_abstract)

6.3 Re-enriching a Place

To upgrade a place from Tier 1 to Tier 2:

  • Use the admin UI model selector to choose Haiku or Sonnet
  • Or use the CLI: php batch-enrich.php --model haiku --commit --id [id]

To re-enrich with Ollama (e.g., after improving the system prompt):

  • php batch-enrich-local.php --commit --id [id] (the --id flag bypasses the empty-abstract check)

6.4 Troubleshooting

MySQL "server has gone away": The LAN connection drops during long model calls. The ensureConnection() function handles reconnection automatically. If it persists, check that Galatea's MySQL wait_timeout is adequate.

Model failures: Usually caused by context overflow. The 12,000-character truncation limit on search context prevents most failures. Retry individually — the model is non-deterministic and usually succeeds on a second attempt.

Parse errors: The model occasionally returns prose instead of JSON, especially with very long search contexts. The JSON enforcement reminder at the end of the prompt and the context truncation together keep the failure rate below 2%.

Ollama web search quota: With Ollama Pro ($20/month), session limits reset every few hours and weekly limits are generous. Monitor at https://ollama.com/settings. If rate-limited, reduce searches per station or wait for reset.

gemma3 auto-loading: Check ollama ps before batch runs. If gemma3:12b appears, ollama stop gemma3:12b to free GPU memory for gpt-oss:20b.


7. Cost Summary

7.1 Catalog Build Costs

Component                              | Cost
---------------------------------------|--------
Ollama Pro (1 month, web search quota) | $20.00
Haiku batch enrichment (63 stations)   | $45.00
Anthropic API (total account spend)    | $273.49
Claude Pro subscription (monthly)      | $100.00
Total infrastructure cost              | ~$438

7.2 Per-Place Economics

Operation                              | Cost
---------------------------------------|-------------
Import (any network)                   | $0.00
Tier 1 enrichment (Ollama)             | $0.00
Tier 2 enrichment (Haiku + web search) | ~$0.71/place
Tier 3 enrichment (Sonnet)             | ~$2-5/place
Geocoding (Ollama)                     | $0.00

8. Future Development

8.1 Planned Imports

  • USFS Research Natural Areas (~450 places) via ArcGIS REST feature services
  • USFWS National Wildlife Refuges (~300 filtered) via FWS cadastral data
  • BLM Areas of Critical Environmental Concern via BLM GIS Hub
  • NPS units with active research programs (~40 selective)

See CNL-FN-2026-030 for the complete federal lands import roadmap.

8.2 Platform Enhancements

  • YEA Lab: Data science portal with faceted search, cross-place comparison, and advanced filtering by organization, type, country, and enrichment tier
  • Search improvement: Add local yea_places and yea_organizations search to the field guide search bar, ranked above Wikipedia and Mapbox results
  • Badge filters: Category/organization filter badges on the globe view to partition the 1,142-place catalog into explorable subsets
  • Guided tour: Narrated flythrough to representative curated places

Document History

Version | Date       | Changes
--------|------------|----------------
1.0     | 2026-03-06 | Initial release

Cite This Document

Michael P. Hamilton, Ph.D. (2026). "YEA Place Catalog: Batch Import and AI Enrichment Infrastructure." Canemah Nature Laboratory Technical Note CNL-TN-2026-032. https://canemah.org/archive/CNL-TN-2026-032

BibTeX

@techreport{hamilton2026yea,
  author      = {Hamilton, Michael P.},
  title       = {YEA Place Catalog: Batch Import and AI Enrichment Infrastructure},
  institution = {Canemah Nature Laboratory},
  year        = {2026},
  month       = mar,
  number      = {CNL-TN-2026-032},
  url         = {https://canemah.org/archive/document.php?id=CNL-TN-2026-032},
  abstract    = {The Your Ecological Address (YEA) platform at yea.earth maintains a curated catalog of ecologically significant places worldwide: biological field stations, nature reserves, bird observatories, wildlife refuges, and similar sites. As of March 2026, the catalog contains 1,142 published places across six continents, affiliated with 23 organizations. This note documents the batch infrastructure that builds and enriches this catalog: the import pipeline that ingests structured data from research network registries, and the AI enrichment pipeline that produces research-grade narrative descriptions for each place using a tiered model architecture. Together these scripts run on a two-machine LAN configuration (Data for computation, Galatea for production MySQL) and have processed the entire catalog at near-zero marginal cost.}
}

Permanent URL: https://canemah.org/archive/document.php?id=CNL-TN-2026-032