YEA Place Catalog: Batch Import and AI Enrichment Infrastructure
Document ID: CNL-TN-2026-031
Version: 1.0
Date: March 6, 2026
Author: Michael P. Hamilton, Ph.D.
AI Assistance Disclosure: This technical note was developed through working dialogue with Claude (Anthropic, Opus 4). Claude contributed to the design and implementation of the scripts documented here and drafted this note from session transcripts. The author takes full responsibility for the content.
Abstract
The Your Ecological Address (YEA) platform at yea.earth maintains a curated catalog of ecologically significant places worldwide — biological field stations, nature reserves, bird observatories, wildlife refuges, and similar sites. As of March 2026, the catalog contains 1,142 published places across six continents, affiliated with 23 organizations. This note documents the batch infrastructure that builds and enriches this catalog: the import pipeline that ingests structured data from research network registries, and the AI enrichment pipeline that produces research-grade narrative descriptions for each place using a tiered model architecture. Together these scripts run on a two-machine LAN configuration (Data for computation, Galatea for production MySQL) and have processed the entire catalog at near-zero marginal cost.
1. System Architecture
1.1 Hardware
| Machine | Role | Hardware | Location |
|---|---|---|---|
| Data | Development, batch processing, local AI inference | MacBook Pro M4 Max, 128GB RAM | Development workstation |
| Galatea | Production server, MySQL database, Apache | Mac Mini M4 Pro, 1Gb fiber | galatea.local on LAN |
All batch scripts run on Data and write to Galatea's MySQL over LAN. This keeps heavy AI inference off the production server while maintaining direct database access.
1.2 Database
- Server: MySQL 8.4 on Galatea
- Binary path: `/usr/local/mysql-8.4.5-macos15-arm64/bin/mysql`
- Database: `ecological_address`
- Remote access: Granted to `root@192.168.6.%` for LAN connections from Data
- Primary tables:
  - `yea_places` — Core place records (name, slug, coordinates, category, all enrichment fields)
  - `yea_organizations` — Research networks and managing bodies
  - `yea_place_organizations` — Many-to-many linking places to organizations
  - `yea_place_types` — Taxonomy of place types (33 types defined)
  - `yea_place_type_links` — Many-to-many linking places to types
  - `yea_place_sources` — Research source URLs discovered during enrichment
1.3 Script Location
All batch scripts live in ~/scripts/ on Data:
| Script | Purpose |
|---|---|
| `batch-enrich-local.php` | Ollama-based AI enrichment (Tier 1) |
| `import-bird-observatories.php` | Bird observatory import |
| `import-audubon-centers.php` | National Audubon Centers import |
| `geocode-observatories.php` | Coordinate extraction via Ollama |
| `bird_observatories.csv` | Source data for bird observatory import |
Server-side enrichment scripts on Galatea:
| Script | Path | Purpose |
|---|---|---|
| `enrich.php` | `/projects/yea3d/api/places/enrich.php` | REST API for single-place enrichment |
| `place-enrich.php` | `/projects/yea3d/admin/place-enrich.php` | Admin UI for manual enrichment |
| `batch-enrich.php` | `/projects/yea3d/admin/batch-enrich.php` | Server-side CLI batch enrichment (Anthropic API) |
1.4 Credentials
- MySQL: Configured in each script (`$DB_HOST`, `$DB_USER`, `$DB_PASS`, `$DB_NAME`)
- Ollama API key: Hardcoded in `batch-enrich-local.php` for web search and fetch
- Anthropic API key: Stored in `/Library/WebServer/secure/credentials/ecoaddress/ai-config.php` on Galatea
- Ollama API key (server): Also in `ai-config.php` as `OLLAMA_API_KEY`
2. Import Pipeline
2.1 Pattern
Every network import follows a consistent two-phase pattern:
Phase 1: Data Acquisition. A scraper script or manual data collection produces a structured dataset (CSV or PHP array) with place names, coordinates, state/province, country, website URLs, and any available metadata. Sources include network registry websites (OBFS, ILTER/DEIMS-SDR), federal APIs (PAD-US, WDPA), or curated lists.
Phase 2: Import Script. A PHP CLI script reads the dataset and for each place: generates a URL slug, checks for duplicate slugs, checks for geographic proximity to existing places (Haversine distance, typically 3-5km threshold), creates the organization if it doesn't exist, inserts the place record, and links it to the organization via yea_place_organizations.
2.2 Import Script Structure
All import scripts share this skeleton:
```
php import-[network].php            # dry run — shows what would be inserted
php import-[network].php --commit   # writes to database
```
Key sections:
- Config — Database credentials, commit flag
- Data array — Hardcoded place records (name, state, country, lat, lon, url)
- Organization creation — Insert or find the managing organization
- Slug helper — `makeSlug()` generates URL-safe identifiers
- Import loop — For each place: check slug uniqueness, check proximity, insert, link to org
- Summary — Report inserted/skipped counts and total catalog size
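The slug helper's behavior can be illustrated with a short Python sketch (the actual `makeSlug()` is PHP; this is an approximation of the same lowercase/ASCII-fold/hyphenate logic, and the place names below are just examples):

```python
import re
import unicodedata

def make_slug(name: str) -> str:
    """Sketch of a makeSlug()-style helper: fold accents to ASCII,
    lowercase, and collapse runs of non-alphanumerics into hyphens."""
    # Decompose accented characters and drop the non-ASCII remainder
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    # Replace everything outside [a-z0-9] with single hyphens, trim the ends
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
```

A slug produced this way is stable for the same input, which is what makes the slug-based duplicate check in Section 2.3 possible.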
2.3 Duplicate Detection
Two mechanisms prevent duplicates:
- Slug check: If a place with the same slug already exists, skip
- Proximity check: Haversine distance against all existing places. If another place is within the threshold radius, log it. If name similarity exceeds 60% (via `similar_text()`), skip as likely duplicate. Otherwise insert as a distinct nearby site.
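The proximity check can be sketched in Python as follows. This is an illustration, not the script's code: `difflib.SequenceMatcher` stands in for PHP's `similar_text()` (the two metrics differ in detail but both return a 0–1 similarity), and the radius and threshold mirror the values given above.

```python
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_likely_duplicate(new, existing, radius_km=5.0, name_threshold=0.60):
    """Within the radius AND >60% name similarity -> treat as a duplicate."""
    if haversine_km(new["lat"], new["lon"], existing["lat"], existing["lon"]) > radius_km:
        return False  # too far apart to be the same site
    sim = SequenceMatcher(None, new["name"].lower(), existing["name"].lower()).ratio()
    return sim > name_threshold
```

Places inside the radius but below the name threshold are logged and inserted as distinct nearby sites, which matters for clusters like adjacent reserves sharing a watershed.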
2.4 Completed Imports
| Network | Script | Places | Organization |
|---|---|---|---|
| OBFS | `import-obfs.php` | 194 | Organization of Biological Field Stations |
| UC NRS | `import-ucnrs.php` | 43 | UC Natural Reserve System |
| LTER | manual | 28 | NSF Long Term Ecological Research |
| NEON | manual | 58 | National Ecological Observatory Network |
| TNC | manual | 86 | The Nature Conservancy |
| ILTER | `import-ilter.php` | 673 | International Long Term Ecological Research |
| Bird Observatories | `import-bird-observatories.php` | 52 | Bird Observatories of North America |
| Audubon Centers | `import-audubon-centers.php` | 46 | National Audubon Society |
2.5 Geocoding
For networks without coordinates in their registry data (e.g., bird observatories from Merry's web directory), a geocoding script uses Ollama's web search and fetch APIs to extract coordinates:
```
php geocode-observatories.php                   # process all
php geocode-observatories.php --limit 5         # test first 5
php geocode-observatories.php --skip-existing   # skip places that already have coordinates
```
For each observatory, the geocoder: runs a web search for the name plus location, fetches the observatory's website if available, sends the collected context to gpt-oss:20b with instructions to extract lat/lon, validates the coordinates (range check, hemisphere check), checks for proximity duplicates against existing places, and writes results to a CSV for human review before import.
A Mapbox-based HTML verification tool (verify_observatories.html) provides a split-panel interface with the place list on the left and a satellite map on the right. Clicking any place flies the map to its coordinates for visual verification.
3. AI Enrichment Pipeline
3.1 Tiered Model Architecture
The enrichment system uses three tiers, each producing progressively richer content:
| Tier | Model | Cost | Use Case | Places Enriched |
|---|---|---|---|---|
| Tier 1 | Ollama gpt-oss:20b (local) | $0.00 | Baseline enrichment for all places | 1,142 |
| Tier 2 | Claude Haiku + web search | ~$0.71/place | Deeper research with peer-reviewed sources | ~360 |
| Tier 3 | Claude Sonnet/Opus | ~$2-5/place | Ultra-enrichment for high-priority sites | On demand |
The tiered approach means every place has at minimum a solid ecological profile, and curators can selectively upgrade sites that warrant deeper research.
3.2 Enrichment Fields
The enrichment system populates these fields on yea_places:
| Field | Type | Description |
|---|---|---|
| `site_abstract` | text | 150-250 word encyclopedic summary |
| `place_description` | text | 80-150 word physical landscape description |
| `stewardship_description` | text | Conservation programs, restoration work |
| `history_description` | text | Indigenous presence, ownership lineage, protection timeline |
| `facilities_description` | text | Buildings, labs, housing, trails |
| `access_description` | text | Directions, seasonal access, fees, permits |
| `established_date` | date | ISO format YYYY-MM-DD |
| `area_acres` | float | Total area in acres |
| `elevation_range` | varchar | e.g., "1,200-3,800 ft" |
| `wikipedia_url` | varchar | Most specific Wikipedia article |
| `website` | varchar | Official website (verified/corrected) |
Research sources are saved to yea_place_sources with URL, domain, source type, and attribution (added_by = 'ollama-batch', 'haiku-batch', etc.).
3.3 Local Batch Enrichment (batch-enrich-local.php)
This is the primary enrichment tool. It runs on Data, using local Ollama inference for generation and Ollama's cloud API for web search.
Usage:
```
php batch-enrich-local.php                       # dry run
php batch-enrich-local.php --commit              # process all unenriched
php batch-enrich-local.php --commit --limit 5    # test batch
php batch-enrich-local.php --commit --id 411     # single place
php batch-enrich-local.php --status              # show queue stats
```
Queue logic: Selects all published places where `site_abstract IS NULL OR site_abstract = ''`. The `--id` flag overrides this to process a specific place regardless of enrichment status (useful for re-enrichment).
Processing loop for each place:
- Build context — Assemble known metadata: name, state, country, coordinates, category, website, Wikipedia URL, affiliated organizations, place types
- Web search — Execute 3-4 targeted queries via Ollama web search API:
- Place name alone
- Place name + organization + "ecology"
- Place name + "history established conservation"
- Place name + website hostname (if available)
- Website fetch — Fetch the place's own website via Ollama web fetch API (truncated to 8,000 characters)
- Context truncation — If total search context exceeds 12,000 characters, truncate to keep the prompt within the model's effective attention window
- JSON enforcement — Append explicit instructions to respond only with JSON
- Model call — Send system prompt + user prompt + search context to local gpt-oss:20b (32K context, temperature 0.3)
- Retry on failure — If the model returns nothing, retry once after 3 seconds
- JSON parsing — Strip markdown fences, find balanced JSON object, decode
- Citation cleanup — `stripCitations()` removes reference markup that models sometimes insert despite instructions
- Database save — Build dynamic UPDATE statement for non-null fields, execute with prepared statement
- Source save — Insert research source URLs into `yea_place_sources` with deduplication
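The JSON parsing step (strip fences, find a balanced object, decode) can be sketched as follows. This is an illustrative Python version of the logic, not the script's code, and it deliberately ignores the edge case of braces inside JSON string values, which the balanced-brace scan below would miscount:

```python
import json

def extract_json(raw: str):
    """Strip markdown fences, then pull and decode the first
    balanced {...} object from model output. Returns None on failure."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with optional language tag)...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ...and everything from the closing fence onward
        text = text.rsplit("```", 1)[0]
    start = text.find("{")
    if start == -1:
        return None  # model returned prose with no JSON at all
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start : i + 1])
                except json.JSONDecodeError:
                    return None
    return None  # braces never balanced (truncated output)
```

Scanning for a balanced object rather than calling `json.loads` on the whole response is what tolerates trailing chatter after the JSON, a common small-model failure.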
Error handling:
- MySQL connection reconnect helper (`ensureConnection()`) using a `SELECT 1` test — required because LAN connections drop during long model calls
- Per-record error isolation with try-catch — one failed record doesn't kill the batch
- Error logging to `batch-enrich-local-errors.log` with timestamp, place ID, and first 500 characters of model output
- Parse errors and model failures are counted separately in summary stats
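The reconnect helper's shape is simple enough to sketch. Below is a Python stand-in (the real `ensureConnection()` is PHP against MySQL; here `sqlite3` plays the role of the database link purely so the example is self-contained):

```python
import sqlite3  # stand-in for the MySQL link in the PHP script

def ensure_connection(conn, reconnect):
    """Sketch of ensureConnection(): probe the link with SELECT 1 and
    re-open it if the probe fails (LAN links drop during long model calls)."""
    try:
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.fetchone()
        return conn  # link is alive
    except Exception:
        return reconnect()  # link dropped: hand back a fresh connection

# Demo: a closed connection fails the probe and triggers reconnect
live = sqlite3.connect(":memory:")
stale = sqlite3.connect(":memory:")
stale.close()
assert ensure_connection(live, lambda: None) is live
fresh = sqlite3.connect(":memory:")
assert ensure_connection(stale, lambda: fresh) is fresh
```

Calling this before every database write, rather than once per place, is what lets a 60-second model call sit between queries without losing the batch.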
System prompt: The system prompt defines 11 required output fields with word count guidelines, tone specifications (encyclopedic, third person, no marketing language), and the exact JSON schema expected. It explicitly prohibits citation markup in prose fields.
3.4 Server-Side Enrichment (enrich.php)
The REST API at /projects/yea3d/api/places/enrich.php supports both Anthropic models and Ollama. It's called by the admin UI (place-enrich.php) for individual place enrichment with model selection.
Anthropic path: Sends the place context as a user message with web_search tool enabled. Claude autonomously searches the web, synthesizes results, and returns structured JSON. This produces the richest results but incurs API costs ($0.71/place average with Haiku due to web search fees).
Ollama path: Same web search + model flow as the batch script but triggered via HTTP API. Used when enriching from the admin UI with the Ollama model selector.
Model selector options: Haiku, Sonnet, Opus, Ollama (gpt-oss:20b)
3.5 Server-Side Batch (batch-enrich.php)
CLI script on Galatea for Anthropic API batch enrichment:
```
php batch-enrich.php --model haiku --commit --limit 10
php batch-enrich.php --model haiku --commit --id 411
php batch-enrich.php --status
```
Same pattern as the local script but uses Anthropic's API with web search tool instead of Ollama. Significantly more expensive due to web search per-request fees (~$0.07/search, 7-12 searches per international station).
3.6 Performance Characteristics
Ollama local (Tier 1):
- Speed: 24-90 seconds per place (average ~35s on M4 Max)
- Cost: $0.00 (local inference + $20/month Ollama Pro for web search quota)
- Quality: Solid ecological profiles with 2-5 sources per place
- Failure rate: ~1.5% (model failures, retryable)
Haiku + web search (Tier 2):
- Speed: ~45 seconds per place
- Cost: ~$0.71/place ($0.07/search × 7-12 searches + token costs)
- Quality: Research-grade with peer-reviewed sources, specific measurements, detailed facilities
- Failure rate: <1%
3.7 Ollama Configuration
- Model: gpt-oss:20b (14GB VRAM, runs 100% GPU on M4 Max)
- Context: 32,768 tokens (reduced from default to improve focus)
- Temperature: 0.3 (low for factual consistency)
- Ollama Pro subscription: $20/month, provides sufficient web search quota for batch runs
- Web search API: `https://ollama.com/api/web_search` (requires Bearer token)
- Web fetch API: `https://ollama.com/api/web_fetch` (separate quota from search)
- Known issue: Other Ollama models (e.g., gemma3:12b) may auto-load and consume GPU memory. Use `ollama ps` to check and `ollama stop [model]` to unload before batch runs.
4. Place Taxonomy
4.1 Categories
The category column on yea_places stores a single flat classification:
| Category | Count | Description |
|---|---|---|
| `field_station` | 953 | Biological field stations, research sites |
| `nature_preserve` | 82 | Protected natural areas, TNC preserves |
| `bird_observatory` | 52 | Banding stations, hawk watches, migration monitoring |
| `nature_sanctuary` | 46 | Audubon centers, wildlife sanctuaries |
4.2 Place Types (Many-to-Many)
The yea_place_types and yea_place_type_links tables support multiple type tags per place. 33 types are defined; the most populated are:
| Type | ID | Places Tagged |
|---|---|---|
| Long-Term Ecological Research Site | 3 | 700 |
| Biological Field Station | 1 | 194 |
| Nature Preserve | 13 | 87 |
| NEON Field Site | 33 | 58 |
| Bird Observatory | 2 | 52 |
| Wildlife Sanctuary | 17 | 46 |
| Natural Reserve | 14 | 40 |
4.3 Organizations
| Organization | ID | Abbreviation | Places |
|---|---|---|---|
| ILTER | — | ILTER | 673 |
| OBFS | — | OBFS | 194 |
| The Nature Conservancy | — | TNC | 86 |
| NSF NEON | — | NEON | 58 |
| Bird Observatories of North America | 22 | BONA | 52 |
| National Audubon Society | 23 | Audubon | 46 |
| UC Natural Reserve System | — | UC NRS | 43 |
| NSF LTER | — | LTER | 28 |
5. Admin Interface
5.1 Dashboard
The admin dashboard at /projects/yea3d/admin/index.php displays:
- System Overview: Cache entries, total queries, unique locations, cached narratives, average response time
- Curated Places: Total places, published, enriched, with narrative, organizations, network affiliations, research sources, places needing enrichment
- Research Networks: Places per organization with bar charts
- Coverage by Country: Top 10 countries/regions
- Daily Query Volume: 14-day chart
- Cache Health: Row counts, oldest/newest entries per source
- Source Reliability: Hit/cached/error rates per data source
- Cache Management: Purge controls for weather, narrative, source-specific, and full cache
5.2 Place Enrichment UI
The admin place enrichment page (place-enrich.php) provides:
- Model selector dropdown (Haiku, Sonnet, Opus, Ollama)
- Per-station researcher guidance injection
- Side-by-side comparison of existing vs proposed enrichment
- Field-by-field accept/reject controls
- Source URL management
5.3 Map Pin Management
The admin UI includes a map manager for refining place coordinates. Curators can search Mapbox for locations, drag map pins to exact positions, and adjust the radius for each place. This is the preferred method for coordinate refinement after batch imports.
6. Operational Procedures
6.1 Adding a New Network
- Identify the data source (registry, API, or curated list)
- Build or acquire a structured dataset with names, coordinates, and metadata
- Create an import script following the established pattern (see Section 2.2)
- Dry run to verify no unexpected duplicates
- Commit the import
- Run `batch-enrich-local.php --commit` to enrich all new places
- Update the admin dashboard if new metrics are needed
6.2 Running a Batch Enrichment
- Verify Ollama is running only gpt-oss:20b: `ollama ps`
- Unload other models if present: `ollama stop [model]`
- Check the queue: `php batch-enrich-local.php --status`
- Test with one record: `php batch-enrich-local.php --commit --id [id]`
- Launch the batch: `php batch-enrich-local.php --commit`
- Monitor progress in terminal (prints per-record status)
- If the connection drops, the script resumes from where it left off (it only processes places with an empty `site_abstract`)
6.3 Re-enriching a Place
To upgrade a place from Tier 1 to Tier 2:
- Use the admin UI model selector to choose Haiku or Sonnet
- Or use the CLI: `php batch-enrich.php --model haiku --commit --id [id]`
To re-enrich with Ollama (e.g., after improving the system prompt):
`php batch-enrich-local.php --commit --id [id]` (the `--id` flag bypasses the empty-abstract check)
6.4 Troubleshooting
MySQL "server has gone away": The LAN connection drops during long model calls. The `ensureConnection()` function handles reconnection automatically. If the error persists, check that Galatea's MySQL `wait_timeout` is adequate.
Model failures: Usually caused by context overflow. The 12,000-character truncation limit on search context prevents most failures. Retry individually — the model is non-deterministic and usually succeeds on a second attempt.
Parse errors: The model occasionally returns prose instead of JSON, especially with very long search contexts. The JSON enforcement reminder at the end of the prompt and the context truncation together keep the failure rate below 2%.
Ollama web search quota: With Ollama Pro ($20/month), session limits reset every few hours and weekly limits are generous. Monitor at https://ollama.com/settings. If rate-limited, reduce searches per station or wait for reset.
gemma3 auto-loading: Check `ollama ps` before batch runs. If gemma3:12b appears, run `ollama stop gemma3:12b` to free GPU memory for gpt-oss:20b.
7. Cost Summary
7.1 Catalog Build Costs
| Component | Cost |
|---|---|
| Ollama Pro (1 month, web search quota) | $20.00 |
| Haiku batch enrichment (63 stations) | $45.00 |
| Anthropic API (total account spend) | $273.49 |
| Claude Pro subscription (monthly) | $100.00 |
| Total infrastructure cost | ~$438 |
7.2 Per-Place Economics
| Operation | Cost |
|---|---|
| Import (any network) | $0.00 |
| Tier 1 enrichment (Ollama) | $0.00 |
| Tier 2 enrichment (Haiku + web search) | ~$0.71/place |
| Tier 3 enrichment (Sonnet) | ~$2-5/place |
| Geocoding (Ollama) | $0.00 |
8. Future Development
8.1 Planned Imports
- USFS Research Natural Areas (~450 places) via ArcGIS REST feature services
- USFWS National Wildlife Refuges (~300 filtered) via FWS cadastral data
- BLM Areas of Critical Environmental Concern via BLM GIS Hub
- NPS units with active research programs (~40 selective)
See CNL-FN-2026-030 for the complete federal lands import roadmap.
8.2 Platform Enhancements
- YEA Lab: Data science portal with faceted search, cross-place comparison, and advanced filtering by organization, type, country, and enrichment tier
- Search improvement: Add local
yea_placesandyea_organizationssearch to the field guide search bar, ranked above Wikipedia and Mapbox results - Badge filters: Category/organization filter badges on the globe view to partition the 1,142-place catalog into explorable subsets
- Guided tour: Narrated flythrough to representative curated places
Document History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-03-06 | Initial release |
Cite This Document
Permanent URL: https://canemah.org/archive/document.php?id=CNL-TN-2026-032