Ethical AI Lab: Evaluating Local Ecological Summarization with Comma v0.1 on Apple MLX
Document ID: CNL-TN-2026-039 Version: 1.0 Date: April 2, 2026 Author: Michael P. Hamilton, Ph.D.
AI Assistance Disclosure: This technical note was developed with assistance from Claude (Anthropic, Claude Opus 4.6) operating in Cowork mode within the Claude Desktop application. Claude contributed to system architecture, code development (PHP, Python, JavaScript, CSS), debugging, browser-based testing, and manuscript drafting. The author takes full responsibility for the content, accuracy, and conclusions.
Keywords: ethical AI, Comma v0.1, Common Pile, Apple MLX, local inference, ecological summarization, Macroscope
Abstract
This technical note documents the design, implementation, and evaluation of the Ethical AI Lab, an experimental dashboard that connects the Macroscope sensor network to a locally-running, ethically-sourced language model for ecological data summarization. The system uses Comma v0.1-2T, a 7-billion parameter model trained exclusively on public domain and openly licensed text (Common Pile v0.1), running via Apple MLX on an M4 Max processor. The dashboard fetches live monitoring data from three sensor types (WeatherFlow Tempest, BirdWeather acoustic network, and Ecowitt garden station) through the macroscope database, assembles structured prompts with ecological context from macroscope_nexus curated place records, and sends them to the local model for summarization. Evaluation reveals that Comma produces serviceable prose synthesis of structured data but fails reliably on quantitative operations including ranking, comparison, and arithmetic. These findings define the operational envelope for small, ethically-sourced models in ecological monitoring: useful as a data-to-prose narration layer when paired with programmatic analysis, but unsuitable for independent analytical reasoning. The complete system runs without cloud dependency, using no data from proprietary or ethically questionable training sources.
1. Introduction
The integration of large language models into ecological monitoring systems raises two distinct challenges: technical capability and ethical provenance. Most capable models are trained on datasets of uncertain origin, incorporating copyrighted material, personal data scraped without consent, and content from communities that received no compensation or attribution. For a research program committed to transparency and ethical practice, the question is not merely whether an AI model can summarize sensor data, but whether it can do so without complicity in the extractive data practices that characterize the current AI industry.
This technical note documents an experiment to answer that question. The Ethical AI Lab is a prototype dashboard that connects the Macroscope sensor network at the Canemah Nature Laboratory to Comma v0.1-2T, a 7-billion parameter language model trained exclusively on the Common Pile v0.1 corpus, which contains only public domain works and text released under open licenses. The model runs locally on Apple Silicon via the MLX framework, with no cloud dependency and no data leaving the local network.
The objectives of this experiment were threefold: (1) to establish whether a small, ethically-sourced model can produce useful ecological data summaries, (2) to identify the specific capabilities and limitations of such a model through systematic evaluation, and (3) to define the architectural pattern by which a local model can be integrated into the Macroscope intelligence pipeline.
2. Background
2.1 The Ethical Provenance Problem
Commercial language models from OpenAI, Google, Meta, and others are trained on massive web scrapes that include copyrighted books, news articles, academic papers, and personal content harvested without consent. Recent legal settlements and leaked documents have confirmed the scale of this appropriation. For the Canemah Nature Laboratory, which operates under a commitment to transparent, ethically-grounded research, these models present an unacceptable dependency regardless of their technical capability.
2.2 Common Pile and Comma v0.1
The Common Pile v0.1 is a curated text corpus assembled by the Pile project community, containing only public domain works and text released under permissive open licenses (CC-BY, CC-BY-SA, and similar). The Comma v0.1-2T model was trained on this corpus by the Common Pile project, producing a 7-billion parameter LLaMA-architecture model with a 16,384-token context window. While modest by current standards, this model represents one of the few options for language model inference that can claim clean ethical provenance.
2.3 Apple MLX
MLX is Apple's machine learning framework for Apple Silicon, providing GPU-accelerated inference through the Metal API. The mlx-lm package provides a Python interface for loading and running language models in MLX format. This eliminates the need for intermediate frameworks like Ollama, running the model directly on the M4 Max GPU with minimal overhead.
3. System Architecture
The Ethical AI Lab consists of four layers: a PHP web dashboard, a MySQL data layer spanning two databases, a Python inference server, and the Comma model running on MLX. All components run on a single machine (MacBook Pro M4 Max) with no external network dependencies.
3.1 Component Overview
| Component | Technology | Role | Notes |
|---|---|---|---|
| Web Dashboard | PHP 8.5, HTML5, vanilla JS | User interface | Place search, summarize, chat, raw data display |
| Place Registry | macroscope_nexus (MySQL 8.4) | Ecological context | Curated places, monitoring sources, narratives |
| Sensor Data | macroscope (MySQL 8.4) | Time-series data | Tempest, BirdWeather, Ecowitt readings |
| Inference Server | FastAPI + uvicorn (Python) | LLM gateway | Thread-locked, POST /generate endpoint |
| Language Model | Comma v0.1-2T via mlx-lm | Text generation | 7B params, 16K context, Metal GPU |
Table 1. System components and their roles.
3.2 Data Flow
The summarization pipeline proceeds as follows: (1) the user selects a curated place from the searchable dropdown, (2) PHP queries macroscope_nexus for the place record including description, curated narrative, category, and elevation, (3) PHP queries monitoring_sources for active sensors at that place, (4) for each sensor, PHP queries the macroscope database for the most recent readings using source-type-specific fetcher functions, (5) PHP assembles a structured prompt combining ecological context with current sensor data, (6) the prompt is sent via HTTP POST to the local Comma server at 127.0.0.1:8000, (7) the raw completion is post-processed by a regex-based cleanup function that truncates at known junk markers, and (8) the cleaned summary is returned to the browser alongside the raw data and the full prompt for transparency.
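Steps 2 through 5 of the pipeline can be sketched as follows. This is an illustrative Python sketch of the prompt-assembly step only; the production implementation is PHP (lib/llm.php), and the field names and prompt labels below are hypothetical, not the actual schema.

```python
# Hypothetical sketch of prompt assembly (step 5). The real pipeline does
# this in PHP; field names and prompt labels here are illustrative.

def build_summary_prompt(place: dict, readings: dict) -> str:
    """Combine curated ecological context with current sensor readings."""
    lines = [
        f"PLACE: {place['name']} (elevation {place['elevation_m']} m)",
        f"CONTEXT: {place['narrative']}",
        "CURRENT CONDITIONS:",
    ]
    for source, values in readings.items():
        pairs = ", ".join(f"{k}: {v}" for k, v in values.items())
        lines.append(f"- {source}: {pairs}")
    lines.append("TASK: Summarize the current conditions in two or three "
                 "sentences, using only the data above.")
    return "\n".join(lines)

# Example inputs (illustrative values, not a live database query)
place = {"name": "Canemah Nature Laboratory", "elevation_m": 40,
         "narrative": "Willamette Valley oak savanna remnant."}
readings = {"tempest": {"temp_f": 46, "humidity_pct": 87},
            "birdweather": {"detections": 709, "species": 24}}
prompt = build_summary_prompt(place, readings)
```

The assembled string is what the POST to 127.0.0.1:8000 would carry, and it is the same text the dashboard exposes in its transparency panel.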
3.3 Chat Interface
In addition to the one-shot summarization, the dashboard provides a chat interface that allows the user to ask questions about the current conditions data. Each question triggers a fresh data fetch and prompt assembly, ensuring the model always sees the most recent sensor readings. The chat prompt includes the full ecological context (place description, curated narrative, coordinates, elevation) alongside the sensor data, grounded by the instruction to answer using only the provided data.
3.4 Post-Processing Pipeline
Small language models exhibit characteristic failure modes including self-evaluation loops, prompt echoing, and self-generated Q&A sequences. The cleanSummary() function in PHP addresses these through sequential filtering: (1) truncation at known stop markers (RESPONSE:, SYSTEM:, USER:, Example, etc.), (2) detection and truncation at self-generated Question: patterns, (3) truncation at duplicate Answer: markers, and (4) sentence-boundary trimming for longer outputs with passthrough for short direct answers. This post-processing is essential for producing usable output from the model.
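The filtering sequence can be sketched in Python as follows. The production cleanSummary() is PHP; this sketch assumes an illustrative marker list and an arbitrary 80-character threshold for "longer outputs", neither of which is taken from the actual implementation.

```python
import re

# Python sketch of the cleanSummary() filtering described above. The real
# function is PHP (lib/llm.php); the marker list and the 80-character
# "long output" threshold are illustrative assumptions.
STOP_MARKERS = ["RESPONSE:", "SYSTEM:", "USER:", "Example", "Question:"]

def clean_summary(text: str) -> str:
    # (1)-(2) Truncate at the earliest occurrence of any known junk marker,
    # including self-generated "Question:" patterns.
    cut = len(text)
    for marker in STOP_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    text = text[:cut].strip()
    # (3) Truncate at a duplicate "Answer:" marker, keeping the first answer.
    parts = text.split("Answer:")
    if len(parts) > 2:
        text = "Answer:".join(parts[:2]).strip()
    # (4) Trim longer outputs back to the last sentence boundary; short
    # direct answers pass through unchanged.
    if len(text) > 80:
        m = re.search(r"[.!?](?=[^.!?]*$)", text)
        if m:
            text = text[:m.end()].strip()
    return text
```

For example, a completion like "It is sunny. RESPONSE: The assistant's response is accurate." would be cut back to "It is sunny."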
4. Implementation Details
4.1 File Structure
| File | Purpose |
|---|---|
| index.php | Dashboard page with place search, summary panel, chat, raw data |
| js/app.js | Frontend logic: place search, summarize/chat requests, rendering |
| css/style.css | Lab aesthetic: minimal, readable, dark-on-light |
| lib/env.php | Environment bootstrap, sets SECURE_BASE for credentials |
| lib/database.php | Dual connections: nexusConnect(), macroConnect() |
| lib/fetchers.php | Source-type fetchers: fetchWeather(), fetchBirds(), fetchGarden() |
| lib/llm.php | Prompt builders, Comma client, cleanSummary() post-processor |
| api/places.php | Place list and detail endpoints |
| api/conditions.php | Per-source condition fetching |
| api/summarize.php | Full pipeline: fetch, prompt, call Comma, return |
| api/chat.php | Q&A endpoint: question + conditions + context to Comma |
| comma-server.py | FastAPI wrapper for Comma via mlx-lm with thread lock |
Table 2. Project file structure.
4.2 Inference Server
The Comma server (comma-server.py) is a minimal FastAPI application that loads the model at startup and exposes a POST /generate endpoint. A critical implementation detail is the threading.Lock() that serializes all inference calls. MLX Metal operations are not thread-safe; concurrent requests to the model cause GPU command buffer collisions that crash the process with a segmentation fault. The lock ensures only one inference runs at a time, with a 60-second timeout for queued requests. A GET /health endpoint provides liveness checking.
4.3 Token Budget
Comma v0.1-2T supports a 16,384-token context window. The summarize prompt for the Canemah Nature Laboratory, including the full curated narrative and data from all five monitoring sources, measures approximately 6,500 characters, or roughly 1,600 to 1,860 tokens. With the 256-token generation limit, each request consumes approximately 11 to 13 percent of the available context window. This leaves substantial headroom for historical trend data, longer species lists, or multi-turn conversation history.
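The budget arithmetic can be checked directly. The token counts below are the estimates quoted in the text (character-based approximations), not measured tokenizer output.

```python
# Context-window budget check using the figures quoted above. The token
# estimates (1,600-1,860 for a ~6,500-character prompt) are the text's
# approximations, not tokenizer measurements.
CONTEXT_WINDOW = 16_384
GEN_LIMIT = 256

def budget_pct(prompt_tokens: int) -> float:
    """Percent of the context window consumed by one request."""
    return 100.0 * (prompt_tokens + GEN_LIMIT) / CONTEXT_WINDOW

low, high = budget_pct(1_600), budget_pct(1_860)
# low is about 11.3%, high about 12.9% of the window per request.
```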
5. Evaluation Results
The following results are drawn from live testing against the Canemah Nature Laboratory place record, which registers five monitoring sources: three supported types (WeatherFlow Tempest weather station, BirdWeather acoustic monitor, and Ecowitt garden station) and two currently unsupported types (STRATA climate logger and panoramic camera).
5.1 Prose Synthesis (Summarization)
Comma produced coherent, factually accurate prose summaries of structured sensor data. A representative output:
The weather at the Canemah Nature Laboratory is currently sunny with a temperature of 46 degrees Fahrenheit, a humidity of 87%, and a wind speed of 3 mph. The bird activity is high with 709 detections and 24 species, including White-crowned Sparrow, Northern Flicker, Bushtit, Golden-crowned Sparrow, House Finch, Anna's Hummingbird, Dark-eyed Junco, and American Crow. The garden conditions are dry with a soil moisture of 66% and a leaf wetness of 0.
Every data point in this summary is accurately reported from the sensor readings. The model correctly characterized bird activity as "high" (709 detections across 24 species is above typical baseline) and garden conditions as "dry" (leaf wetness of 0). The prose reads naturally and would be serviceable as a field station status report.
5.2 Contextual Reasoning
When provided with the curated ecological narrative for Canemah, the model demonstrated an ability to synthesize place context with species observations. Asked "why might White-crowned Sparrows be so abundant here," Comma responded:
White-crowned Sparrows are abundant here because they are a common bird species in the Willamette Valley oak savanna ecoregion, which is a Nature Imperiled habitat type that has suffered vast losses since European settlement.
This response connects the species observation to the ecoregional context from the curated narrative, demonstrating that the model can ground its answers in the provided ecological context rather than relying solely on general knowledge.
5.3 Factual Lookup
For simple factual questions where the answer is directly present in the data, Comma performed well. "What is the most common bird being detected" correctly returned "White-crowned Sparrow" (293 detections, the highest count). "Is the soil temperature warmer than the air temperature" correctly returned "yes" (soil 47.7F vs. air 46F).
5.4 Quantitative Reasoning (Failure Mode)
Comma failed consistently on questions requiring comparison, ranking, or arithmetic across data values. Asked "what is the least detected bird this morning," the model returned "Anna's Hummingbird" (26 detections, mid-pack). On a second attempt with different phrasing, it returned "American Crow" (25 detections, also incorrect). In both cases, the model selected a plausible-sounding species rather than performing the min-value operation on the detection counts. This failure is consistent and represents a fundamental limitation of the model's reasoning capability.
5.5 Output Discipline (Failure Mode, Mitigated)
Without post-processing, Comma exhibited three characteristic failure modes: (1) self-evaluation, where the model appends meta-commentary such as "RESPONSE: The assistant's response is accurate and concise," (2) prompt echoing, where the model regurgitates the input data after the summary, and (3) self-generated Q&A loops, where the model generates its own questions and answers after responding to the actual query. All three modes are effectively mitigated by the cleanSummary() post-processing function, which truncates output at known markers and trims to sentence boundaries.
5.6 Capability Summary
| Capability | Rating | Notes |
|---|---|---|
| Prose synthesis from data | Good | Accurate, readable field report summaries |
| Contextual reasoning | Good | Connects species to ecoregional narrative |
| Simple factual lookup | Good | Correct when answer is obvious from data |
| Yes/no comparison | Fair | Correct for simple pairs, unreliable for close values |
| Ranking / min-max | Poor | Confabulates plausible answers instead of computing |
| Arithmetic | Poor | Cannot reliably add, subtract, or average values |
| Output discipline | Poor (raw) | Self-evaluation loops, prompt echoing, Q&A generation |
| Output discipline | Good (filtered) | cleanSummary() post-processor effectively mitigates |
Table 3. Capability assessment summary.
6. Technical Challenges
6.1 MLX Metal GPU Crashes
The initial mlx package version exhibited a Metal command buffer assertion failure (AGXG16X family) when running inference on the M4 Max GPU. The error manifested as a segmentation fault that killed the Python process. This was resolved by upgrading mlx and mlx-lm to the latest versions via pip. A secondary crash mode occurred when two HTTP requests hit the inference endpoint simultaneously, causing concurrent Metal operations. This was resolved by adding a threading.Lock() to serialize all model access.
6.2 mlx-lm API Instability
The mlx-lm package changed its generate() function signature between versions. The temperature parameter, previously a direct keyword argument, was moved to a sampler object via sample_utils.make_sampler(). The final implementation drops temperature control entirely, using model defaults, to avoid API compatibility issues.
6.3 PHP 8.5 Compatibility
PHP 8.5 deprecates curl_close(), which was previously called explicitly after each HTTP request. The deprecation warning was emitted as HTML before the JSON response body, causing frontend JSON parse failures. The fix was to remove the curl_close() call entirely, as PHP 8.0 and later auto-close curl handles.
7. Discussion
7.1 The Narration Layer Pattern
The evaluation results define a clear architectural role for small, ethically-sourced models in the Macroscope intelligence pipeline. The model functions as a narration layer: it translates structured data and pre-computed analytical results into readable prose. It does not perform the analysis itself. In the four-tier intelligence pipeline (Observation, Verification, Interpretation, Discovery), Comma operates at the boundary between Verification and Interpretation, converting verified data streams into human-readable summaries.
The critical design principle is that quantitative analysis must be performed programmatically in PHP before the results are passed to the model. Trend analysis, anomaly detection, statistical comparisons, and rankings should be computed by code (or by SOMA's RBM meshes) and presented to the model as pre-digested facts. The model's role is to assemble these facts into coherent prose, not to derive them.
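A minimal sketch of this division of labor follows. The species counts are an illustrative subset, not live sensor data, and the FACTS format is a hypothetical convention; the real pipeline would do this computation in PHP before calling Comma.

```python
# Sketch of the narration-layer pattern: code computes the quantitative
# facts, and only the pre-digested results are handed to the model.
# Counts below are an illustrative subset, not live data.
detections = {
    "White-crowned Sparrow": 293,
    "Anna's Hummingbird": 26,
    "American Crow": 25,
}

def precomputed_facts(counts: dict) -> str:
    """Do the ranking in code; the model only narrates the result."""
    most = max(counts, key=counts.get)
    least = min(counts, key=counts.get)
    total = sum(counts.values())
    return (f"FACTS: total detections {total}; "
            f"most detected {most} ({counts[most]}); "
            f"least detected {least} ({counts[least]}).")

facts = precomputed_facts(detections)
# The prompt then asks the model to narrate `facts`; it is never asked
# to rank the raw counts itself.
```

This is precisely the operation Comma failed in Section 5.4: min-value selection is trivial in code and unreliable in the model, so the boundary belongs in code.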
7.2 Ethical Provenance as Design Constraint
Operating under an ethical provenance constraint significantly limits model selection. Comma v0.1-2T is, to the author's knowledge, one of the few models that can credibly claim clean training data sourcing. The Allen AI OLMo models, while transparent about their Dolma training corpus, ultimately rely on Common Crawl web scrapes that include copyrighted content harvested without consent. Transparency about the scraping does not make the scraping ethical.
The performance ceiling of a 7B public-domain model is real but acceptable for the defined use case. The model does not need to reason about ecology, discover patterns, or generate hypotheses. It needs to turn numbers into sentences. For that specific task, Comma v0.1 is adequate.
7.3 Future Directions
Several extensions are planned: (1) a PHP-based trend analysis layer that computes rolling averages, deltas, and percentile rankings from historical sensor data, passing pre-computed trends to Comma for narration; (2) integration with SOMA anomaly detection, where free energy spikes and tension patterns are translated into natural language by the model; (3) expanded monitoring source support as additional sensor types come online in the Macroscope network; and (4) evaluation of larger Common Pile models as they become available.
8. Limitations
This evaluation is limited to a single place (Canemah Nature Laboratory) with three active sensor types. The model's performance on places with thinner ecological narratives or different sensor configurations has not been tested. The evaluation is qualitative rather than systematic; a rigorous benchmark would require a standardized question set with ground-truth answers. The model's behavior may vary across mlx-lm versions, and the post-processing heuristics are tuned to observed failure modes that may change with different model versions or prompting strategies.
9. Conclusion
The Ethical AI Lab demonstrates that a locally-running, ethically-sourced language model can provide useful ecological data summarization when properly constrained. Comma v0.1-2T, operating on Apple MLX without cloud dependency, produces accurate prose summaries of multi-source sensor data and can synthesize species observations with curated ecological context. Its failure on quantitative reasoning tasks is consistent and predictable, defining a clear boundary for its operational role. The appropriate architectural pattern is a narration layer: programmatic analysis computes the facts, and the model turns them into readable prose. This pattern preserves ethical provenance while delivering practical utility within the Macroscope intelligence pipeline.
10. References
[1] Common Pile Project (2024). "Common Pile v0.1: A Public Domain Training Corpus." https://huggingface.co/datasets/common-pile/common-pile-v0.1 (accessed April 2, 2026).
[2] Common Pile Project (2024). "Comma v0.1-2T." https://huggingface.co/common-pile/comma-v0.1-2t (accessed April 2, 2026).
[3] Willison, S. (2024). "comma-v0.1-2t-mlx: MLX-format Comma model." https://huggingface.co/simonw/comma-v0.1-2t-mlx (accessed April 2, 2026).
[4] Apple (2024). "MLX: An Array Framework for Apple Silicon." https://github.com/ml-explore/mlx (accessed April 2, 2026).
[5] Hamilton, M. P. (2026). "Towards an Ethical AI." Coffee with Claude. https://coffeewithclaude.com (accessed April 2, 2026).
[6] Hamilton, M. P. (2026). "SOMA: Stochastic Observatory for Mesh Awareness." CNL-TN-2026-027. Canemah Nature Laboratory.
[7] Hamilton, M. P. (2026). "Organelle Convergence Architecture for MNG." CNL-FN-2026-026. Canemah Nature Laboratory.
Appendix A: Hardware Configuration
| Parameter | Value |
|---|---|
| Machine | MacBook Pro M4 Max |
| CPU | Apple M4 Max (16-core) |
| GPU | Apple M4 Max (40-core, Metal 3) |
| Memory | Unified 128 GB |
| OS | macOS (current) |
| Python | 3.12 (miniforge3) |
| PHP | 8.5.4 (Homebrew) |
| MySQL | 8.4 (local) |
| Apache | 2.4.66 (Homebrew) |
| Model | Comma v0.1-2T (7B params, BF16, ~15 GB) |
| Inference | mlx-lm via Apple MLX (Metal GPU) |
Table 4. Hardware and software configuration.
Document History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-04-02 | Initial release |
Cite This Document
Permanent URL: https://canemah.org/archive/document.php?id=CNL-TN-2026-039