Ethical AI Lab: Evaluating Local Ecological Summarization with Comma v0.1 on Apple MLX
Document ID: CNL-TN-2026-039 Version: 1.0 Date: April 2, 2026 Author: Michael P. Hamilton, Ph.D.
AI Assistance Disclosure: This technical note was developed with assistance from Claude (Anthropic, Claude Opus 4.6) operating in Cowork mode within the Claude Desktop application. Claude contributed to system architecture, code development (PHP, Python, JavaScript, CSS), debugging, browser-based testing, and manuscript drafting. The author takes full responsibility for the content, accuracy, and conclusions.
Keywords: ethical AI, Comma v0.1, Common Pile, Apple MLX, local inference, ecological summarization, Macroscope
Abstract
This technical note documents the design, implementation, and evaluation of the Ethical AI Lab, an experimental dashboard that connects the Macroscope sensor network to a locally-running, ethically-sourced language model for ecological data summarization. The system uses Comma v0.1-2T, a 7-billion parameter model trained exclusively on public domain and openly licensed text (Common Pile v0.1), running via Apple MLX on an M4 Max processor. The dashboard fetches live monitoring data from three sensor types (WeatherFlow Tempest, BirdWeather acoustic network, and Ecowitt garden station) through the macroscope database, assembles structured prompts with ecological context from macroscope_nexus curated place records, and sends them to the local model for summarization. Evaluation reveals that Comma produces serviceable prose synthesis of structured data but fails reliably on quantitative operations including ranking, comparison, and arithmetic. These findings define the operational envelope for small, ethically-sourced models in ecological monitoring: useful as a data-to-prose narration layer when paired with programmatic analysis, but unsuitable for independent analytical reasoning. The complete system runs without cloud dependency, using no data from proprietary or ethically questionable training sources.
1. Introduction
The integration of large language models into ecological monitoring systems raises two distinct challenges: technical capability and ethical provenance. Most capable models are trained on datasets of uncertain origin, incorporating copyrighted material, personal data scraped without consent, and content from communities that received no compensation or attribution. For a research program committed to transparency and ethical practice, the question is not merely whether an AI model can summarize sensor data, but whether it can do so without complicity in the extractive data practices that characterize the current AI industry.
This technical note documents an experiment to answer that question. The Ethical AI Lab is a prototype dashboard that connects the Macroscope sensor network at the Canemah Nature Laboratory to Comma v0.1-2T, a 7-billion parameter language model trained exclusively on the Common Pile v0.1 corpus, which contains only public domain works and text released under open licenses. The model runs locally on Apple Silicon via the MLX framework, with no cloud dependency and no data leaving the local network.
The objectives of this experiment were threefold: (1) to establish whether a small, ethically-sourced model can produce useful ecological data summaries, (2) to identify the specific capabilities and limitations of such a model through systematic evaluation, and (3) to define the architectural pattern by which a local model can be integrated into the Macroscope intelligence pipeline.
2. Background
2.1 The Ethical Provenance Problem
Commercial language models from OpenAI, Google, Meta, and others are trained on massive web scrapes that include copyrighted books, news articles, academic papers, and personal content harvested without consent. Recent legal settlements and leaked documents have confirmed the scale of this appropriation. For the Canemah Nature Laboratory, which operates under a commitment to transparent, ethically-grounded research, these models present an unacceptable dependency regardless of their technical capability.
2.2 Common Pile and Comma v0.1
The Common Pile v0.1 is a curated text corpus assembled by the Pile project community, containing only public domain works and text released under permissive open licenses (CC-BY, CC-BY-SA, and similar). The Comma v0.1-2T model was trained on this corpus by the Common Pile project, producing a 7-billion parameter LLaMA-architecture model with a 16,384-token context window. While modest by current standards, this model represents one of the few options for language model inference that can claim clean ethical provenance.
2.3 Apple MLX
MLX is Apple's machine learning framework for Apple Silicon, providing GPU-accelerated inference through the Metal API. The mlx-lm package provides a Python interface for loading and running language models in MLX format. This eliminates the need for intermediate frameworks like Ollama, running the model directly on the M4 Max GPU with minimal overhead.
3. System Architecture
The Ethical AI Lab consists of four layers: a PHP web dashboard, a MySQL data layer spanning two databases, a Python inference server, and the Comma model running on MLX. All components run on a single machine (MacBook Pro M4 Max) with no external network dependencies.
3.1 Component Overview
| Component | Technology | Role | Notes |
|---|---|---|---|
| Web Dashboard | PHP 8.5, HTML5, vanilla JS | User interface | Place search, summarize, chat, raw data display |
| Place Registry | macroscope_nexus (MySQL 8.4) | Ecological context | Curated places, monitoring sources, narratives |
| Sensor Data | macroscope (MySQL 8.4) | Time-series data | Tempest, BirdWeather, Ecowitt readings |
| Inference Server | FastAPI + uvicorn (Python) | LLM gateway | Thread-locked, POST /generate endpoint |
| Language Model | Comma v0.1-2T via mlx-lm | Text generation | 7B params, 16K context, Metal GPU |
Table 1. System components and their roles.
3.2 Data Flow
The summarization pipeline proceeds as follows: (1) the user selects a curated place from the searchable dropdown, (2) PHP queries macroscope_nexus for the place record including description, curated narrative, category, and elevation, (3) PHP queries monitoring_sources for active sensors at that place, (4) for each sensor, PHP queries the macroscope database for the most recent readings using source-type-specific fetcher functions, (5) PHP assembles a structured prompt combining ecological context with current sensor data, (6) the prompt is sent via HTTP POST to the local Comma server at 127.0.0.1:8000, (7) the raw completion is post-processed by a regex-based cleanup function that truncates at known junk markers, and (8) the cleaned summary is returned to the browser alongside the raw data and the full prompt for transparency.
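Steps 2 through 5 of the pipeline can be sketched as follows. This is an illustrative Python sketch of the prompt-assembly step only; the production implementation is PHP (lib/llm.php), and the field names and prompt labels below are hypothetical, not the actual schema.

```python
# Hypothetical sketch of prompt assembly (step 5). The real pipeline does
# this in PHP; field names and prompt labels here are illustrative.

def build_summary_prompt(place: dict, readings: dict) -> str:
    """Combine curated ecological context with current sensor readings."""
    lines = [
        f"PLACE: {place['name']} (elevation {place['elevation_m']} m)",
        f"CONTEXT: {place['narrative']}",
        "CURRENT CONDITIONS:",
    ]
    for source, values in readings.items():
        pairs = ", ".join(f"{k}: {v}" for k, v in values.items())
        lines.append(f"- {source}: {pairs}")
    lines.append("TASK: Summarize the current conditions in two or three "
                 "sentences, using only the data above.")
    return "\n".join(lines)

# Example inputs (illustrative values, not a live database query)
place = {"name": "Canemah Nature Laboratory", "elevation_m": 40,
         "narrative": "Willamette Valley oak savanna remnant."}
readings = {"tempest": {"temp_f": 46, "humidity_pct": 87},
            "birdweather": {"detections": 709, "species": 24}}
prompt = build_summary_prompt(place, readings)
```

The assembled string is what the POST to 127.0.0.1:8000 would carry, and it is the same text the dashboard exposes in its transparency panel.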
3.3 Chat Interface
In addition to the one-shot summarization, the dashboard provides a chat interface that allows the user to ask questions about the current conditions data. Each question triggers a fresh data fetch and prompt assembly, ensuring the model always sees the most recent sensor readings. The chat prompt includes the full ecological context (place description, curated narrative, coordinates, elevation) alongside the sensor data, grounded by the instruction to answer using only the provided data.
3.4 Post-Processing Pipeline
Small language models exhibit characteristic failure modes including self-evaluation loops, prompt echoing, and self-generated Q&A sequences. The cleanSummary() function in PHP addresses these through sequential filtering: (1) truncation at known stop markers (RESPONSE:, SYSTEM:, USER:, Example, etc.), (2) detection and truncation at self-generated Question: patterns, (3) truncation at duplicate Answer: markers, and (4) sentence-boundary trimming for longer outputs with passthrough for short direct answers. This post-processing is essential for producing usable output from the model.
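The filtering sequence can be sketched in Python as follows. The production cleanSummary() is PHP; this sketch assumes an illustrative marker list and an arbitrary 80-character threshold for "longer outputs", neither of which is taken from the actual implementation.

```python
import re

# Python sketch of the cleanSummary() filtering described above. The real
# function is PHP (lib/llm.php); the marker list and the 80-character
# "long output" threshold are illustrative assumptions.
STOP_MARKERS = ["RESPONSE:", "SYSTEM:", "USER:", "Example", "Question:"]

def clean_summary(text: str) -> str:
    # (1)-(2) Truncate at the earliest occurrence of any known junk marker,
    # including self-generated "Question:" patterns.
    cut = len(text)
    for marker in STOP_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    text = text[:cut].strip()
    # (3) Truncate at a duplicate "Answer:" marker, keeping the first answer.
    parts = text.split("Answer:")
    if len(parts) > 2:
        text = "Answer:".join(parts[:2]).strip()
    # (4) Trim longer outputs back to the last sentence boundary; short
    # direct answers pass through unchanged.
    if len(text) > 80:
        m = re.search(r"[.!?](?=[^.!?]*$)", text)
        if m:
            text = text[:m.end()].strip()
    return text
```

For example, a completion like "It is sunny. RESPONSE: The assistant's response is accurate." would be cut back to "It is sunny."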
4. Implementation Details
4.1 File Structure
| File | Purpose |
|---|---|
| index.php | Dashboard page with place search, summary panel, chat, raw data |
| js/app.js | Frontend logic: place search, summarize/chat requests, rendering |
| css/style.css | Lab aesthetic: minimal, readable, dark-on-light |
| lib/env.php | Environment bootstrap, sets SECURE_BASE for credentials |
| lib/database.php | Dual connections: nexusConnect(), macroConnect() |
| lib/fetchers.php | Source-type fetchers: fetchWeather(), fetchBirds(), fetchGarden() |
| lib/llm.php | Prompt builders, Comma client, cleanSummary() post-processor |
| api/places.php | Place list and detail endpoints |
| api/conditions.php | Per-source condition fetching |
| api/summarize.php | Full pipeline: fetch, prompt, call Comma, return |
| api/chat.php | Q&A endpoint: question + conditions + context to Comma |
| comma-server.py | FastAPI wrapper for Comma via mlx-lm with thread lock |
Table 2. Project file structure.
4.2 Inference Server
The Comma server (comma-server.py) is a minimal FastAPI application that loads the model at startup and exposes a POST /generate endpoint. A critical implementation detail is the threading.Lock() that serializes all inference calls. MLX Metal operations are not thread-safe; concurrent requests to the model cause GPU command buffer collisions that crash the process with a segmentation fault. The lock ensures only one inference runs at a time, with a 60-second timeout for queued requests. A GET /health endpoint provides liveness checking.
4.3 Token Budget
Comma v0.1-2T supports a 16,384-token context window. The summarize prompt for the Canemah Nature Laboratory, including the full curated narrative and data from all five monitoring sources, measures approximately 6,500 characters, or roughly 1,600 to 1,860 tokens. With the 256-token generation limit, each request consumes approximately 11 to 13 percent of the available context window. This leaves substantial headroom for historical trend data, longer species lists, or multi-turn conversation history.
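The budget arithmetic can be checked directly. The token counts below are the estimates quoted in the text (character-based approximations), not measured tokenizer output.

```python
# Context-window budget check using the figures quoted above. The token
# estimates (1,600-1,860 for a ~6,500-character prompt) are the text's
# approximations, not tokenizer measurements.
CONTEXT_WINDOW = 16_384
GEN_LIMIT = 256

def budget_pct(prompt_tokens: int) -> float:
    """Percent of the context window consumed by one request."""
    return 100.0 * (prompt_tokens + GEN_LIMIT) / CONTEXT_WINDOW

low, high = budget_pct(1_600), budget_pct(1_860)
# low is about 11.3%, high about 12.9% of the window per request.
```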
5. Evaluation Results
The following results are drawn from live testing against the Canemah Nature Laboratory place record, which registers five monitoring sources: three supported types (WeatherFlow Tempest weather station, BirdWeather acoustic monitor, and Ecowitt garden station) and two currently unsupported types (STRATA climate logger and panoramic camera).
5.1 Prose Synthesis (Summarization)
Comma produced coherent, factually accurate prose summaries of structured sensor data. A representative output:
The weather at the Canemah Nature Laboratory is currently sunny with a temperature of 46 degrees Fahrenheit, a humidity of 87%, and a wind speed of 3 mph. The bird activity is high with 709 detections and 24 species, including White-crowned Sparrow, Northern Flicker, Bushtit, Golden-crowned Sparrow, House Finch, Anna's Hummingbird, Dark-eyed Junco, and American Crow. The garden conditions are dry with a soil moisture of 66% and a leaf wetness of 0.
Every data point in this summary is accurately reported from the sensor readings. The model correctly characterized bird activity as "high" (709 detections across 24 species is above typical baseline) and garden conditions as "dry" (leaf wetness of 0). The prose reads naturally and would be serviceable as a field station status report.
5.2 Contextual Reasoning
When provided with the curated ecological narrative for Canemah, the model demonstrated an ability to synthesize place context with species observations. Asked "why might White-crowned Sparrows be so abundant here," Comma responded:
White-crowned Sparrows are abundant here because they are a common bird species in the Willamette Valley oak savanna ecoregion, which is a Nature Imperiled habitat type that has suffered vast losses since European settlement.
This response connects the species observation to the ecoregional context from the curated narrative, demonstrating that the model can ground its answers in the provided ecological context rather than relying solely on general knowledge.
5.3 Factual Lookup
For simple factual questions where the answer is directly present in the data, Comma performed well. "What is the most common bird being detected" correctly returned "White-crowned Sparrow" (293 detections, the highest count). "Is the soil temperature warmer than the air temperature" correctly returned "yes" (soil 47.7F vs. air 46F).
5.4 Quantitative Reasoning (Failure Mode)
Comma failed consistently on questions requiring comparison, ranking, or arithmetic across data values. Asked "what is the least detected bird this morning," the model returned "Anna's Hummingbird" (26 detections, mid-pack). On a second attempt with different phrasing, it returned "American Crow" (25 detections, also incorrect). In both cases, the model selected a plausible-sounding species rather than performing the min-value operation on the detection counts. This failure is consistent and represents a fundamental limitation of the model's reasoning capability.
5.5 Output Discipline (Failure Mode, Mitigated)
Without post-processing, Comma exhibited three characteristic failure modes: (1) self-evaluation, where the model appends meta-commentary such as "RESPONSE: The assistant's response is accurate and concise," (2) prompt echoing, where the model regurgitates the input data after the summary, and (3) self-generated Q&A loops, where the model generates its own questions and answers after responding to the actual query. All three modes are effectively mitigated by the cleanSummary() post-processing function, which truncates output at known markers and trims to sentence boundaries.
5.6 Capability Summary
| Capability | Rating | Notes |
|---|---|---|
| Prose synthesis from data | Good | Accurate, readable field report summaries |
| Contextual reasoning | Good | Connects species to ecoregional narrative |
| Simple factual lookup | Good | Correct when answer is obvious from data |
| Yes/no comparison | Fair | Correct for simple pairs, unreliable for close values |
| Ranking / min-max | Poor | Confabulates plausible answers instead of computing |
| Arithmetic | Poor | Cannot reliably add, subtract, or average values |
| Output discipline | Poor (raw) | Self-evaluation loops, prompt echoing, Q&A generation |
| Output discipline | Good (filtered) | cleanSummary() post-processor effectively mitigates |
Table 3. Capability assessment summary.
6. Technical Challenges
6.1 MLX Metal GPU Crashes
The initial mlx package version exhibited a Metal command buffer assertion failure (AGXG16X family) when running inference on the M4 Max GPU. The error manifested as a segmentation fault that killed the Python process. This was resolved by upgrading mlx and mlx-lm to the latest versions via pip. A secondary crash mode occurred when two HTTP requests hit the inference endpoint simultaneously, causing concurrent Metal operations. This was resolved by adding a threading.Lock() to serialize all model access.
6.2 mlx-lm API Instability
The mlx-lm package changed its generate() function signature between versions. The temperature parameter, previously a direct keyword argument, was moved to a sampler object via sample_utils.make_sampler(). The final implementation drops temperature control entirely, using model defaults, to avoid API compatibility issues.
6.3 PHP 8.5 Compatibility
PHP 8.5 deprecates curl_close(), which was previously called explicitly after each HTTP request. The deprecation warning was emitted as HTML before the JSON response body, causing frontend JSON parse failures. The fix was to remove the curl_close() call entirely, as PHP 8.0 and later auto-close curl handles.
7. Discussion
7.1 The Narration Layer Pattern
The evaluation results define a clear architectural role for small, ethically-sourced models in the Macroscope intelligence pipeline. The model functions as a narration layer: it translates structured data and pre-computed analytical results into readable prose. It does not perform the analysis itself. In the four-tier intelligence pipeline (Observation, Verification, Interpretation, Discovery), Comma operates at the boundary between Verification and Interpretation, converting verified data streams into human-readable summaries.
The critical design principle is that quantitative analysis must be performed programmatically in PHP before the results are passed to the model. Trend analysis, anomaly detection, statistical comparisons, and rankings should be computed by code (or by SOMA's RBM meshes) and presented to the model as pre-digested facts. The model's role is to assemble these facts into coherent prose, not to derive them.
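A minimal sketch of this division of labor follows. The species counts are an illustrative subset, not live sensor data, and the FACTS format is a hypothetical convention; the real pipeline would do this computation in PHP before calling Comma.

```python
# Sketch of the narration-layer pattern: code computes the quantitative
# facts, and only the pre-digested results are handed to the model.
# Counts below are an illustrative subset, not live data.
detections = {
    "White-crowned Sparrow": 293,
    "Anna's Hummingbird": 26,
    "American Crow": 25,
}

def precomputed_facts(counts: dict) -> str:
    """Do the ranking in code; the model only narrates the result."""
    most = max(counts, key=counts.get)
    least = min(counts, key=counts.get)
    total = sum(counts.values())
    return (f"FACTS: total detections {total}; "
            f"most detected {most} ({counts[most]}); "
            f"least detected {least} ({counts[least]}).")

facts = precomputed_facts(detections)
# The prompt then asks the model to narrate `facts`; it is never asked
# to rank the raw counts itself.
```

This is precisely the operation Comma failed in Section 5.4: min-value selection is trivial in code and unreliable in the model, so the boundary belongs in code.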
7.2 Ethical Provenance as Design Constraint
Operating under an ethical provenance constraint significantly limits model selection. Comma v0.1-2T is, to the author's knowledge, one of the few models that can credibly claim clean training data sourcing. The Allen AI OLMo models, while transparent about their Dolma training corpus, ultimately rely on Common Crawl web scrapes that include copyrighted content harvested without consent. Transparency about the scraping does not make the scraping ethical.
The performance ceiling of a 7B public-domain model is real but acceptable for the defined use case. The model does not need to reason about ecology, discover patterns, or generate hypotheses. It needs to turn numbers into sentences. For that specific task, Comma v0.1 is adequate.
7.3 Future Directions
Several extensions are planned: (1) a PHP-based trend analysis layer that computes rolling averages, deltas, and percentile rankings from historical sensor data, passing pre-computed trends to Comma for narration; (2) integration with SOMA anomaly detection, where free energy spikes and tension patterns are translated into natural language by the model; (3) expanded monitoring source support as additional sensor types come online in the Macroscope network; and (4) evaluation of larger Common Pile models as they become available.
8. Limitations
This evaluation is limited to a single place (Canemah Nature Laboratory) with three active sensor types. The model's performance on places with thinner ecological narratives or different sensor configurations has not been tested. The evaluation is qualitative rather than systematic; a rigorous benchmark would require a standardized question set with ground-truth answers. The model's behavior may vary across mlx-lm versions, and the post-processing heuristics are tuned to observed failure modes that may change with different model versions or prompting strategies.
9. Conclusion
The Ethical AI Lab demonstrates that a locally-running, ethically-sourced language model can provide useful ecological data summarization when properly constrained. Comma v0.1-2T, operating on Apple MLX without cloud dependency, produces accurate prose summaries of multi-source sensor data and can synthesize species observations with curated ecological context. Its failure on quantitative reasoning tasks is consistent and predictable, defining a clear boundary for its operational role. The appropriate architectural pattern is a narration layer: programmatic analysis computes the facts, and the model turns them into readable prose. This pattern preserves ethical provenance while delivering practical utility within the Macroscope intelligence pipeline.
10. References
[1] Common Pile Project (2024). "Common Pile v0.1: A Public Domain Training Corpus." https://huggingface.co/datasets/common-pile/common-pile-v0.1 (accessed April 2, 2026).
[2] Common Pile Project (2024). "Comma v0.1-2T." https://huggingface.co/common-pile/comma-v0.1-2t (accessed April 2, 2026).
[3] Willison, S. (2024). "comma-v0.1-2t-mlx: MLX-format Comma model." https://huggingface.co/simonw/comma-v0.1-2t-mlx (accessed April 2, 2026).
[4] Apple (2024). "MLX: An Array Framework for Apple Silicon." https://github.com/ml-explore/mlx (accessed April 2, 2026).
[5] Hamilton, M. P. (2026). "Towards an Ethical AI." Coffee with Claude. https://coffeewithclaude.com (accessed April 2, 2026).
[6] Hamilton, M. P. (2026). "SOMA: Stochastic Observatory for Mesh Awareness." CNL-TN-2026-027. Canemah Nature Laboratory.
[7] Hamilton, M. P. (2026). "Organelle Convergence Architecture for MNG." CNL-FN-2026-026. Canemah Nature Laboratory.
Appendix A: Hardware Configuration
| Parameter | Value |
|---|---|
| Machine | MacBook Pro M4 Max |
| CPU | Apple M4 Max (16-core) |
| GPU | Apple M4 Max (40-core, Metal 3) |
| Memory | Unified 128 GB |
| OS | macOS (current) |
| Python | 3.12 (miniforge3) |
| PHP | 8.5.4 (Homebrew) |
| MySQL | 8.4 (local) |
| Apache | 2.4.66 (Homebrew) |
| Model | Comma v0.1-2T (7B params, BF16, ~15 GB) |
| Inference | mlx-lm via Apple MLX (Metal GPU) |
Table 4. Hardware and software configuration.
Document History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-04-02 | Initial release |
Cite This Document
Permanent URL: https://canemah.org/archive/document.php?id=CNL-TN-2026-039