CNL-WP-2026-022 Working Paper

Toward a Large Sensor Model for Ecological Perception

Published: February 12, 2026
Version: 1

Toward a Large Sensor Model: From GraphCast to Ecological Foundation Models

From Weather Prediction to Ecosystem Perception

Document ID: CNL-WP-2026-022
Date: February 12, 2026
Author: Michael P. Hamilton, Ph.D.
Project: Macroscope Ecological Observatory


AI Assistance Disclosure: This field note was developed with assistance from Claude (Anthropic). The AI contributed to literature synthesis and manuscript drafting through extended dialogue. The author takes full responsibility for the content, accuracy, and conclusions.


Abstract

Google DeepMind’s GraphCast demonstrated that a machine learning model trained on 39 years of atmospheric reanalysis data can outperform physics-based weather forecasting systems on 90% of verification targets. SOMA (Stochastic Observatory for Mesh Awareness), operating at Canemah Nature Laboratory, demonstrated that Restricted Boltzmann Machines trained on ecological sensor data can detect cross-domain anomalies invisible to single-domain analysis. This field note argues that these results converge on a single architectural trajectory: a Large Sensor Model (LSM) for multi-domain ecological perception, analogous to a Large Language Model but trained on continuous sensor streams rather than text tokens. Where GraphCast learns the statistical structure of the atmosphere across planetary scale, an LSM would learn the statistical structure of an ecosystem across multiple observational domains—weather, biodiversity, habitat, and human presence—at site scale. The Boltzmann machine weights learned by SOMA represent seed initialization for such a model.


1. The GraphCast Precedent

GraphCast (Lam et al. 2023) replaced physics-based numerical weather prediction with learned statistical structure. The model is a graph neural network operating over an icosahedral mesh representing Earth’s surface, predicting 227 atmospheric variables at 0.25° resolution over 10-day horizons. Training required no fluid dynamics equations—only 39 years of ERA5 reanalysis data from ECMWF. The model generates 10-day forecasts in under one minute on a single machine, compared to hours on supercomputers for conventional systems.

The architectural insight is that atmospheric dynamics, despite their apparent complexity, occupy a learnable manifold. The joint probability distribution over atmospheric state variables is structured enough that a neural network can internalize it from historical examples alone. GraphCast does not simulate physics; it recognizes patterns.

GenCast (Price et al. 2023) extended this with diffusion-based ensemble forecasting, generating probabilistic predictions that capture forecast uncertainty—a capability that maps directly to the energy-based probabilistic framework already implemented in SOMA.

2. SOMA as Proof of Concept

SOMA implements three Restricted Boltzmann Machine meshes at Canemah Nature Laboratory, trained on 118 days of weather and acoustic biodiversity data (Hamilton 2026). The ecosystem mesh—65 visible nodes encoding both weather variables and species detection patterns, connected to 100 hidden nodes—successfully detected a cross-domain anomaly where weather and species conditions were individually unremarkable but their combination violated learned expectations.

This result establishes a critical principle: joint distribution modeling across ecological domains captures structure that domain-specific monitoring misses. The RBM weights encode learned correlations between atmospheric conditions and biological activity—the statistical signature of how energy flows through a landscape and how organisms respond.
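
To make the mechanism concrete, the sketch below scores a joint state by RBM free energy, the quantity underlying the energy-based tension SOMA reports. Only the dimensions (65 visible nodes, 100 hidden) come from the text; the weights and the example state are random placeholders, not SOMA's trained parameters.

    import numpy as np

    # Illustrative RBM free-energy scorer; dimensions follow the
    # ecosystem mesh (65 visible, 100 hidden). Weights are random
    # placeholders rather than SOMA's trained parameters.
    rng = np.random.default_rng(0)
    n_visible, n_hidden = 65, 100
    W = rng.normal(0, 0.01, (n_visible, n_hidden))  # visible-hidden weights
    a = np.zeros(n_visible)                         # visible biases
    b = np.zeros(n_hidden)                          # hidden biases

    def free_energy(v):
        # F(v) = -a.v - sum_j log(1 + exp(b_j + v.W_j)).
        # Lower free energy means the joint state is more expected;
        # a high value on a state whose parts look individually
        # ordinary is exactly the cross-domain tension described above.
        return -v @ a - np.logaddexp(0.0, b + v @ W).sum()

    v_joint = rng.integers(0, 2, n_visible).astype(float)  # binary state vector
    print("energy-based tension:", free_energy(v_joint))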

3. The Architectural Bridge

The path from Boltzmann machines to deep neural networks is historically foundational. Hinton’s 2006 breakthrough demonstrated that deep networks could be trained by stacking pre-trained RBMs, with each layer’s weights providing initialization for the next. SOMA’s trained meshes sit at precisely this starting point.
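
A minimal sketch of that greedy stacking procedure, assuming contrastive-divergence (CD-1) training and omitting bias terms for brevity; the layer widths and training data below are placeholders, not SOMA's configuration:

    import numpy as np

    rng = np.random.default_rng(1)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    def train_rbm(data, n_hidden, epochs=10, lr=0.05):
        # CD-1: a single Gibbs step approximates the model's
        # negative phase. Biases are omitted to keep the sketch short.
        W = rng.normal(0, 0.01, (data.shape[1], n_hidden))
        for _ in range(epochs):
            h_prob = sigmoid(data @ W)
            h = (rng.random(h_prob.shape) < h_prob).astype(float)
            v_recon = sigmoid(h @ W.T)
            h_recon = sigmoid(v_recon @ W)
            W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        return W

    # Greedy stacking: each layer's hidden activations become the
    # next layer's training data, exactly the Hinton 2006 recipe.
    data = rng.integers(0, 2, (500, 65)).astype(float)  # placeholder states
    stack = []
    for n_hidden in (100, 80, 60):  # illustrative layer widths
        W = train_rbm(data, n_hidden)
        stack.append(W)
        data = sigmoid(data @ W)    # propagate representations upward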

A Large Sensor Model scales the architecture along three axes:

Depth. Stack multiple layers where each encodes a different temporal scale. The deepest layers capture multi-year climate normals; middle layers encode seasonal rhythms; surface layers encode daily cycles and momentary state. This temporal topology—described in CNL-TN-2026-014—embeds time in architecture rather than data summaries.

Width. Expand from 65 visible nodes to thousands. Current SOMA encodes weather (35 nodes) and bird species (27 nodes) plus temporal context. At scale, visible nodes encompass all four Macroscope domains: EARTH (weather at multiple temporal lags, derived variables, multi-site readings), LIFE (species-hour activity matrices, phenological markers, camera detections), HOME (indoor environment, energy systems, infrastructure state), and SELF (health metrics, activity patterns, cognitive rhythms). At 10,000 visible nodes, the model perceives the full Macroscope as a single energy landscape.
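
One hypothetical layout for such a widened visible layer is sketched below. Only the 10,000-node total and the current counts (35 weather, 27 species) appear in the text; the per-domain block widths here are placeholders.

    # Hypothetical block widths for a widened visible layer; only
    # the 10,000-node total comes from the text.
    DOMAIN_BLOCKS = {
        "EARTH": 2500,  # weather at multiple lags, derived variables, multi-site
        "LIFE":  4500,  # species-hour matrices, phenology, camera detections
        "HOME":  2000,  # indoor environment, energy systems, infrastructure
        "SELF":  1000,  # health metrics, activity patterns, cognitive rhythms
    }

    def domain_slices(blocks):
        # Map each domain to its slice of the flat visible vector.
        slices, start = {}, 0
        for name, width in blocks.items():
            slices[name] = slice(start, start + width)
            start += width
        return slices, start

    slices, n_visible = domain_slices(DOMAIN_BLOCKS)
    assert n_visible == 10_000
    # e.g. v[slices["LIFE"]] selects the biological portion of a state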

Attention. Replace the bipartite RBM architecture with transformer-style self-attention, where each input dimension can attend to every other. The multi-head attention mechanism naturally discovers cross-domain couplings: some heads learn EARTH-EARTH correlations (pressure-wind dynamics), others learn EARTH-LIFE couplings (weather-species relationships), others learn HOME-SELF interactions (indoor environment-health connections). These relationships emerge from training rather than specification.
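
The sketch below shows the mechanism in minimal form: one layer of multi-head self-attention over per-domain embeddings, with untrained random projections. In a trained LSM the heads would settle into the cross-domain specializations described above on their own; nothing here hard-codes them.

    import numpy as np

    rng = np.random.default_rng(2)

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, n_heads=4):
        # Every token attends to every other; each head computes its
        # own token-token coupling matrix and is free to specialize
        # (EARTH-EARTH, EARTH-LIFE, HOME-SELF, ...) during training.
        n_tokens, d_model = X.shape
        d_head = d_model // n_heads
        out = np.zeros_like(X)
        for h in range(n_heads):
            Wq, Wk, Wv = (rng.normal(0, d_model ** -0.5, (d_model, d_head))
                          for _ in range(3))
            Q, K, V = X @ Wq, X @ Wk, X @ Wv
            A = softmax(Q @ K.T / np.sqrt(d_head))  # attention weights
            out[:, h * d_head:(h + 1) * d_head] = A @ V
        return out

    # One embedding per domain, purely illustrative; a real LSM
    # would tokenize the sensor streams far more finely.
    X = rng.normal(size=(4, 32))  # EARTH, LIFE, HOME, SELF
    Y = multi_head_attention(X)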

4. Divergence from Language Models

The analogy between an LSM and an LLM is structurally tight, but the two domains diverge in ways that favor sensors.

Continuous vs. discrete input. LLMs operate on discrete tokens; sensor streams are continuous and temporally structured. This richer input structure means the model can learn physically grounded dynamics, regularities constrained by how energy and matter actually move through a site, rather than mere statistical co-occurrence.

Lower entropy. The number of meaningfully different ecosystem states is far smaller than the number of meaningful English sentences. The learnable manifold is more compact, suggesting that effective training requires orders of magnitude less data than language modeling.

Prediction as perception. An LLM generates text; an LSM generates predictions of ecosystem state. When tomorrow’s observations deviate from the model’s prediction, the prediction error is the anomaly signal—the same principle SOMA implements with energy-based tension, but with the representational power of a deep network.
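
A minimal sketch of that signal, assuming residual statistics estimated from historical prediction errors; the arrays below are placeholders:

    import numpy as np

    def surprise(predicted, observed, resid_mean, resid_std):
        # Z-score each channel's prediction error against its
        # historical residual distribution; the RMS of the z-scores
        # is the scalar anomaly signal.
        z = (observed - predicted - resid_mean) / (resid_std + 1e-8)
        return np.sqrt(np.mean(z ** 2)), z

    # Placeholder values: one channel deviates by four historical
    # standard deviations while the rest match the prediction.
    predicted = np.zeros(65)
    observed = np.zeros(65)
    observed[12] = 4.0
    score, z = surprise(predicted, observed, np.zeros(65), np.ones(65))
    print("surprise:", score)  # large -> anomaly; small -> normalcy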

Transfer learning. A model trained on Canemah data could transfer to Owl Farm. Deep layers encoding general ecological dynamics—how weather affects biology, how seasons structure activity—should generalize across sites. Shallow layers retrain on local species assemblages, microclimate, and phenology. This is the foundation model paradigm: pre-train on general structure, fine-tune on specific context.
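
A sketch of that split, reusing the illustrative layer shapes from the stacking example in Section 3. Following the text's convention, the input-adjacent (shallow) layers are site-specific and re-initialized, while the deep layers carry general dynamics and stay frozen:

    import numpy as np

    rng = np.random.default_rng(3)

    # Stand-in for a stack pre-trained at the source site (Canemah);
    # shapes reuse the illustrative widths from Section 3.
    pretrained = [rng.normal(0, 0.01, s)
                  for s in [(65, 100), (100, 80), (80, 60)]]

    def split_for_transfer(stack, n_shallow=1):
        # Re-initialize the shallow, input-adjacent layers for the
        # target site's local species, microclimate, and phenology;
        # copy the deep, general layers and hold them fixed.
        shallow = [rng.normal(0, 0.01, W.shape) for W in stack[:n_shallow]]
        deep = [W.copy() for W in stack[n_shallow:]]
        return shallow, deep

    # Fine-tuning at the target site (e.g. Owl Farm) would update
    # `shallow` on local data while leaving `deep` frozen.
    shallow, deep = split_for_transfer(pretrained)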

5. What an LSM Produces

GraphCast predicts the atmosphere. An LSM predicts the ecosystem. Not forecasting through differential equations, but pattern completion: given the last week of multi-domain sensor data, the model completes the sequence with what should come next. The output is a predicted state vector spanning all four domains—expected weather, expected species activity, expected indoor conditions, expected human patterns.

The deviation between prediction and observation becomes the primary signal. Large deviations indicate surprise: something the ecosystem is doing that it has never done before in this context. Small deviations indicate normalcy: the landscape feels right.

This is the embodied perception described in CNL-TN-2026-014, scaled from a single RBM layer to a deep architecture capable of capturing the full complexity of ecosystem dynamics.

6. Current Position

The components exist. SOMA provides trained Boltzmann machine weights encoding weather-species correlations at Canemah. GraphCast provides architectural precedent for learning complex physical dynamics from observational data using graph neural networks in JAX. The Macroscope archive contains two years of continuous multi-domain sensor data. The M4 Max hardware with Metal acceleration supports scaling to thousands of nodes.

The next steps are: expand SOMA’s visible layer to incorporate the full sensor complement across all four domains; implement stacked RBM pre-training to initialize a deeper architecture; evaluate transformer-based attention mechanisms for cross-domain coupling discovery; and develop transfer learning protocols for multi-site deployment.

No one has yet built a multi-domain ecological foundation model trained on continuous ground-based sensor streams. The precedent shows the approach works for the atmosphere. The proof of concept shows it works for cross-domain ecosystem perception. The architecture is clear.


References

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation 18(7): 1527-1554.

Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., et al. (2023). Learning skillful medium-range global weather forecasting. Science 382(6677): 1416-1421.

Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., et al. (2023). GenCast: Diffusion-based ensemble forecasting for medium-range weather. arXiv:2312.15796.


End of Field Note

Cite This Document

Hamilton, M. P. (2026). "Toward a Large Sensor Model for Ecological Perception." Canemah Nature Laboratory Working Paper CNL-WP-2026-022v1. https://canemah.org/archive/CNL-WP-2026-022v1

BibTeX

@techreport{cnl2026toward,
  author      = {Hamilton, Michael P.},
  title       = {Toward a Large Sensor Model for Ecological Perception},
  institution = {Canemah Nature Laboratory},
  type        = {Working Paper},
  number      = {CNL-WP-2026-022},
  year        = {2026},
  month       = feb,
  url         = {https://canemah.org/archive/document.php?id=CNL-WP-2026-022},
  abstract    = {Google DeepMind’s GraphCast demonstrated that machine learning models trained on historical atmospheric data can outperform physics-based weather forecasting on 90\% of verification targets. BioAnalyst (Trantas et al. 2025) demonstrated that multimodal foundation models can learn joint species-climate representations from satellite and occurrence data at continental scale. SOMA (Stochastic Observatory for Mesh Awareness), operating at Canemah Nature Laboratory, demonstrated that energy-based models trained on ecological sensor data can detect cross-domain anomalies invisible to single-domain analysis. This proposal argues that these results converge on an unrealized architectural target: a Large Sensor Model (LSM) for multi-domain ecological perception—a foundation model trained not on gridded reanalysis products or satellite imagery but on continuous ground-truth environmental sensor streams, learning the joint probability distribution over atmospheric and biological variables at the temporal and spatial resolution where ecological interactions actually occur. Critically, the infrastructure to train such a model already exists. Consumer weather station networks—WeatherFlow-Tempest, Ambient Weather, and Davis Instruments—collectively operate over half a million standardized stations worldwide with API access. BirdWeather maintains approximately 2,000 active acoustic monitoring stations running BirdNET species classification. iNaturalist has accumulated over 250 million verifiable observations—170 million at research grade—spanning all major taxonomic groups with geolocated timestamps. These platforms collectively provide the training corpus for a foundation model of ecosystem dynamics, requiring no new hardware deployment. We propose a phased development path: beginning with paired-site validation (Canemah, Oregon and Bellingham, Washington), expanding through regional recruitment of existing weather station and BirdWeather operators in the Pacific Northwest, and scaling organically as the network demonstrates value—following the grassroots trajectory that built iNaturalist from a graduate student project into a global biodiversity platform with 4 million observers. The resulting model would learn weather-biodiversity coupling across ecological gradients, producing predictions of ecosystem state whose deviations from observation constitute an anomaly detection system of unprecedented scope.}
}

Permanent URL: https://canemah.org/archive/document.php?id=CNL-WP-2026-022