CNL-WP-2026-022 Working Paper

Toward a Large Sensor Model for Ecological Perception

Published: February 12, 2026 Version: 6

Leveraging Crowdsourced Environmental Infrastructure for a Foundation Model of Ecosystem Dynamics

Document ID: CNL-WP-2026-022
Date: February 12, 2026
Author: Michael P. Hamilton, Ph.D.
Project: Macroscope Ecological Observatory


AI Assistance Disclosure: This research proposal was developed with assistance from Claude (Anthropic). The AI contributed to literature synthesis, conceptual framework development, and manuscript drafting through extended dialogue. The author takes full responsibility for the content, accuracy, and conclusions.


Abstract

Google DeepMind’s GraphCast demonstrated that machine learning models trained on historical atmospheric data can outperform physics-based weather forecasting on 90% of verification targets. BioAnalyst (Trantas et al. 2025) demonstrated that multimodal foundation models can learn joint species-climate representations from satellite and occurrence data at continental scale. SOMA (Stochastic Observatory for Mesh Awareness), operating at Canemah Nature Laboratory, demonstrated that energy-based models trained on ecological sensor data can detect cross-domain anomalies invisible to single-domain analysis. This proposal argues that these results converge on an unrealized architectural target: a Large Sensor Model (LSM) for multi-domain ecological perception—a foundation model trained not on gridded reanalysis products or satellite imagery but on continuous ground-truth environmental sensor streams, learning the joint probability distribution over atmospheric and biological variables at the temporal and spatial resolution where ecological interactions actually occur.

Critically, the infrastructure to train such a model already exists. Consumer weather station networks—WeatherFlow-Tempest, Ambient Weather, and Davis Instruments—collectively operate over half a million standardized stations worldwide with API access. BirdWeather maintains approximately 2,000 active acoustic monitoring stations running BirdNET species classification. iNaturalist has accumulated over 250 million verifiable observations—170 million at research grade—spanning all major taxonomic groups with geolocated timestamps. These platforms collectively provide the training corpus for a foundation model of ecosystem dynamics, requiring no new hardware deployment.

We propose a phased development path: beginning with paired-site validation (Canemah, Oregon and Bellingham, Washington), expanding through regional recruitment of existing weather station and BirdWeather operators in the Pacific Northwest, and scaling organically as the network demonstrates value—following the grassroots trajectory that built iNaturalist from a graduate student project into a global biodiversity platform with 4 million observers. The resulting model would learn weather-biodiversity coupling across ecological gradients, producing predictions of ecosystem state whose deviations from observation constitute an anomaly detection system of unprecedented scope.


1. Introduction

1.1 The Problem

Ecological monitoring generates vast quantities of data with limited integration. Weather stations report atmospheric conditions. Acoustic monitors detect bird species. Camera traps record mammals. Citizen scientists photograph plants, insects, and fungi. Each stream is analyzed independently or, at best, correlated post hoc through statistical methods that require explicit hypothesis specification. No system learns the joint distribution over environmental and biological variables—the statistical structure of how ecosystems actually behave.

This limitation matters because ecological dynamics are fundamentally cross-domain. Barometric pressure affects bird vocalization patterns. Soil moisture modulates insect emergence. Photoperiod drives phenological transitions that cascade through food webs. These couplings are nonlinear, context-dependent, and vary across geography and season. Rule-based systems cannot enumerate them. Statistical correlation requires knowing what to look for. What is needed is a model that learns the structure from data, the way a field ecologist accumulates intuition through decades of observation.

1.2 The GraphCast Precedent

GraphCast (Lam et al. 2023), published in Science, demonstrated that a graph neural network trained on 39 years of atmospheric reanalysis data can predict 227 weather variables at 0.25° global resolution for 10-day horizons, outperforming physics-based forecasting on 90% of 1,380 verification targets. The model generates forecasts in under one minute on a single machine, versus hours on supercomputers for conventional numerical weather prediction.

GraphCast’s significance extends beyond weather. It proved that the statistical structure of a complex physical system—the atmosphere—can be learned from historical observations alone, without encoding any physics equations. The model operates on a graph neural network over an icosahedral mesh, where nodes represent spatial locations and edges encode how atmospheric state propagates between them. The learned weights embody the dynamics of a far-from-equilibrium thermodynamic system.
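The core operation of such a mesh network can be illustrated with a single message-passing step. The sketch below is a schematic in NumPy with random weights, not GraphCast's actual implementation; the node count, feature width, and ring topology are arbitrary choices for illustration.

```python
import numpy as np

def message_passing_step(node_states, edges, w_msg, w_self):
    """One round of mean-aggregated message passing over a mesh.

    node_states: (N, D) array of per-node feature vectors
    edges: list of (src, dst) index pairs
    w_msg, w_self: (D, D) weight matrices (learned in a real model, random here)
    """
    N, D = node_states.shape
    agg = np.zeros((N, D))
    count = np.zeros(N)
    for src, dst in edges:                  # propagate state along edges
        agg[dst] += node_states[src] @ w_msg
        count[dst] += 1
    count = np.maximum(count, 1)            # avoid divide-by-zero at isolated nodes
    # combine each node's own state with the mean of incoming messages
    return np.tanh(node_states @ w_self + agg / count[:, None])

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))            # 5 mesh nodes, 8 features each
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
w_msg = rng.normal(size=(8, 8)) * 0.1
w_self = rng.normal(size=(8, 8)) * 0.1
new_states = message_passing_step(states, edges, w_msg, w_self)
print(new_states.shape)
```

Stacking many such rounds is what lets information propagate across the mesh, so that each node's state comes to reflect conditions far beyond its immediate neighbors.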

GenCast (Price et al. 2023) extended this with diffusion-based ensemble forecasting, generating probabilistic predictions that quantify forecast uncertainty—a capability directly aligned with the energy-based probabilistic framework already demonstrated in ecological contexts.

1.3 SOMA: Proof of Concept at Site Scale

SOMA (Hamilton 2026) implements three Restricted Boltzmann Machine meshes at Canemah Nature Laboratory, trained on 118 days of weather and acoustic biodiversity data. The ecosystem mesh—65 visible nodes encoding weather variables and species detection patterns, connected to 100 hidden nodes—successfully detected a cross-domain anomaly where weather and species conditions were individually unremarkable but their combination violated learned expectations.

This result established a critical principle: joint distribution modeling across ecological domains captures structure that domain-specific monitoring misses. The RBM weights encode learned correlations between atmospheric conditions and biological activity. When incoming sensor data violates those learned expectations, the mesh registers mathematical tension rather than calculating a derived metric—embodied perception rather than representational comparison.
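The tension signal can be sketched as an RBM free-energy computation. The following is a minimal NumPy illustration using SOMA's mesh dimensions but random weights; the actual trained weights and input encoding are described in the cited report.

```python
import numpy as np

def free_energy(v, W, b_vis, b_hid):
    """RBM free energy: F(v) = -b_vis.v - sum_j log(1 + exp(b_hid_j + v.W_j)).

    Low free energy means the input conforms to learned expectations;
    a spike relative to the recent baseline registers as tension (anomaly).
    """
    hidden_in = b_hid + v @ W
    return -(v @ b_vis) - np.sum(np.logaddexp(0.0, hidden_in))

rng = np.random.default_rng(42)
n_vis, n_hid = 65, 100                     # SOMA's ecosystem mesh dimensions
W = rng.normal(scale=0.05, size=(n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)

v = rng.binomial(1, 0.5, size=n_vis).astype(float)  # one encoded sensor frame
f = free_energy(v, W, b_vis, b_hid)
print(f)
```

In deployment the anomaly signal is the free energy of the current frame relative to a rolling baseline of recent frames, which is what allows individually unremarkable inputs to register as jointly surprising.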

1.4 The Convergence

GraphCast is, in essence, a Large Sensor Model scoped to a single domain (atmosphere) at planetary scale. BioAnalyst is a foundation model that integrates biodiversity and climate from satellite and occurrence data at continental scale and monthly resolution. SOMA is a Large Sensor Model scoped to multiple domains (atmosphere and biodiversity) at site scale, operating on ground-truth sensor streams at minute-level resolution. The architectural trajectory connecting SOMA to a continental-scale ground-truth ecological foundation model—from site-scale Boltzmann machines through stacked deep networks to graph neural networks operating across geographic space—is well-established in the machine learning literature.

What has been missing is the recognition that the training infrastructure for such a model already exists, deployed and operating, generating data continuously, accessible through standardized APIs—and that it captures ecological dynamics at a fundamentally different resolution than the satellite-and-reanalysis approach that existing foundation models employ.


2. The Existing Infrastructure

2.1 Consumer Weather Station Networks

Three major consumer weather station platforms operate large-scale networks with API access:

WeatherFlow-Tempest operates over 85,000 stations worldwide, each reporting identical variables through a standardized REST and WebSocket API: temperature, humidity, barometric pressure, wind speed and direction, solar radiation, UV index, and precipitation at one-minute intervals. The stations are factory-calibrated with consistent sensor packages.

Ambient Weather Network supports more than 250,000 personal and professional weather stations, with both RESTful and real-time APIs providing JSON-formatted data. The network spans multiple hardware configurations but produces standardized output through the AmbientWeather.net platform.

Davis Instruments WeatherLink connects 170,000+ stations globally—60,000 in the United States alone—through the WeatherLink v2 API. Davis stations, particularly the Vantage Pro2 line, have been the standard for semi-professional weather monitoring for decades, and many stations have deep historical archives. The WeatherLink network includes 200,000+ registered users.

Collectively, these three platforms represent over half a million consumer weather stations producing continuous atmospheric data with API access. Individually, the stations are not research-grade; collectively, they are powerful. Known biases (solar radiation heating from poor siting, wind obstruction) are characterizable, and a foundation model trained on thousands of stations learns to see through these biases because the signal is consistent while the noise is random across deployments. The networks are densest in North America, Europe, and Australia.
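Pulling observations from one of these networks is straightforward in principle. The sketch below targets the Tempest REST API; treat the endpoint path and response schema as illustrative and confirm them against WeatherFlow's published developer documentation before use.

```python
import json
import urllib.parse
import urllib.request

# Endpoint pattern for the WeatherFlow-Tempest REST API (illustrative;
# verify the exact path against the published Tempest API docs).
TEMPEST_OBS_URL = "https://swd.weatherflow.com/swd/rest/observations/station/{station_id}"

def build_request(station_id, token):
    """URL for a station-observation request, authorized by the owner's token."""
    base = TEMPEST_OBS_URL.format(station_id=station_id)
    return base + "?" + urllib.parse.urlencode({"token": token})

def fetch_latest(station_id, token):
    """Fetch the most recent observation record for an opted-in station."""
    url = build_request(station_id, token)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Requires a real station ID and an owner's API token (the opt-in
    # participation model described in Section 2.5).
    print(build_request(12345, "YOUR_TOKEN"))
```

The token requirement is not incidental: it encodes the opt-in participation model, since only a station's owner can authorize its stream.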

2.2 BirdWeather Acoustic Network

BirdWeather operates approximately 2,000 active acoustic monitoring stations globally, running continuous BirdNET neural network classification against audio streams. Each station reports species detections with timestamps, confidence scores, and species identification for over 6,000 recognized species. The PUC (Physical Universe Codec) hardware includes dual microphones, environmental sensors, GPS, and on-board neural processing. BirdNET-Pi stations running on Raspberry Pi hardware extend the network further. The system produces continuous presence/absence data for avian species, the most ecologically informative vertebrate taxon for phenological and community monitoring.

The first large-scale scientific use of the BirdWeather detection library—a study of light pollution effects on bird vocalization timing across species, space, and seasons—demonstrates the platform’s research utility.

2.3 iNaturalist Biodiversity Observations

iNaturalist has accumulated over 250 million verifiable observations from nearly 4 million observers worldwide, with approximately 170 million observations achieving research-grade identification through community consensus. The platform’s computer vision model recognizes over 112,000 taxa.

For the purposes of an ecological foundation model, iNaturalist provides what acoustic monitoring cannot: observations across all major taxonomic groups—plants, insects, fungi, amphibians, reptiles, and mammals—with geolocated timestamps and, critically, phenological annotations. Flowering dates, fruiting times, insect emergence, migration arrivals—the temporal structure of ecological communities is encoded in the collective observation record. The iNaturalist API provides programmatic access to research-grade observations filtered by taxon, location, date, and quality grade.

The data are episodic rather than continuous—clustered around population centers and weekends, biased toward charismatic taxa—but these biases are characterizable and the sheer volume provides statistical power. A foundation model trained on this data does not need every observation to be equally reliable; it needs the aggregate distribution to be ecologically meaningful.
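Retrieving observations around a monitoring node is a simple query against the iNaturalist v1 API. The parameter names below (lat/lng/radius in kilometers, d1/d2 as date bounds, quality_grade) follow the published API, but should be verified against the current documentation; the coordinates and dates are illustrative.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.inaturalist.org/v1/observations"

def build_query(lat, lng, radius_km, taxon_id=None, d1=None, d2=None):
    """Query parameters for research-grade observations near a node."""
    params = {
        "quality_grade": "research",       # community-consensus identifications only
        "lat": lat, "lng": lng, "radius": radius_km,
        "order_by": "observed_on",
        "per_page": 200,
    }
    if taxon_id:
        params["taxon_id"] = taxon_id
    if d1:
        params["d1"] = d1                  # ISO date lower bound
    if d2:
        params["d2"] = d2                  # ISO date upper bound
    return params

def fetch(params):
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Observations within 10 km of a hypothetical Oregon City node, Feb 2025
    q = build_query(45.35, -122.60, 10, d1="2025-02-01", d2="2025-02-28")
    print(urllib.parse.urlencode(q))
```

Aggregating such queries per node and per month is how episodic observations become the phenological profiles described in Section 3.3.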

2.4 The Grassroots Infrastructure Opportunity

What distinguishes these three platforms from institutional monitoring networks is their origin: they were built by communities of observers who wanted to understand their own environments. Tempest owners are weather enthusiasts who check their stations daily. BirdWeather operators listen to their local soundscapes. iNaturalist observers know their neighborhood phenology with an intimacy that no remote system can match. This is not a limitation—it is the project’s central strength.

iNaturalist began as a master’s project at UC Berkeley in 2008. It grew not through institutional mandate but because it gave individual naturalists something they wanted: a way to identify, record, and share their observations. Nearly two decades later, it constitutes the largest biodiversity observation network in history. The trajectory is instructive. A foundation model for ecological perception need not begin with institutional partnerships and research-grade infrastructure. It needs to begin with a working system that gives individual monitoring station operators something valuable in return for their data: contextualized insight into what their local ecosystem is doing, and whether it is doing anything unusual.

2.5 The Participation Model

A critical design constraint: none of the weather station APIs provide unrestricted bulk access to other users’ data. Tempest limits API access to station owners. Davis WeatherLink restricts programmatic data retrieval to owned or shared stations. Ambient Weather offers the most open public data access but with reduced parameters. This is not a limitation to engineer around—it is the correct architecture for the project.

The LSM operates on an opt-in participation model. Each station owner authorizes their own data by connecting their API credentials to the network. In return, the model provides something no individual station can generate alone: contextualized ecological intelligence. Your weather station tells you it is 58°F with falling barometric pressure. The LSM tells you that this combination, at this time of year, in this bioregion, should be producing active bird vocalization—and if your BirdWeather station is recording silence instead, something ecologically significant may be happening.

This value exchange—your data for ecological context—is what drives network growth. It is the same mechanism that built iNaturalist: contribute an observation and receive an identification. Contribute a sensor stream and receive ecosystem perception. The network becomes more valuable to each participant as it grows, because more nodes mean richer context, finer spatial resolution, and more robust anomaly detection. This is a classic network effect, and it favors organic growth over institutional deployment.


3. Architecture

3.1 Domain Scope: EARTH and LIFE

We restrict the initial model to two domains: EARTH (atmospheric and environmental conditions) and LIFE (biodiversity patterns across taxa). This scoping decision is deliberate. Weather and biodiversity sensors produce relatively standardized, well-characterized data streams. The physics of weather-biology coupling is universal—every ecosystem on Earth experiences it. And the crowdsourced infrastructure described above provides dense coverage for precisely these two domains.

Excluding the Macroscope’s HOME and SELF domains eliminates idiosyncratic, non-generalizable data streams while focusing on what transfers across sites, ecosystems, and investigators. A model that learns how atmosphere and biosphere couple is scientifically general. A model that includes one person’s indoor temperature and sleep patterns is not.

3.2 Geographic Graph Structure

Each monitoring location becomes a node in a geographic graph. The graph structure is provided by biogeography itself—sites within the same ecoregion share species pools, climate drivers, and seasonal patterns. Edges connect nearby stations, with edge weights reflecting ecological similarity (shared species, correlated climate) rather than geographic distance alone.

A Tempest station in Portland and a Tempest station in Medford share Pacific Northwest weather patterns but differ in species communities. A BirdWeather station at sea level and one at 5,000 feet may be geographically close but ecologically distant. The graph neural network learns these relationships from data, discovering which connections carry ecological information.

This is the GraphCast architecture adapted to ecology. Where GraphCast’s icosahedral mesh tiles Earth’s surface uniformly, the ecological graph is irregular—dense where monitoring stations cluster, sparse where they do not. The model must learn to interpolate across gaps, a capability that graph neural networks handle naturally through message passing between connected nodes.
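A simple stand-in for ecological edge weighting combines species-pool overlap with weather correlation. In the sketch below the species codes, temperature series, and mixing parameter alpha are all fabricated for illustration; in the proposed LSM these weights would be learned from data, not prescribed.

```python
import numpy as np

def jaccard(a, b):
    """Species-pool overlap between two sites (sets of species codes)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def edge_weight(species_a, species_b, temps_a, temps_b, alpha=0.5):
    """Ecological similarity: a blend of shared species and correlated weather.

    alpha is an illustrative mixing parameter, not a learned quantity.
    """
    w_clim = np.corrcoef(temps_a, temps_b)[0, 1]
    return alpha * jaccard(species_a, species_b) + (1 - alpha) * max(w_clim, 0.0)

# Two hypothetical Pacific Northwest nodes sharing a seasonal climate driver
t = np.linspace(0, 2 * np.pi, 100)
portland_temps = 10 + 8 * np.sin(t)
medford_temps = 12 + 9 * np.sin(t)          # same driver, warmer site
w = edge_weight({"AMRO", "SOSP", "SPTO"}, {"AMRO", "SOSP", "CALT"},
                portland_temps, medford_temps)
print(round(w, 3))
```

The two example sites share climate almost perfectly but only half their species pool, so the edge weight lands between the two components, which is exactly the sea-level-versus-subalpine distinction described above.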

3.3 Multi-Stream Input Encoding

Each node receives three classes of input, each with different temporal characteristics:

Continuous streams (weather stations): Weather variables at one- to fifteen-minute intervals depending on platform. Temperature, humidity, pressure, wind, solar radiation, precipitation. These form the backbone temporal signal—the heartbeat of the EARTH domain.

Continuous acoustic streams (BirdWeather): Species detections at irregular intervals, aggregated into activity profiles per hour or per 15-minute window. Species presence/absence, detection confidence, vocalization timing. The LIFE domain’s continuous signal.

Episodic observations (iNaturalist): Species occurrences with timestamps, geolocated but temporally sparse. Plants, insects, fungi, mammals—taxa invisible to acoustic monitoring. Aggregated into monthly or seasonal phenological profiles per grid cell.

The architectural challenge is fusing these streams—continuous weather, continuous acoustics, and episodic community observations—into a coherent representation. Transformer architectures with learned temporal encodings handle variable sampling rates and irregular gaps naturally. Each input type receives its own positional encoding scheme: absolute timestamps for weather streams, event-based encoding for BirdWeather, and seasonal cyclical encoding for iNaturalist aggregates.
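The cyclical encodings mentioned above can be sketched directly: mapping a periodic quantity onto the unit circle makes the representation continuous across the wrap-around (23:59 to 00:00, December 31 to January 1). This is a simplified stand-in for the learned positional encodings; a real stream would also carry absolute-time features.

```python
import numpy as np

def cyclical_encoding(value, period):
    """Map a periodic quantity onto the unit circle, so the encoding is
    continuous across the wrap-around point."""
    angle = 2 * np.pi * (value / period)
    return np.array([np.sin(angle), np.cos(angle)])

def encode_timestamp(minute_of_day, day_of_year):
    """Concatenate diurnal and seasonal encodings for one sensor reading."""
    return np.concatenate([
        cyclical_encoding(minute_of_day, 1440),     # diurnal cycle (minutes/day)
        cyclical_encoding(day_of_year, 365.25),     # seasonal cycle (days/year)
    ])

enc_midnight = encode_timestamp(0, 1)               # 00:00 on January 1
enc_dawn = encode_timestamp(6 * 60, 120)            # 06:00 on day 120
print(enc_dawn)
```

Under this scheme, a reading at 23:59 sits next to one at 00:01 in encoding space, which is what lets the model treat the dawn chorus on consecutive days as neighboring events rather than distant ones.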

3.4 Temporal Hierarchy

Following the temporal topology described in CNL-TN-2026-014, the model embeds multiple timescales in its architecture:

Surface layer: Current conditions—the state of the atmosphere and biosphere right now. Updated with each sensor reading.

Diurnal layer: Daily patterns—dawn chorus timing, temperature cycling, nocturnal activity. Encodes what this time of day should feel like.

Seasonal layer: Phenological rhythms—when species should be active, what temperatures are normal for this week of the year. Encodes what this season should feel like.

Interannual layer: Climate context—ENSO state, long-term trends, multi-year baselines. Encodes what this year should feel like relative to the historical record.

Each layer provides context for the layers above it. A temperature anomaly at the surface layer is evaluated against the diurnal norm, the seasonal expectation, and the interannual trend simultaneously. The model does not report “temperature is 72°F” but rather “this February afternoon, in this La Niña year, feels warmer than it should.”
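Evaluating one reading against every layer simultaneously can be sketched as a set of z-scores against per-layer baselines. The baseline means and standard deviations below are fabricated numbers for illustration; in the LSM they would come from the learned diurnal, seasonal, and interannual layers.

```python
def layered_anomaly(reading, baselines):
    """Z-score a single reading against each temporal layer's baseline.

    baselines: {layer_name: (mean, std)} with illustrative values here.
    """
    return {layer: (reading - mu) / sigma
            for layer, (mu, sigma) in baselines.items()}

# A 72 F (22.2 C) February afternoon against hypothetical PNW baselines
scores = layered_anomaly(22.2, {
    "diurnal":     (14.0, 3.0),    # typical afternoon temperature this week
    "seasonal":    (9.0, 4.0),     # typical for mid-February
    "interannual": (8.5, 4.5),     # February mean across the archive
})
for layer, z in scores.items():
    print(f"{layer}: z = {z:+.2f}")
```

A reading that is mildly warm against today's diurnal norm but strongly warm against the seasonal and interannual baselines is precisely the "feels warmer than it should" judgment described above.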

3.5 Output: Prediction and Anomaly

The model’s primary output is a predicted state vector for each node at the next time step—expected weather conditions and expected biological activity given the current state and all contextual layers. The predicted state spans all input dimensions: expected temperature, expected pressure, expected species activity by hour, expected phenological state.

The deviation between prediction and observation is the anomaly signal. Large deviations indicate surprise: the ecosystem is doing something it has never done before in this context. The deviation can be decomposed by domain (is the surprise in weather, in species, or in their coupling?), by timescale (is this a daily anomaly or a seasonal one?), and by spatial extent (is this local to one node or propagating across the graph?).

This is SOMA’s tension signal scaled to continental scope with deep temporal context.
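The domain decomposition of the anomaly signal can be sketched with index masks over the state vector. The state dimensions and values below are fabricated, and RMS deviation stands in for whatever decomposition the trained model actually learns.

```python
import numpy as np

def decompose_anomaly(predicted, observed, domain_index):
    """Split the prediction-observation deviation by domain.

    domain_index: {domain_name: array of state-vector indices}. Returns
    RMS deviation per domain plus the total.
    """
    delta = observed - predicted
    parts = {d: float(np.sqrt(np.mean(delta[idx] ** 2)))
             for d, idx in domain_index.items()}
    parts["total"] = float(np.sqrt(np.mean(delta ** 2)))
    return parts

# Hypothetical 4-dim state: temperature, pressure, two species activity rates
predicted = np.array([15.0, 1012.0, 0.8, 0.6])
observed  = np.array([15.2, 1011.5, 0.1, 0.0])   # weather normal, birds silent
parts = decompose_anomaly(predicted, observed,
                          {"EARTH": np.array([0, 1]), "LIFE": np.array([2, 3])})
print(parts)
```

In this toy case the weather matches expectation while the biological activity does not, so the LIFE component dominates the deviation: a silent morning under ordinary weather, the signature cross-domain anomaly.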


4. Training Corpus

4.1 Scale

The combined data corpus is substantial:

Weather stations: 500,000+ stations across three API-accessible networks (Tempest, Ambient Weather, Davis WeatherLink), reporting at intervals from one minute to fifteen minutes, with archives spanning years to decades. At 10–15 variables per station, this represents billions of weather state vectors.

BirdWeather: 2,000+ stations × continuous detection streams × years of operation. Millions of species detection events with temporal and environmental context.

iNaturalist: 170+ million research-grade observations spanning all major taxa, geolocated and timestamped, with phenological annotations for plants.

4.2 Comparison to GraphCast

GraphCast trained on 39 years of ERA5 reanalysis data: 227 variables at ~1 million grid points, sampled every 6 hours. The ecological training corpus differs in structure—irregular spatial distribution, heterogeneous sampling rates, mixed continuous and episodic streams—but is comparable or larger in total information content.

The key advantage of ecological data is its lower entropy relative to atmospheric dynamics. The number of meaningfully different ecosystem states at a given location is far smaller than the number of meaningful atmospheric configurations. Weather is chaotic; ecosystems, while complex, are heavily constrained by biogeography, phenology, and energetics. The learnable manifold is more compact, suggesting that effective training may require less data per location than atmospheric modeling demands.

4.3 Data Quality and Bias

Consumer weather stations have known biases. BirdNET classifications carry false positive rates. iNaturalist observations cluster around cities and weekends. These limitations are real but manageable for two reasons.

First, systematic biases are characterizable and consistent within each platform. A foundation model trained on thousands of stations across multiple platforms learns the systematic offsets between hardware types and between consumer and research-grade measurements.

Second, random noise averages out across the network. Any individual station may have poor siting, but the statistical signal across thousands of stations in a region reflects actual atmospheric state. The network calibrates itself: when hundreds of stations in a region agree on a temperature trend, the outlier station’s bias becomes detectable without any external reference standard. This is the same principle that makes ensemble weather forecasting work—redundancy substitutes for individual precision.
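The self-calibration principle can be sketched with a robust consensus check: flag any station whose reading departs from the regional median by more than a threshold in median-absolute-deviation units, using no external reference standard. The readings below are fabricated, with one station given a siting bias.

```python
import numpy as np

def flag_outlier_stations(temps, z_thresh=3.5):
    """Indices of stations departing from the regional consensus.

    Uses a median/MAD robust z-score: agreement among neighbors is the
    only reference (an illustrative sketch of network self-calibration).
    """
    temps = np.asarray(temps, dtype=float)
    med = np.median(temps)
    mad = np.median(np.abs(temps - med)) or 1e-9   # guard against zero MAD
    z = 0.6745 * (temps - med) / mad               # 0.6745 scales MAD to sigma
    return np.where(np.abs(z) > z_thresh)[0]

# 12 regional stations agree; station 7 has a siting bias of about +6 C
readings = [14.1, 13.8, 14.3, 14.0, 13.9, 14.2, 14.1,
            20.2, 13.7, 14.0, 14.2, 13.9]
print(flag_outlier_stations(readings))
```

The median and MAD are insensitive to the outlier itself, so the biased station stands out sharply even though no station in the region is individually trusted.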


5. Development Phases

Phase 1: Paired-Site Validation (Current–Near Term)

Sites: Canemah Nature Laboratory (Oregon City, OR) and Owl Farm (Bellingham, WA).

Data: Tempest weather station and BirdWeather acoustic monitor at Canemah; Ambient Weather station and BirdWeather at Owl Farm. Two years of Macroscope archive data at Canemah; new deployment at Bellingham. The fact that the two sites use different weather hardware platforms is a feature, not a limitation—it immediately tests whether the model learns ecological dynamics rather than platform-specific sensor characteristics.

Architecture: Extend SOMA from single-site RBMs to a two-node graph with shared hidden layers. Train a joint model that learns what is common to Pacific Northwest ecology (shared weather-biology coupling dynamics) and what is site-specific (different species pools, different coastal influence, different latitude).

Validation: Does the joint model outperform site-specific models at anomaly detection? Does knowledge transfer occur—does training on Canemah data improve predictions at Bellingham?

Compute: M4 Max laptop (Data), Mac Mini M4 Pro (Galatea). CPU and Metal-accelerated JAX.

Phase 2: Pacific Northwest Regional Cluster (Medium Term)

Sites: 20–50 existing weather stations (Tempest, Ambient Weather, Davis) and BirdWeather acoustic monitors in the Pacific Northwest, recruited from the operator communities. The region from Northern California to British Columbia offers strong ecological gradients (coast to mountains, maritime to continental, sea level to subalpine), dense monitoring coverage, and the highest iNaturalist observation density in North America.

Recruitment: The value proposition to station operators is direct: join the network, and the model returns contextualized anomaly detection for your site. Your weather station currently tells you the weather. The LSM tells you whether today’s weather-and-birds pattern is normal for this place and season—or whether something unusual is happening. This is the iNaturalist insight: people contribute when the system gives them something personally meaningful in return.

Data: Weather API access for all participating stations regardless of hardware platform. BirdWeather API for acoustic stations. iNaturalist observations within a defined radius of each node. No new hardware required at any participating site.

Architecture: Graph neural network with Pacific Northwest stations as nodes. Edges defined by ecological similarity derived from shared species detections and correlated weather patterns—learned, not prescribed. The model discovers the biogeographic structure of the region from data.

Validation: Cross-site prediction accuracy. Can the model predict tomorrow’s bird activity at one station given weather data from neighboring stations? Can it detect the arrival of an atmospheric river through its ecological signature before the weather models announce it? Does prediction accuracy improve as nodes are added—does the network effect hold?

Phase 3: Coast-Wide Expansion (Longer Term)

Sites: All willing weather station operators (across Tempest, Ambient Weather, and Davis platforms), BirdWeather stations, and iNaturalist observations along the Pacific Coast of North America, from San Diego to Juneau. Hundreds to thousands of nodes.

Data: Full API access to all weather station platforms, BirdWeather, and iNaturalist. Heterogeneous data streams from multiple hardware platforms unified through the graph architecture—the model learns to normalize across sensor differences, treating platform diversity as a source of robustness rather than noise.

Architecture: Scaled graph neural network with learned spatial embeddings. Stations connect to their regional neighbors. iNaturalist observations attach to the nearest geographic node as episodic phenological context.

Capabilities at scale:

Continental anomaly detection. The model knows what February should feel and sound like from San Diego to Bellingham. When spring arrives two weeks early in one region but not its neighbor, the model detects the spatial boundary of the phenological shift.

Climate-ecology coupling discovery. Cross-domain relationships emerge from training—the model discovers that Pacific Decadal Oscillation state modulates breeding chronology across the coast range, or that atmospheric river events trigger invertebrate emergence pulses that cascade through avian communities. These discoveries arise from the model’s learned weights, not from investigator hypotheses.

Absence and silence detection. Following SOMA’s demonstrated capability for absence-as-signal, the scaled model detects what should be present but is not. Missing species at expected phenological windows. Silent dawn choruses. Failed fruiting events. These absences are ecologically significant and invisible to presence-only monitoring.

Transfer learning across ecosystems. The deep layers encode general ecological dynamics that transfer across sites. The shallow layers encode local character. A new monitoring station—a consumer weather station and BirdWeather deployment—can join the network and begin receiving contextualized anomaly detection within days of deployment, bootstrapped by the foundation model’s general ecological knowledge.

Phase 4: Open Platform

Specification: Publish standardized sensor deployment protocols, data format specifications, API integration requirements, and model architecture documentation. Open-source the model and training pipeline.

Access: Any site with a consumer weather station (Tempest, Ambient Weather, or Davis—starting under $200) and a BirdWeather monitor can join the network. The foundation model fine-tunes on their local data. The graph grows. No institutional affiliation required. Participation is opt-in: each station owner authorizes their own data through their platform’s API credentials and receives ecological intelligence in return.

Social Growth: The project succeeds if it becomes socially interesting—if people talk about what the model is showing them the way they share iNaturalist observations or compare Tempest rainfall totals after a storm. The natural constituencies are:

Families. A household with a backyard weather station and a BirdWeather microphone receives a daily ecosystem report: what the model expected to hear this morning, what it actually heard, and whether anything was unusual. Children grow up watching the phenological calendar unfold through the model’s predictions—when should the first swallows arrive? Did the model predict today’s dawn chorus correctly? The station becomes a window into ecological process, not just a temperature readout.

Schools. A classroom deploys a weather station and acoustic monitor on school grounds and joins the network. Students track their local ecosystem against the model’s predictions, compare their site to others in the region, and investigate anomalies the model flags. The LSM provides what textbook ecology cannot: real-time, place-based ecological observation with continental context. A high school biology class in Portland can see how their schoolyard fits into a pattern stretching from San Diego to Bellingham.

Dedicated citizen scientists. Birders, naturalists, and environmental monitors who already collect systematic observations gain a tool that contextualizes their expertise. The experienced birder who noticed that the varied thrushes arrived late this year can see whether the model detected the same pattern, whether it extends across the region, and what weather conditions might explain it. Their field knowledge validates and refines the model; the model gives their observations continental reach.

Community: The existing communities—over half a million weather station operators, thousands of BirdWeather users, nearly 4 million iNaturalist observers—are not just data sources. They are field ecologists in the original sense: people who watch their places. They know when the first rufous hummingbird arrives, when the trilliums bloom, when the evening grosbeak stops coming to the feeder. The model learns from their instruments; they validate the model with their eyes and ears. This is citizen science not as data collection for institutional researchers, but as distributed ecological intelligence—each participant both contributor and beneficiary.

The Growth Mechanism: Network effects drive adoption. Each new station makes the model slightly better for every existing station, because the graph becomes denser and the learned ecological relationships become more robust. Early adopters in a region receive coarser context; as their neighbors join, the resolution sharpens. A family sees their neighbor’s ecosystem dashboard and asks how to set one up. A teacher presents student findings at a district meeting and three more schools deploy stations the following semester. A birding club adopts the platform and suddenly a county has twenty nodes where it had two. This is how iNaturalist grew. This is how citizen science scales.


6. Relationship to Existing Work

6.1 What Exists

Foundation models for remote sensing are under active development. NASA’s Prithvi, IBM’s GeospatialFM, and Clay have demonstrated pre-training on satellite imagery for land cover classification and change detection. The ESA-sponsored “Foundation Models for Climate and Society” initiative targets ice, drought, and flood-zone mapping.

Most significantly, BioAnalyst (Trantas et al. 2025) describes itself as “the first multimodal Foundation Model tailored to biodiversity analysis and conservation planning.” Using a Perceiver IO encoder and 3D Swin Transformer backbone, BioAnalyst ingests 10 data modalities—species occurrence records, remote sensing indicators, climate variables, and environmental covariates—across 20 years of European spatiotemporal data at 0.25° (approximately 28 km) resolution. The model demonstrates competence at joint species distribution modeling for 500 vascular plant species and monthly climate linear probing, establishing that foundation model architectures can indeed learn cross-domain ecological representations. BioAnalyst is open-source, with published weights and fine-tuning pipelines.

BirdCast, a collaboration between Cornell Lab of Ornithology, Colorado State University, and University of Massachusetts, uses radar data and machine learning to forecast nocturnal bird migration across the United States in real time. Foundation models for bioacoustics are also emerging, with multiple architectures demonstrating transfer learning for species classification from passive acoustic monitoring data.

The PNAS perspective “A synergistic future for AI and ecology” (2023) calls explicitly for convergence between ecological science and AI, noting that “challenges that are commonplace in multiscale, context-dependent, and imperfectly observed ecological systems offer a panoply of problems through which AI moves closer to realizing its full potential.”

6.2 What Does Not Exist—and Why It Matters

BioAnalyst represents an important advance, but it approaches ecological intelligence from above. Its data sources are satellite-derived remote sensing indices, gridded climate reanalysis products, and species occurrence records aggregated to 28-km grid cells at monthly temporal resolution. It learns correlations between what satellites see and what occurrence databases report. This is powerful for continental-scale conservation planning, habitat suitability assessment, and population trend forecasting—the tasks for which it was designed.

What BioAnalyst cannot do—what no existing system does—is learn from the ground up. A Tempest weather station measures actual conditions at a specific point every 60 seconds. A BirdWeather microphone hears actual birds vocalizing in real time—capturing behavioral signals like dawn chorus timing, raptor-induced silence, and nocturnal flight calls that no satellite or occurrence database records. An iNaturalist observer photographs an actual organism at an actual moment in its phenological cycle. The temporal resolution is minutes, not months. The spatial resolution is the individual station, not a 28-km grid cell. And critically, acoustic data encodes behavior—when species vocalize, when they fall silent, when their activity patterns shift—information that exists at no other observational scale.

The difference is the difference between reading about a forest and standing in one. BioAnalyst can model that species richness in grid cell X correlates with NDVI and mean annual temperature. The proposed LSM can model that the towhees went quiet at 8:15 AM when the barometer dropped 2 mb/hour and no raptor was detected—and learn that this silence carries ecological meaning.

No existing system learns the joint distribution over ground-truth atmospheric and biological variables from distributed sensor networks at the temporal and spatial resolution of individual monitoring stations. Remote sensing foundation models observe from above. Weather AI predicts atmospheric state but not ecological consequences. BirdCast forecasts migration volume from radar, not species-specific behavioral patterns from acoustic detection. BioAnalyst integrates biodiversity and climate but at coarse spatiotemporal resolution from aggregated records, not real-time sensor streams.

The proposed LSM occupies a fundamentally different position: ground-truth weather coupled with ground-truth biodiversity, learned jointly, at the resolution where ecological interactions actually occur. It would be the first model trained on the continuous sensor streams that field ecologists recognize as the actual fabric of ecosystem dynamics—the minute-by-minute pulse of weather and the hour-by-hour rhythm of biological activity at specific places. The goal is not to predict species ranges across Europe, but to predict what the ecosystem at a given monitoring station should sound and feel like tomorrow—and to detect, in real time, when it doesn’t.

6.3 The DeepMind Observation

The GraphCast developers noted that their technology could be extended to “climate and ecology, energy, agriculture, and human and biological activity.” BioAnalyst has begun to realize this vision from the satellite perspective. But the ground-truth sensor approach—learning ecosystem dynamics from the instruments that actually measure weather and hear birds at specific places in real time—has not been attempted. The reason is not technical—the architecture exists, the compute is accessible, the training data is available. The reason is that it requires someone who understands both the ecological systems and the machine learning architecture, and who has access to the institutional relationships necessary to integrate research-grade and crowdsourced data networks.


7. Technical Requirements

7.1 Compute

Phases 1–2 operate within the capacity of Apple Silicon hardware with Metal acceleration. The M4 Max with 128GB unified memory supports models with thousands of input dimensions and millions of parameters. Phase 3 may require cloud compute for initial training (comparable to GraphCast’s training requirements scaled down by the ratio of spatial nodes), but inference—the ongoing monitoring operation—runs on modest hardware.
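A back-of-envelope check supports this claim. The parameter count below is an illustrative assumption drawn from the text ("millions of parameters"), not a measured figure:

```python
# Rough memory estimate for Phase 1-2 training on Apple Silicon.
# The model size is an assumed placeholder, not a specification.
params = 5_000_000          # assumed parameter count ("millions of parameters")
bytes_per_param = 4         # float32
train_multiplier = 4        # params + gradients + Adam moments (rough rule of thumb)

gib = params * bytes_per_param * train_multiplier / 2**30
print(f"{gib:.3f} GiB")     # ~0.075 GiB -- far below 128GB unified memory
```

Even with generous multipliers for activations and optimizer state, models at this scale occupy a small fraction of the M4 Max's unified memory.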

7.2 Software

JAX provides the computational framework, consistent with both SOMA’s existing implementation and GraphCast’s open-source codebase. Flax or Haiku supplies neural network building blocks, including attention mechanisms. The entire stack runs on the Apple Silicon platform described in Section 7.1.
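As a minimal sketch of what this stack looks like in practice, the snippet below implements scaled dot-product self-attention over per-station feature vectors in plain JAX. The array shapes, and the framing of stations as sequence elements, are illustrative assumptions rather than part of the proposal's architecture:

```python
import jax
import jax.numpy as jnp

def self_attention(x):
    """Scaled dot-product self-attention: every station attends to every other."""
    scores = x @ x.T / jnp.sqrt(x.shape[-1])   # (stations, stations) affinities
    weights = jax.nn.softmax(scores, axis=-1)  # each row sums to 1
    return weights @ x                         # context-mixed station embeddings

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (10, 16))           # 10 stations, 16-dim embeddings
out = self_attention(x)
print(out.shape)                               # (10, 16)
```

Flax or Haiku wraps this same primitive in multi-head, learnable-projection form; the point here is only that the full stack is a few lines of JAX.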

7.3 Data Access

All platforms provide API access. Tempest offers REST and WebSocket APIs for real-time and historical data. Ambient Weather provides RESTful and real-time APIs through AmbientWeather.net. Davis WeatherLink v2 API supports current conditions and historical data for all connected stations. BirdWeather provides open APIs for detection data. iNaturalist supports programmatic access to research-grade observations with geographic and taxonomic filtering.
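A hedged sketch of polling one station through a REST API, using only the standard library, is shown below. The endpoint URL and any JSON field names are assumptions for illustration; consult each platform's API documentation before relying on them:

```python
import json
import urllib.parse
import urllib.request

# Assumed Tempest REST endpoint -- verify against WeatherFlow's API docs.
BASE = "https://swd.weatherflow.com/swd/rest/observations/station"

def observation_url(station_id: int, token: str) -> str:
    """Build the request URL for a station's latest observations."""
    query = urllib.parse.urlencode({"token": token})
    return f"{BASE}/{station_id}?{query}"

def fetch_latest(station_id: int, token: str) -> dict:
    """Fetch and decode one observation payload (requires network access)."""
    with urllib.request.urlopen(observation_url(station_id, token), timeout=10) as resp:
        return json.load(resp)

# Usage (needs a real station ID and personal access token):
# obs = fetch_latest(12345, "YOUR_TOKEN")
```

The same pattern, with per-platform base URLs and authentication, covers Ambient Weather, Davis WeatherLink, BirdWeather, and iNaturalist polling.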

7.4 No New Hardware

This is the proposal’s most distinctive feature. The training corpus exists. The sensors are deployed. The APIs are live. The compute fits on a desk. The only infrastructure that must be built is the model itself—and the community of participants who choose to connect their stations. Data access is opt-in: each station owner authorizes their own data stream and receives ecological intelligence in return. The network grows through demonstrated value, not institutional mandate.


8. Expected Outcomes

A trained Large Sensor Model for ecological perception would produce:

Continental-scale anomaly detection. Real-time identification of ecosystem departures from learned expectations, decomposed by domain, timescale, and spatial extent. Early warning for phenological shifts, population crashes, invasive species establishment, and climate regime transitions.

Discovered ecological relationships. Cross-domain couplings encoded in the model’s learned weights, discoverable through attribution analysis. Relationships that field ecologists suspected but could not quantify, and relationships that no one anticipated.

Predictive ecological state. Next-day, next-week, next-season predictions of biodiversity activity at every monitored location. Not weather forecasting, but ecosystem forecasting—what the landscape should feel and sound like tomorrow.

Scalable citizen science integration. A framework that turns every consumer weather station and every iNaturalist observation into a node in a continental perception system. The value proposition is bidirectional: each participant’s data improves the model for everyone, and the model returns contextualized ecological insight to each participant. Families, schools, and dedicated naturalists gain a window into ecological process at their own place, with continental context. The network grows because participating is more interesting than not participating.

A new paradigm for ecological observation. The transition from representational monitoring (measuring, storing, comparing) to embodied monitoring (learning, predicting, perceiving)—not managed by institutions, but grown by communities of observers who want to understand their own places.
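As a toy illustration of the anomaly-detection outcome above, the sketch below scores a joint (weather, bioacoustic) observation with a simple Gaussian energy function. SOMA's actual energy model is not reproduced here; the variables, units, and covariance values are invented for illustration only:

```python
import jax.numpy as jnp

def gaussian_energy(x, mu, cov_inv):
    """Quadratic energy: high values mark departures from learned expectations."""
    d = x - mu
    return 0.5 * d @ cov_inv @ d

# Toy learned distribution over (temperature C, pressure trend mb/hr, vocal activity).
mu = jnp.array([12.0, 0.0, 0.6])
cov = jnp.array([[4.0, 0.0, 0.5],
                 [0.0, 1.0, 0.3],
                 [0.5, 0.3, 1.0]])
cov_inv = jnp.linalg.inv(cov)

typical   = jnp.array([13.0, -0.2, 0.55])
anomalous = jnp.array([13.0, -2.0, 0.05])   # falling pressure AND sudden silence

print(gaussian_energy(typical, mu, cov_inv)
      < gaussian_energy(anomalous, mu, cov_inv))   # True: cross-domain anomaly scores higher
```

The learned LSM would replace the hand-set mean and covariance with a conditional model of expected ecosystem state, but the decision rule is the same: flag observations whose energy exceeds what the learned distribution deems typical.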


9. The Personal Dimension

This proposal emerges from a specific intellectual trajectory. Thirty-six years of directing UC biological field stations, from the James San Jacinto Mountains Reserve to Blue Oak Ranch, taught me that ecological perception is fundamentally cross-domain—you cannot understand the birds without understanding the weather, the soil, the season, the history. The CENS era (2002–2012) demonstrated that distributed sensor networks could capture ecological dynamics at scales impossible for human observers. The Macroscope synthesizes these lessons into a personal research observatory.

Those decades also taught me something about institutions. The best ecological monitoring I have witnessed happens when people pay attention to their own places—when a naturalist knows the phenological calendar of her watershed, when a birder notices a shift in dawn chorus timing because he has been listening every morning for years. Institutional networks produce standardized data; grassroots observers produce ecological intelligence. The difference matters.

SOMA proved the concept at the scale of one backyard. The question is whether the same architecture—learning the statistical structure of how ecosystems behave from observation data alone—works when the backyards number in the thousands. The infrastructure says yes. The mathematics says yes. GraphCast says yes for the atmosphere. And the history of citizen science—from Christmas Bird Counts to iNaturalist—says that people will build the network themselves if you give them a reason to.

The idea crystallized on the drive from Oregon City to Bellingham on February 12, 2026, connecting three threads that had been developing independently: the thermodynamic sensing framework, the GraphCast precedent, and the realization that the training data already exists in the crowdsourced networks I had been drawing from for my own backyard. This proposal is an attempt to capture that convergence before the threads separate again.


References

  • Hamilton, M.P. (2026). “Embodied Ecological Sensing via Thermodynamic Models.” Canemah Nature Laboratory Technical Note CNL-TN-2026-014. https://canemah.org/archive/document.php?id=CNL-TN-2026-014
  • Lam, R. et al. (2023). “GraphCast: Learning skillful medium-range global weather forecasting.” Science. https://doi.org/10.1126/science.adi2336
  • Price, I. et al. (2023). “GenCast: Diffusion-based ensemble forecasting for medium-range weather.” arXiv:2312.15796.
  • Jelinčič, A. et al. (2025). “An efficient probabilistic hardware architecture for diffusion-like models.” arXiv:2510.23972.
  • Wolpert, D.H. et al. (2024). “Is stochastic thermodynamics the key to understanding the energy costs of computation?” Proceedings of the National Academy of Sciences, 121(45), e2321112121.
  • Hinton, G.E. (2012). “A Practical Guide to Training Restricted Boltzmann Machines.” Neural Networks: Tricks of the Trade, Springer.
  • Ravi, S. et al. (2023). “A synergistic future for AI and ecology.” Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.2220283120
  • Trantas, A. et al. (2025). “BioAnalyst: A Foundation Model for Biodiversity.” arXiv:2507.09080. https://arxiv.org/abs/2507.09080

End of Research Proposal

Cite This Document

Hamilton, M.P. (2026). "Toward a Large Sensor Model for Ecological Perception." Canemah Nature Laboratory Working Paper CNL-WP-2026-022. https://canemah.org/archive/CNL-WP-2026-022

BibTeX

@unpublished{cnl2026toward,
  author      = {Hamilton, Michael P.},
  title       = {Toward a Large Sensor Model for Ecological Perception},
  institution = {Canemah Nature Laboratory},
  number      = {CNL-WP-2026-022},
  year        = {2026},
  month       = feb,
  url         = {https://canemah.org/archive/document.php?id=CNL-WP-2026-022},
  abstract    = {Google DeepMind’s GraphCast demonstrated that machine learning models trained on historical atmospheric data can outperform physics-based weather forecasting on 90\% of verification targets. BioAnalyst (Trantas et al. 2025) demonstrated that multimodal foundation models can learn joint species-climate representations from satellite and occurrence data at continental scale. SOMA (Stochastic Observatory for Mesh Awareness), operating at Canemah Nature Laboratory, demonstrated that energy-based models trained on ecological sensor data can detect cross-domain anomalies invisible to single-domain analysis. This proposal argues that these results converge on an unrealized architectural target: a Large Sensor Model (LSM) for multi-domain ecological perception—a foundation model trained not on gridded reanalysis products or satellite imagery but on continuous ground-truth environmental sensor streams, learning the joint probability distribution over atmospheric and biological variables at the temporal and spatial resolution where ecological interactions actually occur. Critically, the infrastructure to train such a model already exists. Consumer weather station networks—WeatherFlow-Tempest, Ambient Weather, and Davis Instruments—collectively operate over half a million standardized stations worldwide with API access. BirdWeather maintains approximately 2,000 active acoustic monitoring stations running BirdNET species classification. iNaturalist has accumulated over 250 million verifiable observations—170 million at research grade—spanning all major taxonomic groups with geolocated timestamps. These platforms collectively provide the training corpus for a foundation model of ecosystem dynamics, requiring no new hardware deployment. We propose a phased development path: beginning with paired-site validation (Canemah, Oregon and Bellingham, Washington), expanding through regional recruitment of existing weather station and BirdWeather operators in the Pacific Northwest, and scaling organically as the network demonstrates value—following the grassroots trajectory that built iNaturalist from a graduate student project into a global biodiversity platform with 4 million observers. The resulting model would learn weather-biodiversity coupling across ecological gradients, producing predictions of ecosystem state whose deviations from observation constitute an anomaly detection system of unprecedented scope.}
}

Permanent URL: https://canemah.org/archive/document.php?id=CNL-WP-2026-022