Semantic Scholar API Integration for the Macroscope Collaboratory
Document ID: CNL-FN-2026-048
Date: April 11, 2026
Author: Michael P. Hamilton, Ph.D.
Affiliation: Canemah Nature Laboratory, Oregon City, Oregon
AI Assistance Disclosure: This field note was developed collaboratively with Claude (Anthropic, claude-opus-4-6) via claude.ai. Claude contributed to API documentation review, architectural analysis, and document drafting. The author takes full responsibility for the content, accuracy, and conclusions.
Abstract
An API key for the Semantic Scholar Academic Graph was obtained on April 11, 2026, providing authenticated access to the Allen Institute for AI’s (Ai2) corpus of approximately 200 million scholarly papers. This field note documents the API’s capabilities and sketches the integration path into the Macroscope Collaboratory (CNL-TN-2026-047 v0.3) as a literature_search worker. The integration addresses a gap identified during STR-001 first-light testing: the Collaboratory’s Priors step (Step 3 in the ten-step form model) currently has no systematic mechanism for surveying the peer-reviewed literature relevant to an investigation. Semantic Scholar’s keyword search, citation graph traversal, paper recommendations, and AI-generated TLDR summaries supply this capability, complementing the web_search worker’s coverage of grey literature, agency reports, and news sources.
1. Context
CNL-TN-2026-047 v0.3 documented the architectural pivot of the Collaboratory from an in-process PHP wizard calling the Anthropic API to an MCP-hosted investigation framework with Claude Desktop as the LLM host. The ten-step form model places literature survey work in Step 3 (Priors), where the investigator and Claude collaborate to establish what is already known about the investigation question. The web_search worker scaffolded in v0.3 covers one half of this need — web-accessible reports, datasets, and informal sources. The other half — peer-reviewed journals, conference proceedings, preprints — requires a structured academic search API.
Semantic Scholar, developed by the Allen Institute for AI (Ai2) in Seattle, is the strongest candidate for this role. It is free, open, and well documented; it covers all disciplines; and it provides features (citation graph traversal, recommendations, TLDR summaries) that go well beyond simple keyword search. An API key was requested and approved on April 11, 2026, providing authenticated access at 1 request per second across all endpoints.
2. API Capabilities
Semantic Scholar exposes three APIs, each with a distinct base URL.
2.1 Academic Graph API
Base URL: https://api.semanticscholar.org/graph/v1
This is the primary workhorse. Key endpoints:
Paper Bulk Search (/paper/search/bulk) — Keyword search intended for bulk retrieval of basic paper data. Queries support boolean operators and are matched without relevance ranking; results can instead be sorted, for example by citation count or publication date. Supports filtering by year range, publication type, fields of study, venue, minimum citation count, and open-access PDF availability. Returns up to 1,000 papers per call with a token-based continuation scheme. This is the recommended search endpoint for most use cases; the alternative Paper Relevance Search endpoint (/paper/search), which uses Semantic Scholar’s custom-trained relevance ranker, is more resource-intensive but can return nested author, citation, and reference detail.
Paper Details (/paper/{paper_id}) — Retrieve metadata for a single paper by Semantic Scholar ID, DOI, ArXiv ID, or other external identifiers. The fields query parameter controls which attributes are returned. Available fields include title, abstract, year, venue, authors, citation count, reference count, open-access PDF URL, TLDR (AI-generated summary), fields of study, publication types, and external IDs.
Paper Citations (/paper/{paper_id}/citations) — Forward citation walk: who has cited this paper? Returns paginated lists of citing papers with configurable field depth. Essential for “what has happened since this paper was published?” queries.
Paper References (/paper/{paper_id}/references) — Backward citation walk: what does this paper cite? Same paginated structure. Essential for tracing intellectual lineage.
Paper Batch (POST /paper/batch) — Retrieve details for up to 500 papers in a single request by submitting a list of IDs. Efficient for hydrating a set of papers discovered through search or citation traversal.
Author Search (/author/search) — Find authors by name. Returns author IDs, names, affiliated institutions, and publication counts.
Author Details (/author/{author_id}) — Retrieve an author’s profile including papers, citation counts, and h-index.
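As an illustration of how the search and batch endpoints listed above might be addressed, the following minimal sketch only builds request URLs and bodies; it performs no network I/O. The function names are hypothetical. The query parameter names (query, year, fieldsOfStudy, minCitationCount, token, fields), the {"ids": [...]} batch body, and the x-api-key header follow the public API documentation as understood at the time of writing and should be verified against it before use.

```python
import json
from urllib.parse import urlencode

GRAPH_BASE = "https://api.semanticscholar.org/graph/v1"
MAX_BATCH_IDS = 500  # per-request ceiling for POST /paper/batch


def auth_headers(api_key):
    """Header carrying the API key (assumed header name: x-api-key)."""
    return {"x-api-key": api_key}


def bulk_search_url(query, year=None, fields_of_study=None,
                    min_citations=None, token=None,
                    fields=("title", "year", "venue", "citationCount")):
    """Build the GET URL for one page of /paper/search/bulk."""
    params = {"query": query, "fields": ",".join(fields)}
    if year:
        params["year"] = year  # e.g. "2020-2026"
    if fields_of_study:
        params["fieldsOfStudy"] = ",".join(fields_of_study)
    if min_citations is not None:
        params["minCitationCount"] = str(min_citations)
    if token:
        params["token"] = token  # continuation token from the previous page
    return f"{GRAPH_BASE}/paper/search/bulk?{urlencode(params)}"


def batch_requests(paper_ids, fields=("title", "abstract", "tldr")):
    """Split IDs into (url, json_body) pairs of at most MAX_BATCH_IDS each."""
    unique = list(dict.fromkeys(paper_ids))  # de-duplicate, keep order
    url = f"{GRAPH_BASE}/paper/batch?" + urlencode({"fields": ",".join(fields)})
    return [
        (url, json.dumps({"ids": unique[i:i + MAX_BATCH_IDS]}))
        for i in range(0, len(unique), MAX_BATCH_IDS)
    ]
```

Keeping URL construction separate from the HTTP call makes the request shapes easy to inspect and test before any quota is spent.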
2.2 Recommendations API
Base URL: https://api.semanticscholar.org/recommendations/v1
Given one or more seed papers, returns papers that Semantic Scholar considers semantically related. This is distinct from citation graph traversal — it uses embedding similarity rather than explicit citation links, so it can surface relevant work that neither cites nor is cited by the seed paper.
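A recommendation request for several seed papers might be assembled as in the sketch below. It assumes the multi-seed form of the endpoint (a POST to /papers/ with positivePaperIds and negativePaperIds in the JSON body, plus fields and limit as query parameters), which should be confirmed against the current Recommendations API documentation. The function name and defaults are hypothetical.

```python
import json
from urllib.parse import urlencode

RECS_BASE = "https://api.semanticscholar.org/recommendations/v1"


def recommendation_request(seed_ids, exclude_ids=(), limit=20,
                           fields=("title", "year", "citationCount")):
    """Return (url, json_body) for a multi-seed recommendation POST.

    Assumed body keys: positivePaperIds, negativePaperIds.
    """
    if not seed_ids:
        raise ValueError("at least one seed paper is required")
    query = urlencode({"fields": ",".join(fields), "limit": limit})
    body = {
        "positivePaperIds": list(seed_ids),
        "negativePaperIds": list(exclude_ids),
    }
    return f"{RECS_BASE}/papers/?{query}", json.dumps(body)
```

Negative examples are optional; they give the investigator a way to steer recommendations away from an unrelated sense of a shared keyword.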
2.3 Datasets API
Base URL: https://api.semanticscholar.org/datasets/v1
Provides download links for bulk corpus snapshots and incremental diffs between releases. Useful for organizations hosting their own copy of the Semantic Scholar corpus. Not relevant to the Collaboratory at present but worth noting for future scale.
3. Integration Architecture
3.1 The literature_search Worker
The integration follows the same worker framework pattern established in CNL-TN-2026-047. A new worker at Projects/Live/workers/workers/literature_search/worker.py will implement the worker contract (argv[1] = job path, exit 0 with .result.json sibling) and accept payloads with two operating modes:
Search mode. The payload provides one or more keyword queries with optional filters (year range, fields of study, minimum citation count, open-access only). The worker calls the Paper Bulk Search endpoint, retrieves results, optionally hydrates each result with TLDR and abstract via the Paper Batch endpoint, and returns a structured JSON result containing paper metadata and source URLs.
Explore mode. The payload provides one or more paper identifiers (DOI, S2 paper ID, or ArXiv ID) and a traversal directive: citations (forward), references (backward), recommendations (semantic similarity), or details (metadata only). The worker performs the requested traversal, respecting the 1 req/sec rate limit with appropriate sleep intervals, and returns structured results.
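The explore-mode dispatch described above could be reduced to a small mapping from traversal directive to endpoint URL, as in this sketch. The helper names are hypothetical, and the identifier heuristics (a bare string beginning with "10." is treated as a DOI, a string shaped like NNNN.NNNNN as an arXiv ID) are assumptions that a real worker may want to replace with explicit typing in the payload. Whether the single-paper recommendations route accepts prefixed external IDs should also be verified.

```python
import re
from urllib.parse import quote

GRAPH_BASE = "https://api.semanticscholar.org/graph/v1"
RECS_BASE = "https://api.semanticscholar.org/recommendations/v1"

# New-style arXiv identifiers such as 2301.10140 or 2301.10140v2.
_ARXIV_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def normalize_paper_id(identifier):
    """Add the DOI: or ARXIV: prefix to bare external identifiers."""
    ident = identifier.strip()
    if ":" in ident:              # already prefixed, e.g. "DOI:10.1000/x"
        return ident
    if ident.startswith("10."):   # heuristic: bare DOI
        return "DOI:" + ident
    if _ARXIV_RE.match(ident):    # heuristic: bare arXiv ID
        return "ARXIV:" + ident
    return ident                  # assume a Semantic Scholar paper ID


def explore_url(identifier, directive):
    """Map a traversal directive to the endpoint that serves it."""
    pid = quote(normalize_paper_id(identifier), safe=":/")
    if directive == "details":
        return f"{GRAPH_BASE}/paper/{pid}"
    if directive in ("citations", "references"):
        return f"{GRAPH_BASE}/paper/{pid}/{directive}"
    if directive == "recommendations":
        return f"{RECS_BASE}/papers/forpaper/{pid}"
    raise ValueError(f"unknown traversal directive: {directive!r}")
```

Rejecting unknown directives with an exception, rather than falling back to a default, keeps a malformed payload from silently consuming rate-limited requests.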
Both modes produce a result object with two components: a sources array of structured paper records (id, title, authors, year, venue, citation count, abstract, tldr, pdf_url) and a narrative text block summarizing the findings in prose suitable for insertion into a Priors field. The narrative generation can be done locally by Ollama or deferred to Claude Desktop via the MCP channel; the initial implementation will return the structured data and let Claude Desktop handle the synthesis.
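The result object and the worker contract might fit together as in the following sketch. Several details are assumptions rather than settled design: the output key names (for example citation_count), the sibling naming rule (job.json becomes job.result.json), and the fetch_papers hook, which stands in for the actual API calls. The input field names (paperId, citationCount, tldr.text, openAccessPdf.url) follow the Academic Graph response format. The narrative is left empty, matching the plan to let Claude Desktop handle synthesis.

```python
import json
from pathlib import Path


def to_source_record(paper):
    """Flatten one Semantic Scholar paper object into a sources entry."""
    tldr = paper.get("tldr") or {}          # may be null in the response
    pdf = paper.get("openAccessPdf") or {}  # may be null in the response
    return {
        "id": paper.get("paperId"),
        "title": paper.get("title"),
        "authors": [a.get("name") for a in paper.get("authors") or []],
        "year": paper.get("year"),
        "venue": paper.get("venue"),
        "citation_count": paper.get("citationCount"),
        "abstract": paper.get("abstract"),
        "tldr": tldr.get("text"),
        "pdf_url": pdf.get("url"),
    }


def build_result(papers):
    """Two-part result: structured sources plus a (deferred) narrative."""
    return {"sources": [to_source_record(p) for p in papers], "narrative": ""}


def result_path_for(job_path):
    """Assumed sibling naming: <name>.json -> <name>.result.json."""
    return job_path.with_name(job_path.stem + ".result.json")


def main(argv, fetch_papers=lambda payload: []):
    """Worker contract: argv[1] is the job path; return 0 after writing the result."""
    job_path = Path(argv[1])
    job = json.loads(job_path.read_text(encoding="utf-8"))
    result = build_result(fetch_papers(job.get("payload", {})))
    result_path_for(job_path).write_text(json.dumps(result, indent=2), encoding="utf-8")
    return 0
```

A real worker.py would end by passing sys.argv to main and exiting with its return value.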
3.2 Credential Management
The S2 API key is stored in the existing credential file at ~/Sites-secure/collaboratory/credentials.json, in a new s2_api_key field alongside the existing ollama_api_key. The file remains mode 0600, readable only by the owning user account.
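Loading the key could look like the sketch below, which also refuses to read a credential file whose permissions are looser than owner-only. The function name and the decision to fail hard on loose permissions are assumptions, not established Collaboratory behavior.

```python
import json
import os
import stat
from pathlib import Path

CRED_PATH = Path("~/Sites-secure/collaboratory/credentials.json").expanduser()


def load_s2_api_key(path=CRED_PATH):
    """Return the s2_api_key field, refusing group- or world-accessible files."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:  # any group/other permission bit is set
        raise PermissionError(f"{path} has mode {mode:04o}; expected 0600")
    creds = json.loads(Path(path).read_text(encoding="utf-8"))
    key = creds.get("s2_api_key")
    if not key:
        raise KeyError("s2_api_key is missing from the credential file")
    return key
```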
3.3 Rate Limiting
The authenticated rate limit is 1 request per second cumulative across all endpoints. The worker must enforce this internally with a minimum 1.0-second sleep between API calls. For a typical Priors survey (one bulk search returning 20 papers, followed by a batch details call for TLDRs), the total wall time is approximately 3 seconds — well within the worker framework’s tolerance for long-running jobs.
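One way to enforce the limit is a small limiter that every API call passes through, sketched below. Note that this is a slight variation on a fixed 1.0-second sleep: it spaces the start of consecutive requests by at least the minimum interval and sleeps only for whatever part of that interval has not already elapsed. The class name is hypothetical, and the injectable clock and sleep functions exist only to make the behavior testable without waiting.

```python
import time


class RateLimiter:
    """Keep consecutive request starts at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # start time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its start."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A worker would create one limiter per job and call wait() immediately before each HTTP request, regardless of which endpoint is being called, since the limit is cumulative across endpoints.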
Citation graph traversal is more intensive. Walking two hops of forward citations for a well-cited paper could require dozens of requests. The worker should cap traversal depth and result count to keep individual jobs under 60 seconds.
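The caps described above can be expressed as a breadth-first walk with explicit budgets, as in this sketch. The function name, the default limits, and the fetch_citations callable (which stands in for one rate-limited call to the citations endpoint and returns citing paper IDs) are all hypothetical. The sketch also assumes one request per expanded paper, ignoring pagination of long citation lists.

```python
from collections import deque


def bounded_citation_walk(seed_id, fetch_citations,
                          max_depth=2, max_requests=40, max_papers=200):
    """Breadth-first forward-citation walk with depth, request, and size caps.

    At 1 request per second, max_requests=40 implies roughly 40 seconds of
    enforced spacing, leaving headroom under a 60-second job budget.
    """
    seen = {seed_id}
    found = []                      # citing papers, in discovery order
    queue = deque([(seed_id, 0)])
    requests_made = 0
    while queue and requests_made < max_requests and len(found) < max_papers:
        paper_id, depth = queue.popleft()
        if depth >= max_depth:      # do not expand beyond the depth cap
            continue
        requests_made += 1
        for citing_id in fetch_citations(paper_id):
            if citing_id in seen:
                continue
            seen.add(citing_id)
            found.append(citing_id)
            queue.append((citing_id, depth + 1))
            if len(found) >= max_papers:
                break
    return found, requests_made
```

Breadth-first order means that when a budget runs out, the papers already collected are the ones closest to the seed, which is usually the more useful truncation for a Priors survey.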
3.4 MCP Exposure
The existing Collaboratory MCP server gains no new tools for the literature search; the worker is dispatched and polled through the existing dispatch_worker_job, check_job_status, and get_job_result tool chain. Claude Desktop requests a literature search by dispatching a job with job_type: "literature_search" and the appropriate payload. This is the same pattern used for web_search and echo.
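For illustration, job envelopes for the two modes might be assembled as follows. Only job_type: "literature_search" comes from the design above; every payload key here (mode, queries, filters, paper_ids, directive) is a hypothetical placeholder for a schema that has not yet been fixed, and the functions do not call the MCP tools themselves.

```python
VALID_DIRECTIVES = ("citations", "references", "recommendations", "details")


def build_search_job(queries, year=None, fields_of_study=None,
                     min_citations=None, open_access_only=False):
    """Assemble a search-mode job; payload keys are illustrative only."""
    if isinstance(queries, str):
        queries = [queries]
    if not queries:
        raise ValueError("at least one query is required")
    filters = {
        "year": year,
        "fields_of_study": fields_of_study,
        "min_citations": min_citations,
        "open_access_only": open_access_only or None,
    }
    return {
        "job_type": "literature_search",
        "payload": {
            "mode": "search",
            "queries": list(queries),
            # Drop unset filters so the worker sees only explicit choices.
            "filters": {k: v for k, v in filters.items() if v is not None},
        },
    }


def build_explore_job(paper_ids, directive):
    """Assemble an explore-mode job for one traversal directive."""
    if directive not in VALID_DIRECTIVES:
        raise ValueError(f"unknown traversal directive: {directive!r}")
    return {
        "job_type": "literature_search",
        "payload": {"mode": "explore", "paper_ids": list(paper_ids),
                    "directive": directive},
    }
```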
4. Relationship to the Ten-Step Model
The literature_search worker primarily serves two steps in the investigation workflow:
Step 3 (Priors). This is the primary consumer. When Claude Desktop is helping the investigator survey existing knowledge, it can dispatch both web_search and literature_search jobs, then synthesize results from grey literature and peer-reviewed sources into a unified Priors narrative. The structured paper records provide proper citations that flow through to Step 10 (Publication).
Step 5 (Collection). Some investigations may require pulling specific papers or datasets identified during Priors. The explore mode’s ability to retrieve open-access PDF URLs supports this, though the actual PDF retrieval and text extraction would be a separate worker concern.
The citation graph capabilities also support Step 9 (Reflections), where the investigator and Claude consider the broader context of the investigation’s findings. Walking the citation neighborhood of key papers cited in the investigation can reveal related work that contextualizes the results.
5. The STRATA Context Triad
This integration completes a three-source pattern for the Priors step that parallels the STRATA context architecture documented in CNL-TN-2026-045:
| Context Layer | Source | Worker | Content |
|---|---|---|---|
| Place | macroscope_nexus.places, lookup_cache | (direct DB query) | Geographic, ecological, and climatic priors for the investigation site |
| Instrument | macroscope.sensor_platforms | (direct DB query) | Available sensors, calibration history, data quality metadata |
| Knowledge | Semantic Scholar + web | literature_search, web_search | Published literature, grey literature, agency data, news |
The place and instrument layers were already operational in the v0.2 wizard and prevented the geographic hallucination documented in CNL-TN-2026-045. The knowledge layer was partially served by the web_search scaffold but lacked systematic access to the peer-reviewed literature. The Semantic Scholar integration fills this gap.
6. Attribution Requirements
Per the Semantic Scholar license agreement, any public-facing use of their data requires attribution and a citation to their Open Data Platform paper:
Kinney, R., Anastasiades, C., Authur, R., et al. (2023). “The Semantic Scholar Open Data Platform.” arXiv:2301.10140.
The Collaboratory UI (Phase D) should include a “Data source: Semantic Scholar API” attribution line on any page displaying literature search results.
7. Forward Work
Implementation of the literature_search worker fits into Phase A of the CNL-TN-2026-047 forward plan, alongside the web_search worker. The two workers share the same credential file, the same dispatch pattern, and the same result-retrieval path. They can be built and tested in parallel.
The immediate next step is to write and test the worker script, starting with a simple bulk search query against a known topic (e.g., “diurnal temperature range riparian” with a 2020-2026 year filter) and verifying that the round trip through the worker framework produces a well-formed .result.json.
Document History
| Version | Date | Changes |
|---|---|---|
| 0.1 | 2026-04-11 | Initial draft. API capabilities survey, integration architecture, relationship to Collaboratory ten-step model and STRATA context triad. |
Cite This Document
Permanent URL: https://canemah.org/archive/document.php?id=CNL-FN-2026-048