Principle:Testtimescaling Testtimescaling github io Citation Registration
| Knowledge Sources | JSON data management, CI/CD pipeline design, Semantic Scholar API |
|---|---|
| Domains | Data_Management, Automation |
| Last Updated | 2026-02-14 |
Overview
Registering academic papers in a structured JSON registry enables automated citation tracking pipelines to fetch and aggregate citation counts over time.
Description
Citation Registration is the process of adding a paper's metadata to the repository's citation tracking system. This system consists of a JSON registry file that stores paper identifiers and a GitHub Actions workflow that periodically queries the Semantic Scholar API to retrieve current citation counts.
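As a rough illustration of the fetch step, the sketch below builds a Semantic Scholar Graph API request for a single arXiv ID and extracts `citationCount` from the response. The helper names (`citation_url`, `parse_citation_count`, `fetch_citation_count`) are hypothetical and are not taken from the repository's script, which may be structured differently.

```python
import json
import urllib.request

# Semantic Scholar Graph API paper endpoint; papers can be addressed as "arXiv:<id>".
API_BASE = "https://api.semanticscholar.org/graph/v1/paper"


def citation_url(arxiv_id: str) -> str:
    """Build the lookup URL for one arXiv ID, requesting only citationCount."""
    return f"{API_BASE}/arXiv:{arxiv_id}?fields=citationCount"


def parse_citation_count(body: str) -> int:
    """Extract the citation count from a JSON response body (0 if absent or null)."""
    data = json.loads(body)
    return int(data.get("citationCount") or 0)


def fetch_citation_count(arxiv_id: str, timeout: float = 10.0) -> int:
    """Query the API for one paper. This performs a real network request."""
    with urllib.request.urlopen(citation_url(arxiv_id), timeout=timeout) as resp:
        return parse_citation_count(resp.read().decode("utf-8"))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline; only `fetch_citation_count` touches the API.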
The registration step bridges the gap between manual paper curation (Steps 1-3) and automated data maintenance. Once a paper is registered, its citation count is automatically updated without further human intervention.
However, the repository's current architecture adds a complication: three locations must all be updated before a new paper is fully registered in the citation tracking system:
- Root `papers.json`: The primary registry file at the repository root, containing an array of paper objects with `title` and `arxiv_id` fields.
- Workflow `papers.json`: A duplicate copy at `.github/scripts/papers.json` that exists alongside the automation scripts.
- Python script hardcoded IDs: The automation script at `.github/scripts/update_arxiv_citations.py` contains a hardcoded list of arXiv IDs (approximately lines 22-25) that it iterates over to fetch citations. This script does not read from either `papers.json` file.
This triple-update requirement is a known technical debt issue. The fundamental design principle is that all paper identifiers must be synchronized across all three locations; failure to update any one of them results in incomplete citation tracking.
Usage
Use this principle after adding the paper to the comparison table (Step 3). Citation registration is Step 4 of the Adding_a_New_Paper workflow. The contributor needs only the paper title and arXiv ID, both of which were determined in Steps 1 and 2.
Theoretical Basis
Citation registration follows principles from data pipeline design and configuration management:
Single source of truth (aspirational): In an ideal architecture, there would be one authoritative registry of papers, and all consumers (scripts, workflows, badges) would read from that single source. The current architecture deviates from this ideal by maintaining multiple copies of the paper list, creating a synchronization burden. Understanding this gap is important for contributors to avoid partial updates.
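A minimal sketch of what the single-source ideal could look like: a consumer derives its list of arXiv IDs from the registry instead of hardcoding them. This is not how the current script works. The function names are hypothetical, and the only assumption about the file is the array-of-objects shape with `title` and `arxiv_id` fields described above.

```python
import json
from pathlib import Path


def arxiv_ids_from_json(text: str) -> list[str]:
    """Return the arxiv_id of every registry entry, in file order."""
    papers = json.loads(text)
    return [paper["arxiv_id"] for paper in papers]


def load_arxiv_ids(registry_path: Path) -> list[str]:
    """Read a registry file from disk and return its arXiv IDs."""
    return arxiv_ids_from_json(registry_path.read_text(encoding="utf-8"))
```

With a loader like this, registering a paper would touch only the registry file, and every consumer would pick up the change on its next run.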
Registry pattern: The JSON file acts as a registry -- a central catalog of entities (papers) with their identifiers. The registry pattern is common in systems that need to enumerate and iterate over a known set of items. Each registry entry contains the minimal information needed for identification: a human-readable title and a machine-usable arXiv ID.
Idempotent updates: Adding a paper that is already in the registry should be a no-op (or produce an identical result). The JSON structure (array of objects) makes duplicate detection straightforward by checking the arxiv_id field.
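One way to make registration idempotent, sketched under the assumption that the registry has already been loaded as a list of dictionaries with `title` and `arxiv_id` keys. The function name `register_paper` is hypothetical.

```python
def register_paper(registry: list[dict], title: str, arxiv_id: str) -> bool:
    """Append a paper unless its arxiv_id is already present.

    Returns True if an entry was added, False if the call was a no-op.
    """
    if any(entry.get("arxiv_id") == arxiv_id for entry in registry):
        return False
    registry.append({"title": title, "arxiv_id": arxiv_id})
    return True
```

Calling the function twice with the same arXiv ID leaves the registry unchanged after the first call, because the duplicate check keys on `arxiv_id` rather than on the title.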
Pipeline decoupling: By separating registration (human action) from citation fetching (automated action), the system decouples the rate of paper addition from the rate of citation updates. Papers can be added at any time, and the next scheduled workflow run will pick them up. This is a standard pattern in scheduled, batch-style pipeline architectures.
Consistency requirement: The most critical aspect of this registration is maintaining consistency across all three update locations. An inconsistency (e.g., a paper in papers.json but not in the Python script) will result in the paper being silently excluded from citation tracking. No automated validation currently exists to detect such inconsistencies.
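Since no such validation exists, the following is only a sketch of what a check could look like. It compares the ID sets from the three locations and reports what each one is missing. The names are hypothetical, and the regular expression assumes the script's hardcoded IDs are new-style arXiv identifiers (four digits, a dot, then four or five digits) written as plain string literals; the actual script may format them differently.

```python
import re

# Assumption: new-style arXiv IDs such as "0000.00001" appear literally in the script.
ARXIV_ID_RE = re.compile(r"\b\d{4}\.\d{4,5}\b")


def ids_in_script(source: str) -> set[str]:
    """Collect every arXiv-style ID that appears in the script's source text."""
    return set(ARXIV_ID_RE.findall(source))


def find_inconsistencies(locations: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each location name to the IDs it lacks relative to the union of all IDs.

    Locations that contain every ID are omitted from the result.
    """
    all_ids = set().union(*locations.values())
    missing = {name: all_ids - ids for name, ids in locations.items()}
    return {name: ids for name, ids in missing.items() if ids}
```

An empty result means the three locations agree. A check of this kind could run in CI so that a partial registration fails loudly instead of being silently skipped.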