Principle: Paper Relevance Assessment (Test-Time Scaling Survey, testtimescaling.github.io)
| Field | Value |
|---|---|
| Knowledge Sources | Academic literature review methodology, systematic review protocols |
| Domains | Research_Methodology, Academic_Survey |
| Last Updated | 2026-02-14 |
Overview
Systematic literature screening ensures that a curated academic survey maintains focus, quality, and completeness by evaluating candidate papers against a well-defined scope.
Description
Paper Relevance Assessment is the disciplined process of discovering candidate papers and determining whether they fall within the defined scope of a survey. For the test-time scaling survey, this means evaluating whether a paper addresses techniques, methods, or evaluations related to scaling computation at inference time in Large Language Models (LLMs).
The process draws from established systematic review methodology used in academic research. Rather than ad-hoc inclusion, each candidate paper is subjected to a structured screening protocol that checks alignment with the survey's research questions and taxonomic framework. This ensures the survey remains comprehensive yet focused, avoiding both scope creep and gaps in coverage.
The screening process operates as a funnel:
- Source Identification: Monitor primary sources including arXiv preprints, Semantic Scholar alerts, major ML/NLP conference proceedings (NeurIPS, ICML, ICLR, ACL, EMNLP), and citations within already-included papers.
- Relevance Screening: Evaluate whether the paper directly addresses test-time scaling in LLMs. This includes papers on chain-of-thought reasoning, search-augmented generation, self-refinement, verification methods, and compute-optimal inference strategies.
- Scope Fit: Determine whether the paper can be meaningfully classified within the survey's four-dimension taxonomy: What to Scale, How to Scale, Where to Scale, and How Well to Scale.
- Decision Output: Produce a binary accept/reject decision. For accepted papers, extract the arXiv ID in the standard format (XXXX.XXXXX) for downstream processing.
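The funnel above can be sketched as a minimal screening function. This is an illustrative sketch, not the survey's actual tooling: the `Candidate` record, the predicate arguments, and the helper names are assumptions standing in for human reviewer judgments. The regex accepts both 4- and 5-digit arXiv sequence numbers.

```python
import re
from dataclasses import dataclass

# Matches modern arXiv IDs: 4-digit YYMM, a dot, a 4-5 digit sequence
# number, and an optional version suffix (e.g. 2503.24235 or 2503.24235v2).
ARXIV_ID = re.compile(r"\b(\d{4}\.\d{4,5})(?:v\d+)?\b")

@dataclass
class Candidate:
    title: str
    abstract: str
    url: str

def screen(paper: Candidate, is_relevant, fits_taxonomy):
    """Apply the funnel: relevance screening, then scope fit, then decision.

    `is_relevant` and `fits_taxonomy` are predicates supplied by the
    reviewer (illustrative; in practice these are human decisions).
    Returns (accepted, arxiv_id).
    """
    if not is_relevant(paper):          # Relevance Screening
        return False, None
    if not fits_taxonomy(paper):        # Scope Fit
        return False, None
    match = ARXIV_ID.search(paper.url)  # Decision Output: extract arXiv ID
    return True, match.group(1) if match else None

# Example: a paper that passes both checks.
paper = Candidate(
    title="What, How, Where, and How Well? A Survey on Test-Time Scaling",
    abstract="...",
    url="https://arxiv.org/abs/2503.24235",
)
accepted, arxiv_id = screen(paper, lambda p: True, lambda p: True)
print(accepted, arxiv_id)  # True 2503.24235
```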
Usage
Use this principle when a new paper is discovered that may relate to test-time scaling and needs evaluation for inclusion in the survey. This is the first step in the Adding_a_New_Paper workflow and gates all subsequent steps. A paper that does not pass relevance assessment should not proceed to taxonomy classification, table entry, or citation registration.
Typical triggers for this step include:
- A new arXiv preprint appears in relevant categories (cs.CL, cs.AI, cs.LG)
- A colleague or community member suggests a paper for inclusion
- A citation chain from an existing included paper leads to a new candidate
- A conference publishes proceedings with potentially relevant work
Theoretical Basis
Literature review methodology is grounded in systematic review protocols originally developed for evidence-based medicine (e.g., PRISMA guidelines) and adapted for computer science surveys. The key principles are:
Reproducibility: The screening criteria must be explicit enough that different reviewers would reach the same inclusion/exclusion decision for the same paper. For this survey, the criteria are:
- The paper must address LLMs (not classical ML or small models)
- The paper must involve computation scaling at inference/test time (not training-time scaling)
- The paper must present a method, evaluation, or analysis that fits at least one cell in the What/How/Where/How Well taxonomy
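To make the reproducibility requirement concrete, the three criteria can be treated as a conjunctive checklist: a paper is included only if every criterion holds. A minimal sketch, where the field names and boolean judgments are assumptions standing in for the reviewer's decisions:

```python
# Illustrative checklist mirroring the three inclusion criteria above.
CRITERIA = {
    "addresses_llms": "Targets LLMs, not classical ML or small models",
    "inference_time_scaling": "Scales computation at inference/test time",
    "fits_taxonomy_cell": "Fits at least one What/How/Where/How-Well cell",
}

def include(judgments: dict) -> bool:
    """Include a paper only if every criterion is judged to hold."""
    return all(judgments.get(key, False) for key in CRITERIA)

# A training-time scaling paper fails the second criterion and is excluded.
print(include({"addresses_llms": True,
               "inference_time_scaling": False,
               "fits_taxonomy_cell": True}))  # False
```

Because the check is a conjunction over named criteria, two reviewers disagreeing on a decision can trace the disagreement to a specific criterion, which is what makes the protocol auditable.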
Completeness: Multiple discovery channels are monitored to minimize the chance of missing relevant work. No single source (e.g., only arXiv) is sufficient for comprehensive coverage.
Consistency: The same screening criteria are applied uniformly to all candidate papers, preventing bias toward particular sub-topics, authors, or institutions.
Traceability: Each screening decision (accept or reject) should be justifiable by reference to the defined criteria, enabling later audit or re-evaluation if the survey scope evolves.
The four-dimension taxonomy from "What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models" (arXiv:2503.24235) serves as the operational definition of scope. A paper is in-scope if it can be meaningfully classified along at least the What and How dimensions.
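The in-scope rule (classifiable along at least the What and How dimensions) could be encoded as follows. The dimension labels shown are hypothetical examples, not the survey's actual label set:

```python
# In-scope test over a hypothetical classification record: the What and
# How dimensions must both receive a label; Where and How-Well may be empty.
REQUIRED_DIMENSIONS = ("what", "how")

def in_scope(classification: dict) -> bool:
    """In-scope if both required dimensions carry a non-empty label."""
    return all(classification.get(dim) for dim in REQUIRED_DIMENSIONS)

print(in_scope({"what": "parallel sampling", "how": "verification",
                "where": None, "how_well": None}))  # True
print(in_scope({"what": "CoT reasoning", "how": None}))  # False
```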