Implementation: testtimescaling.github.io Paper Taxonomy Analysis

| Field | Value |
|---|---|
| Type | Pattern Doc (manual analysis process) |
| Source | README.md:L47-73 (taxonomy reference) |
| Domains | Research_Methodology, Academic_Survey |
| Last Updated | 2026-02-14 |
Overview
A structured manual analysis process for classifying a paper across all dimensions of the test-time scaling taxonomy, producing column values for the comparison table.
Description
This pattern documents the analytical process a contributor follows to classify a paper that has already passed relevance screening. The output is a complete set of taxonomy values that will populate one row in the comparison table.
The classification requires careful reading of the paper's methodology, experimental setup, and evaluation sections. Each of the nine classification fields (What, SFT, RL, STI, SEA, VER, AGG, Where, How Well) must be assigned either a specific value or marked as not applicable.
The process involves:
- Determine the "What" dimension: Read the paper's method description and identify whether it uses Parallel generation, Sequential reasoning, a Hybrid of both, or Internal (model-autonomous) compute allocation.
- Classify SFT: Determine whether the method involves supervised fine-tuning. If yes, note the specific approach (e.g., distillation, instruction tuning). If not applicable, mark as ✗.
- Classify RL: Determine whether reinforcement learning is used. If yes, note the specific algorithm (GRPO, DPO, PPO, REINFORCE, etc.). If not applicable, mark as ✗.
- Classify STI (Stimulation): Identify whether the method uses prompting or stimulation techniques such as Chain-of-Thought (CoT), Self-Refine, Budget Forcing, Think-then-Respond, or similar. Mark as ✗ if not applicable.
- Classify SEA (Search): Identify whether the method employs search algorithms such as Monte Carlo Tree Search (MCTS), Beam Search, Best-First Search, or similar. Mark as ✗ if not applicable.
- Classify VER (Verification): Identify whether the method uses verification or reward models such as Process Reward Models (PRM), Outcome Reward Models (ORM), Self-Evaluation, or similar. Mark as ✗ if not applicable.
- Classify AGG (Aggregation): Identify whether the method aggregates multiple outputs using Best-of-N, Majority Voting, Weighted Fusion, or similar. Mark as ✗ if not applicable.
- Determine "Where": List all evaluation domains covered by the paper's experiments: Math, Code, Sci (Science), Game, Basics, Agents, Knowledge, Open-Ended, Multi-Modal.
- Determine "How Well": List the evaluation metrics used: Pass@1, Accuracy, Win Rate, BLEU, or any other reported metrics.
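The nine-field result of the steps above can be sketched as a simple record, for example in Python. This is an illustrative data structure, not part of the repository's tooling; the class name, `NA` marker, and `validate` helper are assumptions.

```python
from dataclasses import dataclass, field

NA = "✗"  # marker for "not applicable"

# allowed values for the "What" dimension, per the taxonomy
WHAT_VALUES = {"Parallel", "Sequential", "Hybrid", "Internal"}


@dataclass
class TaxonomyRow:
    """One comparison-table row produced by the manual classification."""
    what: str                # one of WHAT_VALUES, or a comma-separated combination
    sft: str = NA            # SFT approach, e.g. "Distillation", or ✗
    rl: str = NA             # RL algorithm, e.g. "GRPO", or ✗
    sti: str = NA            # stimulation technique, e.g. "CoT", or ✗
    sea: str = NA            # search algorithm, e.g. "MCTS", or ✗
    ver: str = NA            # verification method, e.g. "PRM", or ✗
    agg: str = NA            # aggregation method, e.g. "Best-of-N", or ✗
    where: list[str] = field(default_factory=list)     # evaluation domains
    how_well: list[str] = field(default_factory=list)  # reported metrics

    def validate(self) -> None:
        """Reject 'What' values outside the taxonomy (combinations allowed)."""
        for part in self.what.split(","):
            if part.strip() not in WHAT_VALUES:
                raise ValueError(f"unknown 'What' value: {part.strip()!r}")
```

Fields default to ✗ so a contributor only fills in what the paper actually uses.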
Usage
Apply this classification process to every paper that passes the relevance screening step. The analysis typically requires 15-30 minutes per paper for thorough classification. When in doubt about a classification, refer to the taxonomy definitions in the survey paper (arXiv: 2503.24235) and to existing entries in the comparison table for precedent.
Code Reference
Source Location
The taxonomy reference is defined in the repository at README.md:L47-73. The taxonomy categories and their definitions originate from the survey paper.
Interface Specification
TAXONOMY CLASSIFICATION INTERFACE
===================================
Input:
- paper: {
    title: string,
    arxiv_id: string,
    full_text: accessible paper content (PDF or HTML)
  }
Process:
For each field, assign one of:
- A specific value (method name, domain, metric)
- A combination of values (comma-separated)
- "✗" (not applicable)
Fields:
1. what: "Parallel" | "Sequential" | "Hybrid" | "Internal" | combination
2. sft: specific_method | "✗"
3. rl: "GRPO" | "DPO" | "PPO" | "REINFORCE" | other | "✗"
4. sti: "CoT" | "Self-Refine" | "Budget Forcing" | other | "✗"
5. sea: "MCTS" | "Beam" | "Best-First" | other | "✗"
6. ver: "PRM" | "ORM" | "Self-Evaluate" | other | "✗"
7. agg: "Best-of-N" | "Majority Vote" | "Fusion" | other | "✗"
8. where: list of { "Math" | "Code" | "Sci" | "Game" | "Basics" |
"Agents" | "Knowledge" | "Open-Ended" | "Multi-Modal" }
9. how_well: list of metric names (e.g., "Pass@1", "Accuracy", "Win Rate")
Output:
- classification: {
    what: string,
    sft: string,
    rl: string,
    sti: string,
    sea: string,
    ver: string,
    agg: string,
    where: string,
    how_well: string
  }
Import
No imports required. This is a manual analytical process. The taxonomy reference should be consulted at README.md:L47-73 in the repository.
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| paper_title | String | Yes | Title of the paper being classified |
| arxiv_id | String | Yes | arXiv identifier in format XXXX.XXXXX |
| full_text | Paper content | Yes | Access to the full paper for reading methodology and evaluation sections |
Outputs
| Output | Type | Description |
|---|---|---|
| what | String | Scaling strategy: Parallel, Sequential, Hybrid, Internal, or combination |
| sft | String | Supervised fine-tuning method or ✗ |
| rl | String | Reinforcement learning algorithm or ✗ |
| sti | String | Stimulation/prompting method or ✗ |
| sea | String | Search algorithm or ✗ |
| ver | String | Verification method or ✗ |
| agg | String | Aggregation method or ✗ |
| where | String | Comma-separated list of evaluation domains |
| how_well | String | Comma-separated list of evaluation metrics |
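Since every output is a string, rendering a finished classification as a comparison-table row is mechanical. A minimal sketch follows; the column order (title first, then the nine taxonomy fields) is an assumption and should be matched to the actual comparison table before use.

```python
# taxonomy fields in the order used throughout this doc (assumed column order)
FIELDS = ("what", "sft", "rl", "sti", "sea", "ver", "agg", "where", "how_well")


def to_table_row(paper_title: str, classification: dict) -> str:
    """Render one classification as a markdown table row.

    Missing fields default to ✗ (not applicable).
    """
    cells = [paper_title] + [classification.get(f, "✗") for f in FIELDS]
    return "| " + " | ".join(cells) + " |"
```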
Usage Examples
Example 1: Classifying a search-based reasoning paper
Paper: "Scaling LLM Test-Time Compute Optimally..."
arXiv ID: 2408.03314
Analysis:
- Method description: Uses process reward models to guide tree search
over reasoning steps, with best-of-N aggregation.
- Training: No fine-tuning mentioned → SFT: ✗, RL: ✗
- Inference: Uses verification (PRM) and search (beam-like)
with aggregation (Best-of-N)
Classification result:
what: "Sequential"
sft: "✗"
rl: "✗"
sti: "✗"
sea: "Beam"
ver: "PRM"
agg: "Best-of-N"
where: "Math"
how_well: "Pass@1, Accuracy"
Example 2: Classifying an RL-trained reasoning model
Paper: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL"
arXiv ID: 2501.12948
Analysis:
- Method description: Uses GRPO reinforcement learning to train
extended chain-of-thought reasoning. Also uses SFT for cold start.
- Training: SFT (cold start), RL (GRPO)
- Inference: Stimulation (long CoT), no explicit search or verification
Classification result:
what: "Sequential"
sft: "Cold Start SFT"
rl: "GRPO"
sti: "Long CoT"
sea: "✗"
ver: "✗"
agg: "✗"
where: "Math, Code, Sci"
how_well: "Pass@1, Accuracy"
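Expressed in the shape of the Interface Specification's output object, the Example 2 result looks like this (values copied from the worked analysis above; the variable name and completeness check are illustrative):

```python
# Example 2 (DeepSeek-R1) as a classification dict
deepseek_r1 = {
    "what": "Sequential",
    "sft": "Cold Start SFT",
    "rl": "GRPO",
    "sti": "Long CoT",
    "sea": "✗",
    "ver": "✗",
    "agg": "✗",
    "where": "Math, Code, Sci",
    "how_well": "Pass@1, Accuracy",
}

# sanity check: every taxonomy field is present and non-empty,
# so the dict fills a complete comparison-table row
FIELDS = ("what", "sft", "rl", "sti", "sea", "ver", "agg", "where", "how_well")
assert all(deepseek_r1.get(f) for f in FIELDS)
```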