Implementation: testtimescaling.github.io Paper Taxonomy Analysis

| Field | Value |
|---|---|
| Type | Pattern Doc (manual analysis process) |
| Source | README.md:L47-73 (taxonomy reference) |
| Domains | Research_Methodology, Academic_Survey |
| Last Updated | 2026-02-14 |
Overview
A structured manual analysis process for classifying a paper across all dimensions of the test-time scaling taxonomy, producing column values for the comparison table.
Description
This pattern documents the analytical process a contributor follows to classify a paper that has already passed relevance screening. The output is a complete set of taxonomy values that will populate one row in the comparison table.
The classification requires careful reading of the paper's methodology, experimental setup, and evaluation sections. Each of the nine classification fields (What, SFT, RL, STI, SEA, VER, AGG, Where, How Well) must be assigned either a specific value or marked as not applicable.
The process involves:
- Determine the "What" dimension: Read the paper's method description and identify whether it uses Parallel generation, Sequential reasoning, a Hybrid of both, or Internal (model-autonomous) compute allocation.
- Classify SFT: Determine whether the method involves supervised fine-tuning. If yes, note the specific approach (e.g., distillation, instruction tuning). If not applicable, mark as ✗.
- Classify RL: Determine whether reinforcement learning is used. If yes, note the specific algorithm (GRPO, DPO, PPO, REINFORCE, etc.). If not applicable, mark as ✗.
- Classify STI (Stimulation): Identify whether the method uses prompting or stimulation techniques such as Chain-of-Thought (CoT), Self-Refine, Budget Forcing, Think-then-Respond, or similar. Mark as ✗ if not applicable.
- Classify SEA (Search): Identify whether the method employs search algorithms such as Monte Carlo Tree Search (MCTS), Beam Search, Best-First Search, or similar. Mark as ✗ if not applicable.
- Classify VER (Verification): Identify whether the method uses verification or reward models such as Process Reward Models (PRM), Outcome Reward Models (ORM), Self-Evaluation, or similar. Mark as ✗ if not applicable.
- Classify AGG (Aggregation): Identify whether the method aggregates multiple outputs using Best-of-N, Majority Voting, Weighted Fusion, or similar. Mark as ✗ if not applicable.
- Determine "Where": List all evaluation domains covered by the paper's experiments: Math, Code, Sci (Science), Game, Basics, Agents, Knowledge, Open-Ended, Multi-Modal.
- Determine "How Well": List the evaluation metrics used: Pass@1, Accuracy, Win Rate, BLEU, or any other reported metrics.
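The nine-field result of the steps above can be sketched as a simple record, for example in Python. This is an illustrative data structure, not part of the repository's tooling; the class name, `NA` marker, and `validate` helper are assumptions.

```python
from dataclasses import dataclass, field

NA = "✗"  # marker for "not applicable"

# allowed values for the "What" dimension, per the taxonomy
WHAT_VALUES = {"Parallel", "Sequential", "Hybrid", "Internal"}


@dataclass
class TaxonomyRow:
    """One comparison-table row produced by the manual classification."""
    what: str                # one of WHAT_VALUES, or a comma-separated combination
    sft: str = NA            # SFT approach, e.g. "Distillation", or ✗
    rl: str = NA             # RL algorithm, e.g. "GRPO", or ✗
    sti: str = NA            # stimulation technique, e.g. "CoT", or ✗
    sea: str = NA            # search algorithm, e.g. "MCTS", or ✗
    ver: str = NA            # verification method, e.g. "PRM", or ✗
    agg: str = NA            # aggregation method, e.g. "Best-of-N", or ✗
    where: list[str] = field(default_factory=list)     # evaluation domains
    how_well: list[str] = field(default_factory=list)  # reported metrics

    def validate(self) -> None:
        """Reject 'What' values outside the taxonomy (combinations allowed)."""
        for part in self.what.split(","):
            if part.strip() not in WHAT_VALUES:
                raise ValueError(f"unknown 'What' value: {part.strip()!r}")
```

Fields default to ✗ so a contributor only fills in what the paper actually uses.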
Usage
Apply this classification process to every paper that passes the relevance screening step. The analysis typically requires 15-30 minutes per paper for thorough classification. When in doubt about a classification, refer to the taxonomy definitions in the survey paper (arXiv: 2503.24235) and to existing entries in the comparison table for precedent.
Code Reference
Source Location
The taxonomy reference is defined in the repository at README.md:L47-73. The taxonomy categories and their definitions originate from the survey paper.
Interface Specification
TAXONOMY CLASSIFICATION INTERFACE
===================================
Input:
- paper: {
    title: string,
    arxiv_id: string,
    full_text: accessible paper content (PDF or HTML)
  }
Process:
For each field, assign one of:
- A specific value (method name, domain, metric)
- A combination of values (comma-separated)
- "✗" (not applicable)
Fields:
1. what: "Parallel" | "Sequential" | "Hybrid" | "Internal" | combination
2. sft: specific_method | "✗"
3. rl: "GRPO" | "DPO" | "PPO" | "REINFORCE" | other | "✗"
4. sti: "CoT" | "Self-Refine" | "Budget Forcing" | other | "✗"
5. sea: "MCTS" | "Beam" | "Best-First" | other | "✗"
6. ver: "PRM" | "ORM" | "Self-Evaluate" | other | "✗"
7. agg: "Best-of-N" | "Majority Vote" | "Fusion" | other | "✗"
8. where: list of { "Math" | "Code" | "Sci" | "Game" | "Basics" |
"Agents" | "Knowledge" | "Open-Ended" | "Multi-Modal" }
9. how_well: list of metric names (e.g., "Pass@1", "Accuracy", "Win Rate")
Output:
- classification: {
    what: string,
    sft: string,
    rl: string,
    sti: string,
    sea: string,
    ver: string,
    agg: string,
    where: string,
    how_well: string
  }
Import
No imports required. This is a manual analytical process. The taxonomy reference should be consulted at README.md:L47-73 in the repository.
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| paper_title | String | Yes | Title of the paper being classified |
| arxiv_id | String | Yes | arXiv identifier in format XXXX.XXXXX |
| full_text | Paper content | Yes | Access to the full paper for reading methodology and evaluation sections |
Outputs
| Output | Type | Description |
|---|---|---|
| what | String | Scaling strategy: Parallel, Sequential, Hybrid, Internal, or combination |
| sft | String | Supervised fine-tuning method or ✗ |
| rl | String | Reinforcement learning algorithm or ✗ |
| sti | String | Stimulation/prompting method or ✗ |
| sea | String | Search algorithm or ✗ |
| ver | String | Verification method or ✗ |
| agg | String | Aggregation method or ✗ |
| where | String | Comma-separated list of evaluation domains |
| how_well | String | Comma-separated list of evaluation metrics |
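Since every output is a string, rendering a finished classification as a comparison-table row is mechanical. A minimal sketch follows; the column order (title first, then the nine taxonomy fields) is an assumption and should be matched to the actual comparison table before use.

```python
# taxonomy fields in the order used throughout this doc (assumed column order)
FIELDS = ("what", "sft", "rl", "sti", "sea", "ver", "agg", "where", "how_well")


def to_table_row(paper_title: str, classification: dict) -> str:
    """Render one classification as a markdown table row.

    Missing fields default to ✗ (not applicable).
    """
    cells = [paper_title] + [classification.get(f, "✗") for f in FIELDS]
    return "| " + " | ".join(cells) + " |"
```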
Usage Examples
Example 1: Classifying a search-based reasoning paper
Paper: "Scaling LLM Test-Time Compute Optimally..."
arXiv ID: 2408.03314
Analysis:
- Method description: Uses process reward models to guide tree search
over reasoning steps, with best-of-N aggregation.
- Training: No fine-tuning mentioned → SFT: ✗, RL: ✗
- Inference: Uses verification (PRM) and search (beam-like)
with aggregation (Best-of-N)
Classification result:
what: "Sequential"
sft: "✗"
rl: "✗"
sti: "✗"
sea: "Beam"
ver: "PRM"
agg: "Best-of-N"
where: "Math"
how_well: "Pass@1, Accuracy"
Example 2: Classifying an RL-trained reasoning model
Paper: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL"
arXiv ID: 2501.12948
Analysis:
- Method description: Uses GRPO reinforcement learning to train
extended chain-of-thought reasoning. Also uses SFT for cold start.
- Training: SFT (cold start), RL (GRPO)
- Inference: Stimulation (long CoT), no explicit search or verification
Classification result:
what: "Sequential"
sft: "Cold Start SFT"
rl: "GRPO"
sti: "Long CoT"
sea: "✗"
ver: "✗"
agg: "✗"
where: "Math, Code, Sci"
how_well: "Pass@1, Accuracy"
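Expressed in the shape of the Interface Specification's output object, the Example 2 result looks like this (values copied from the worked analysis above; the variable name and completeness check are illustrative):

```python
# Example 2 (DeepSeek-R1) as a classification dict
deepseek_r1 = {
    "what": "Sequential",
    "sft": "Cold Start SFT",
    "rl": "GRPO",
    "sti": "Long CoT",
    "sea": "✗",
    "ver": "✗",
    "agg": "✗",
    "where": "Math, Code, Sci",
    "how_well": "Pass@1, Accuracy",
}

# sanity check: every taxonomy field is present and non-empty,
# so the dict fills a complete comparison-table row
FIELDS = ("what", "sft", "rl", "sti", "sea", "ver", "agg", "where", "how_well")
assert all(deepseek_r1.get(f) for f in FIELDS)
```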