
Implementation: testtimescaling.github.io Paper Taxonomy Analysis

From Leeroopedia


Type Pattern Doc (manual analysis process)
Source README.md:L47-73 (taxonomy reference)
Domains Research_Methodology, Academic_Survey
Last Updated 2026-02-14

Overview

A structured manual analysis process for classifying a paper across all dimensions of the test-time scaling taxonomy, producing column values for the comparison table.

Description

This pattern documents the analytical process a contributor follows to classify a paper that has already passed relevance screening. The output is a complete set of taxonomy values that will populate one row in the comparison table.

The classification requires careful reading of the paper's methodology, experimental setup, and evaluation sections. Each of the nine classification fields (What, SFT, RL, STI, SEA, VER, AGG, Where, How Well) must be assigned either a specific value or marked as not applicable.

The process involves:

  1. Determine the "What" dimension: Read the paper's method description and identify whether it uses Parallel generation, Sequential reasoning, a Hybrid of both, or Internal (model-autonomous) compute allocation.
  2. Classify SFT: Determine if the method involves supervised fine-tuning. If yes, note the specific approach (e.g., distillation, instruction tuning). If not applicable, mark as ✗.
  3. Classify RL: Determine if reinforcement learning is used. If yes, note the specific algorithm (GRPO, DPO, PPO, REINFORCE, etc.). If not applicable, mark as ✗.
  4. Classify STI (Stimulation): Identify if the method uses prompting or stimulation techniques such as Chain-of-Thought (CoT), Self-Refine, Budget Forcing, Think-then-Respond, or similar. Mark as ✗ if not applicable.
  5. Classify SEA (Search): Identify if the method employs search algorithms such as Monte Carlo Tree Search (MCTS), Beam Search, Best-First Search, or similar. Mark as ✗ if not applicable.
  6. Classify VER (Verification): Identify if the method uses verification or reward models such as Process Reward Models (PRM), Outcome Reward Models (ORM), Self-Evaluation, or similar. Mark as ✗ if not applicable.
  7. Classify AGG (Aggregation): Identify if the method aggregates multiple outputs using Best-of-N, Majority Voting, Weighted Fusion, or similar. Mark as ✗ if not applicable.
  8. Determine "Where": List all evaluation domains from the paper's experiments: Math, Code, Sci (Science), Game, Basics, Agents, Knowledge, Open-Ended, Multi-Modal.
  9. Determine "How Well": List the evaluation metrics used: Pass@1, Accuracy, Win Rate, BLEU, or any other reported metrics.

Usage

Apply this classification process to every paper that passes the relevance screening step. The analysis typically requires 15-30 minutes per paper for thorough classification. When in doubt about a classification, refer to the taxonomy definitions in the survey paper (arXiv: 2503.24235) and to existing entries in the comparison table for precedent.

Code Reference

Source Location

The taxonomy reference is defined in the repository at README.md:L47-73. The taxonomy categories and their definitions originate from the survey paper.

Interface Specification

TAXONOMY CLASSIFICATION INTERFACE
===================================

Input:
  - paper: {
      title: string,
      arxiv_id: string,
      full_text: accessible paper content (PDF or HTML)
    }

Process:
  For each field, assign one of:
    - A specific value (method name, domain, metric)
    - A combination of values (comma-separated)
    - "✗" (not applicable)

  Fields:
    1. what:     "Parallel" | "Sequential" | "Hybrid" | "Internal" | combination
    2. sft:      specific_method | "✗"
    3. rl:       "GRPO" | "DPO" | "PPO" | "REINFORCE" | other | "✗"
    4. sti:      "CoT" | "Self-Refine" | "Budget Forcing" | other | "✗"
    5. sea:      "MCTS" | "Beam" | "Best-First" | other | "✗"
    6. ver:      "PRM" | "ORM" | "Self-Evaluate" | other | "✗"
    7. agg:      "Best-of-N" | "Majority Vote" | "Fusion" | other | "✗"
    8. where:    list of { "Math" | "Code" | "Sci" | "Game" | "Basics" |
                           "Agents" | "Knowledge" | "Open-Ended" | "Multi-Modal" }
    9. how_well: list of metric names (e.g., "Pass@1", "Accuracy", "Win Rate")

Output:
  - classification: {
      what: string,
      sft: string,
      rl: string,
      sti: string,
      sea: string,
      ver: string,
      agg: string,
      where: string,
      how_well: string
    }
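Fields 1 and 8 of the interface above have a closed vocabulary, so they can be checked mechanically. This is a minimal sketch, assuming comma-separated combinations; the function names are hypothetical, and the open-vocabulary fields (sft, rl, etc., which admit "other") are deliberately left unchecked.

```python
# Allowed values for the closed-vocabulary fields of the interface above.
WHAT_VALUES = {"Parallel", "Sequential", "Hybrid", "Internal"}
DOMAINS = {"Math", "Code", "Sci", "Game", "Basics",
           "Agents", "Knowledge", "Open-Ended", "Multi-Modal"}

def valid_what(value: str) -> bool:
    """A 'what' entry is one strategy or a comma-separated combination."""
    return all(part.strip() in WHAT_VALUES for part in value.split(","))

def valid_where(value: str) -> bool:
    """Every listed evaluation domain must come from the taxonomy's set."""
    return all(part.strip() in DOMAINS for part in value.split(","))

print(valid_what("Sequential"))           # True
print(valid_what("Parallel, Sequential")) # True
print(valid_where("Math, Poetry"))        # False
```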

Import

No imports required. This is a manual analytical process. The taxonomy reference should be consulted at README.md:L47-73 in the repository.

I/O Contract

Inputs

Parameter Type Required Description
paper_title String Yes Title of the paper being classified
arxiv_id String Yes arXiv identifier in format XXXX.XXXXX
full_text Paper content Yes Access to the full paper for reading methodology and evaluation sections

Outputs

Output Type Description
what String Scaling strategy: Parallel, Sequential, Hybrid, Internal, or combination
sft String Supervised fine-tuning method, or ✗
rl String Reinforcement learning algorithm, or ✗
sti String Stimulation/prompting method, or ✗
sea String Search algorithm, or ✗
ver String Verification method, or ✗
agg String Aggregation method, or ✗
where String Comma-separated list of evaluation domains
how_well String Comma-separated list of evaluation metrics
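Since all nine outputs are plain strings, turning a finished classification into one comparison-table row is a simple join in field order. A hedged sketch (the pipe delimiter and `to_table_row` name are illustrative; the actual table format is the wiki's own):

```python
# Field order matches the taxonomy's column order.
FIELD_ORDER = ["what", "sft", "rl", "sti", "sea", "ver", "agg", "where", "how_well"]

def to_table_row(classification: dict) -> str:
    """Join the nine output fields, in taxonomy order, into one table row."""
    return " | ".join(classification[f] for f in FIELD_ORDER)

# Values from Example 2 below, used as sample data:
example = {
    "what": "Sequential", "sft": "Cold Start SFT", "rl": "GRPO",
    "sti": "Long CoT", "sea": "✗", "ver": "✗", "agg": "✗",
    "where": "Math, Code, Sci", "how_well": "Pass@1, Accuracy",
}
print(to_table_row(example))
# → Sequential | Cold Start SFT | GRPO | Long CoT | ✗ | ✗ | ✗ | Math, Code, Sci | Pass@1, Accuracy
```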

Usage Examples

Example 1: Classifying a search-based reasoning paper

Paper: "Scaling LLM Test-Time Compute Optimally..."
arXiv ID: 2408.03314

Analysis:
  - Method description: Uses process reward models to guide tree search
    over reasoning steps, with best-of-N aggregation.
  - Training: No fine-tuning mentioned → SFT: ✗, RL: ✗
  - Inference: Uses verification (PRM) and search (beam-like)
    with aggregation (Best-of-N)

Classification result:
  what:     "Sequential"
  sft:      "✗"
  rl:       "✗"
  sti:      "✗"
  sea:      "Beam"
  ver:      "PRM"
  agg:      "Best-of-N"
  where:    "Math"
  how_well: "Pass@1, Accuracy"

Example 2: Classifying an RL-trained reasoning model

Paper: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL"
arXiv ID: 2501.12948

Analysis:
  - Method description: Uses GRPO reinforcement learning to train
    extended chain-of-thought reasoning. Also uses SFT for cold start.
  - Training: SFT (cold start), RL (GRPO)
  - Inference: Stimulation (long CoT), no explicit search or verification

Classification result:
  what:     "Sequential"
  sft:      "Cold Start SFT"
  rl:       "GRPO"
  sti:      "Long CoT"
  sea:      "✗"
  ver:      "✗"
  agg:      "✗"
  where:    "Math, Code, Sci"
  how_well: "Pass@1, Accuracy"
