
Principle:Testtimescaling Testtimescaling github io Taxonomy Classification

From Leeroopedia


Knowledge Sources "What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models" (arXiv: 2503.24235)
Domains Research_Methodology, Academic_Survey
Last Updated 2026-02-14

Overview

A multi-dimensional classification scheme for test-time scaling papers, based on a hierarchical taxonomy that categorizes each paper along four orthogonal dimensions: What, How, Where, and How Well.

Description

Taxonomy Classification is the process of analyzing a paper that has passed relevance screening and assigning it values across four dimensions defined by the survey's taxonomic framework. This structured classification enables meaningful comparison across papers and reveals research trends and gaps.

The four dimensions are orthogonal, meaning a paper's classification on one dimension is independent of its classification on others. This creates a multi-dimensional space in which each paper occupies a specific position, enabling systematic comparison and gap analysis.

Dimension 1 -- What to Scale: This dimension captures the fundamental strategy for allocating additional computation at inference time.

  • Parallel: Generate multiple outputs simultaneously and aggregate results (e.g., Best-of-N sampling, majority voting)
  • Sequential: Build later computations on intermediate results from earlier steps (e.g., chain-of-thought, iterative refinement)
  • Hybrid: Combine parallel and sequential strategies within a single approach
  • Internal: The model autonomously determines how much computation to allocate (e.g., adaptive compute, early exit mechanisms)

Dimension 2 -- How to Scale: This dimension identifies the specific methods and techniques used. It is subdivided into tuning methods and inference methods:

  • Tuning Methods: SFT (Supervised Fine-Tuning), RL (Reinforcement Learning including GRPO, DPO, PPO, and others)
  • Inference Methods: STI (Stimulation, e.g., CoT prompting, self-refinement, budget forcing), VER (Verification, e.g., PRM, self-evaluation), SEA (Search, e.g., MCTS, beam search), AGG (Aggregation, e.g., Best-of-N, fusion)

Dimension 3 -- Where to Scale: This dimension categorizes the application domains where the method is evaluated:

  • Reasoning Tasks: Math, Code, Science (Sci), Game
  • General-Purpose Tasks: Basics, Agents, Knowledge, Open-Ended, Multi-Modal

Dimension 4 -- How Well to Scale: This dimension captures the evaluation metrics used to measure effectiveness:

  • Performance metrics (Pass@1, Accuracy, Win Rate)
  • Efficiency metrics (tokens per problem, FLOPs)
  • Controllability metrics (ability to adjust compute allocation)
  • Scalability metrics (how performance improves with additional compute)
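The four dimensions above can be expressed as a small data model. The sketch below is illustrative only, not code from the survey; the class and field names are assumptions, while the category codes (SFT, RL, STI, VER, SEA, AGG) and domain labels come from the taxonomy itself.

```python
from dataclasses import dataclass
from enum import Enum


class WhatToScale(Enum):
    """Dimension 1: the strategy for allocating extra inference compute."""
    PARALLEL = "parallel"
    SEQUENTIAL = "sequential"
    HYBRID = "hybrid"
    INTERNAL = "internal"


# Dimension 2 is hierarchical: tuning methods vs. inference methods.
TUNING_METHODS = {"SFT", "RL"}
INFERENCE_METHODS = {"STI", "VER", "SEA", "AGG"}

# Dimension 3: application domains, split into reasoning and general-purpose.
WHERE_TO_SCALE = {
    "reasoning": {"Math", "Code", "Sci", "Game"},
    "general": {"Basics", "Agents", "Knowledge", "Open-Ended", "Multi-Modal"},
}

# Dimension 4: families of evaluation metrics.
HOW_WELL = {"Performance", "Efficiency", "Controllability", "Scalability"}


@dataclass
class PaperClassification:
    title: str
    what: WhatToScale
    how: set[str]       # subset of TUNING_METHODS | INFERENCE_METHODS
    where: set[str]     # domains drawn from WHERE_TO_SCALE values
    how_well: set[str]  # subset of HOW_WELL

    def validate(self) -> bool:
        """Check that every assigned value belongs to the taxonomy."""
        all_how = TUNING_METHODS | INFERENCE_METHODS
        all_where = set().union(*WHERE_TO_SCALE.values())
        return (self.how <= all_how
                and self.where <= all_where
                and self.how_well <= HOW_WELL)
```

Because the dimensions are orthogonal, each field is assigned independently; `validate` only checks membership, not cross-dimension consistency.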

Usage

Use this principle after a paper passes relevance screening (Step 1 of the Adding_a_New_Paper workflow). The output of taxonomy classification feeds directly into the comparison table entry (Step 3). The classifier should read the paper's methodology and evaluation sections carefully enough to assign values to all applicable dimensions.

A paper does not need to have a value for every sub-category. For example, a paper that only uses chain-of-thought prompting without any tuning would have values for STI but would be marked as not applicable for SFT and RL.
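A concrete record for such a prompting-only paper might look as follows. This is a hypothetical example (the paper fields and `None` convention are assumptions, not survey conventions): sub-categories that do not apply are simply marked absent.

```python
# Hypothetical classification record for a chain-of-thought-only paper.
# None marks a sub-category that is not applicable.
classification = {
    "what": "Sequential",          # CoT builds on intermediate steps
    "how": {
        "SFT": None,               # no supervised fine-tuning performed
        "RL": None,                # no reinforcement learning performed
        "STI": "CoT prompting",
        "VER": None,
        "SEA": None,
        "AGG": None,
    },
    "where": ["Math", "Knowledge"],
    "how_well": ["Accuracy"],
}

# Only the applicable "How" sub-categories feed into the comparison table.
applicable = [code for code, value in classification["how"].items()
              if value is not None]
print(applicable)  # ['STI']
```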

Theoretical Basis

The taxonomy originates from the survey paper "What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models" (arXiv: 2503.24235). It was designed to provide a comprehensive yet non-overlapping classification framework for the rapidly growing body of test-time scaling literature.

The design principles of the taxonomy are:

Orthogonality: Each dimension captures a distinct aspect of a paper's contribution. The "What" dimension describes the computational strategy, "How" describes the technical method, "Where" describes the application domain, and "How Well" describes the evaluation approach. These four aspects are largely independent, meaning changes in one dimension do not necessitate changes in another.

Exhaustiveness: Within each dimension, the categories are designed to cover the full space of existing and foreseeable approaches. The "What" dimension spans parallel and sequential strategies, with hybrid combining the two and internal covering model-controlled allocation. The "How" dimension covers both training-time preparation (SFT, RL) and inference-time execution (STI, VER, SEA, AGG).

Granularity: The taxonomy provides enough detail to distinguish meaningfully different approaches (e.g., separating MCTS from beam search within the SEA category) while remaining abstract enough to avoid a separate category for every individual paper.

Hierarchical structure: The "How" dimension demonstrates hierarchical organization, with a top-level split between tuning methods and inference methods, and further subdivision into specific technique categories. This enables both high-level and fine-grained comparison.
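That two-level hierarchy can be sketched as a nested mapping. The branch and helper names below are assumptions; the category codes and example techniques are taken from the dimension definitions above.

```python
# Two-level "How" hierarchy: a top-level split into tuning vs. inference,
# then specific technique categories with example methods.
HOW_TAXONOMY = {
    "tuning": {
        "SFT": ["supervised fine-tuning"],
        "RL": ["GRPO", "DPO", "PPO"],
    },
    "inference": {
        "STI": ["CoT prompting", "self-refinement", "budget forcing"],
        "VER": ["PRM", "self-evaluation"],
        "SEA": ["MCTS", "beam search"],
        "AGG": ["Best-of-N", "fusion"],
    },
}


def top_level(code: str) -> str:
    """Map a technique category (e.g. 'SEA') to its top-level branch,
    supporting high-level comparison; the nested lists support
    fine-grained comparison."""
    for branch, categories in HOW_TAXONOMY.items():
        if code in categories:
            return branch
    raise KeyError(code)


print(top_level("SEA"))  # inference
print(top_level("RL"))   # tuning
```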
