Workflow:SqueezeAILab ETS ETS Experiment Pipeline

Knowledge Sources	SqueezeAILab ETS; ETS: Efficient Tree Search for Inference-Time Scaling; sglang-ets
Domains	LLMs, Inference_Time_Scaling, Tree_Search
Last Updated	2025-02-14 00:00 GMT

Overview

End-to-end process for running Efficient Tree Search (ETS) experiments on math benchmarks using a policy model, a process reward model, and the ETS cost-model-based selection strategy.

Description

This workflow covers the complete process of running an ETS inference-time scaling experiment. ETS performs tree search over candidate solution steps generated by a policy LLM, scored by a process reward model (PRM), and pruned via an Integer Linear Program (ILP) that maximizes expected outcome while encouraging KV cache sharing and trajectory diversity. The workflow spans from launching the model servers through to collecting the raw answer files that feed into evaluation.

Goal: Produce a JSON file of candidate answers (with per-step PRM scores) for each problem in a math benchmark dataset.

Scope: Covers server launch, hyperparameter configuration, tree search execution across multiple search widths, and result collection.

Strategy: Uses a client-server architecture in which the policy model and the reward model run as separate SGLang servers on different GPUs. The ETS selection method (softmax_costmodel) formulates node selection as an ILP solved with PuLP, and can optionally enforce trajectory diversity via sentence embedding clustering.

Usage

Execute this workflow when you have a math benchmark dataset (e.g., MATH500, GSM8K) and want to generate candidate solutions using ETS tree search. You need at least two GPUs: one for the policy model and one for the process reward model. The output is a JSON file per search width containing all candidate trajectories and their step-level scores, ready for downstream evaluation.

Execution Steps

Step 1: Install dependencies

Clone the ETS repository and its modified SGLang fork. Install the custom SGLang package that supports process reward model scoring during tree search. Also install the outlines library at the pinned version for constrained generation support.

Key considerations:

The SGLang fork (sglang-ets) is required; the upstream SGLang does not support PRM-guided tree search
Outlines must be pinned to version 0.0.44 for compatibility
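
A quick way to confirm the pin after installation is to query the installed package metadata. The sketch below uses only the standard library; the helper names installed_version and pin_satisfied are illustrative and do not come from the ETS repository.

```python
from importlib import metadata

REQUIRED_OUTLINES = "0.0.44"  # pin stated in this workflow


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_satisfied(package, required):
    """True only when the package is installed at exactly the required version."""
    return installed_version(package) == required
```

Calling pin_satisfied("outlines", REQUIRED_OUTLINES) before launching the servers catches an accidental upgrade early.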

Step 2: Launch policy model server

Start the generator (policy) model as an SGLang server on a dedicated GPU. The server exposes an HTTP endpoint that the tree search client calls for batched LLM inference. Tensor parallelism can be enabled if the model needs more than one GPU.

Key considerations:

The policy model runs on GPU 0 by default (configurable via CUDA_VISIBLE_DEVICES)
The server must be fully initialized before running tree search
Use tmux or background processes to keep the server running
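
One way to avoid starting the search too early is to poll the server port before launching the client. The following is a minimal sketch using only the standard library; wait_for_port is a hypothetical helper, and the host, port, and timeout values are whatever your setup uses. An open port shows only that the process is listening, so treat it as a necessary check rather than proof that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, poll_s=2.0):
    """Block until a TCP connection to host:port succeeds; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)
```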

Step 3: Launch reward model server

Start the process reward model (PRM) as a separate SGLang server on a second GPU. The PRM scores each generated step to guide the tree search. Its static memory fraction is lowered so that an embedding model can share the same GPU when diversity enforcement is enabled.

Key considerations:

The reward model runs on GPU 1 by default
Use --mem-fraction-static 0.85 to reserve memory for the embedding model
The PRM must be running before tree search begins
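
The two launches differ mainly in GPU assignment, port, and the memory fraction. The sketch below only assembles the command lines and environments; it launches nothing. The model paths and port numbers are placeholders, the helper names are illustrative, and any extra flags that the sglang-ets fork may require are not covered here.

```python
import os
import sys


def build_server_command(model_path, port, mem_fraction_static=None, tp_size=1):
    """Assemble the argv for one SGLang server process (nothing is launched)."""
    cmd = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        "--tp-size", str(tp_size),
    ]
    if mem_fraction_static is not None:
        cmd += ["--mem-fraction-static", str(mem_fraction_static)]
    return cmd


def gpu_env(gpu_ids):
    """Copy the current environment and pin the process to the given GPUs."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env


# Placeholder model paths and ports: substitute the ones used in your setup.
policy_cmd = build_server_command("path/to/policy-model", port=30000)
reward_cmd = build_server_command(
    "path/to/reward-model", port=30010, mem_fraction_static=0.85
)

# To actually start the servers (for example inside tmux):
#   import subprocess
#   subprocess.Popen(policy_cmd, env=gpu_env([0]))
#   subprocess.Popen(reward_cmd, env=gpu_env([1]))
```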

Step 4: Configure hyperparameters

Select or create a YAML configuration file that controls the tree search behavior. Key parameters include search width (number of candidate trajectories), selection method (softmax_costmodel for ETS), temperature settings, cost penalty (lambdac), and diversity weight (lambdas).

Key parameters:

width: Number of candidate trajectories (e.g., 16, 64, 256)
select_method: "softmax_costmodel" for ETS
lambdac: Cost penalty weight encouraging KV cache sharing
lambdas: Diversity enforcement weight via sentence embedding clustering
softmax_temperature: Controls sharpness of score-based allocation
max_step_tokens / max_tokens: Token budget per step and per trajectory
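
To make the parameter set concrete, the sketch below mirrors the fields listed above in a small Python dataclass with basic sanity checks. The class name EtsConfig, the validation rules, and the numeric values in the usage note are assumptions for illustration; the repository's YAML files remain the authoritative format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EtsConfig:
    width: int                  # number of candidate trajectories
    select_method: str          # "softmax_costmodel" selects ETS
    lambdac: float              # cost penalty weight
    lambdas: float              # diversity weight
    softmax_temperature: float  # sharpness of score-based allocation
    max_step_tokens: int        # token budget per step
    max_tokens: int             # token budget per trajectory

    def __post_init__(self):
        # These checks are assumed sanity rules, not taken from the repository.
        if self.width < 1:
            raise ValueError("width must be a positive integer")
        if self.softmax_temperature <= 0:
            raise ValueError("softmax_temperature must be positive")
        if self.lambdac < 0 or self.lambdas < 0:
            raise ValueError("lambdac and lambdas must be non-negative")
        if self.max_step_tokens > self.max_tokens:
            raise ValueError("max_step_tokens cannot exceed max_tokens")


def configs_for_widths(base, widths):
    """Return one validated config per search width, sharing all other fields."""
    return [replace(base, width=w) for w in widths]
```

For example, configs_for_widths(EtsConfig(16, "softmax_costmodel", 1.0, 1.0, 0.2, 256, 1024), [16, 64, 256]) yields three configs that differ only in width; the numbers here are placeholders, not recommended settings.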

Step 5: Run ETS tree search

Execute the tree search engine against the benchmark dataset. For each problem, the engine initializes a search tree, then iterates through depth levels: expanding candidate nodes by generating new solution steps from the policy model, scoring them with the PRM, and selecting which nodes to retain using the ETS ILP formulation. The ILP maximizes softmax-weighted outcome scores while penalizing total cost (encouraging KV cache sharing) and optionally enforcing trajectory diversity via sentence embedding clustering.
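
The trade-off in the selection step can be illustrated on a toy tree. The sketch below is a simplified stand-in, not the repository's formulation: it replaces the PuLP ILP with exhaustive search over subsets (exponential, so only suitable for tiny examples), uses the number of distinct tree nodes on the retained root-to-node paths as a proxy for KV cache cost, takes cluster labels as given instead of computing sentence embeddings, and decides only which nodes to keep, not how many children each receives. All function names are hypothetical.

```python
import math
from itertools import combinations


def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ancestors_inclusive(node, parent):
    """Yield node and every ancestor up to the root (parent[root] is None)."""
    while node is not None:
        yield node
        node = parent[node]


def select_nodes(candidates, scores, parent, cluster, lambdac, lambdas, temperature):
    """Return the subset maximizing reward - lambdac * cost + lambdas * diversity."""
    weights = dict(zip(candidates, softmax(scores, temperature)))
    n_tree = len(parent)
    n_clusters = len({cluster[c] for c in candidates})
    best_value, best_subset = float("-inf"), ()
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            kept = set()
            for c in subset:
                kept.update(ancestors_inclusive(c, parent))
            value = (
                sum(weights[c] for c in subset)               # softmax-weighted scores
                - lambdac * len(kept) / n_tree                # shared prefixes cost less
                + lambdas * len({cluster[c] for c in subset}) / n_clusters
            )
            if value > best_value:
                best_value, best_subset = value, subset
    return set(best_subset), best_value
```

With a tree in which candidates 3 and 4 share a parent and candidate 5 sits on a separate branch, raising lambdac drops the isolated branch, and adding a diversity weight can bring it back when it belongs to a different cluster.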

What happens at each depth:

Candidate nodes at the current depth are collected with their PRM scores
Leaf nodes (containing final answers) or nodes exceeding the token budget are retired
The ETS selection method solves an ILP to decide which nodes to keep and how many children to allocate
Retained nodes are expanded by forking the SGLang state and generating the next solution step
New nodes are scored by the PRM and inserted into the tree
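
The per-depth steps above can be sketched as a loop with the model-dependent parts injected as callables. This is a schematic outline under stated assumptions, not the engine's code: search_one_problem and its parameters (expand, score, select, is_finished) are hypothetical names, and in the real engine expansion forks SGLang state while selection solves the ILP.

```python
def search_one_problem(root, expand, score, select, is_finished, max_depth):
    """Run the per-depth loop and return finished (node, score) pairs.

    expand(node, n)   -> list of n child nodes (policy model call)
    score(node)       -> float (PRM call)
    select(active)    -> {index into active: number of children}
    is_finished(node) -> True for final answers or over-budget nodes
    """
    frontier = [(root, score(root))]
    finished = []
    for _ in range(max_depth):
        # Collect candidates and retire those that are complete or over budget.
        active = []
        for node, node_score in frontier:
            if is_finished(node):
                finished.append((node, node_score))
            else:
                active.append((node, node_score))
        if not active:
            return finished
        # Decide which nodes to keep and how many children each one gets.
        allocation = select(active)
        # Expand retained nodes and score the new children.
        frontier = [
            (child, score(child))
            for index, n_children in allocation.items()
            for child in expand(active[index][0], n_children)
        ]
    # Depth limit reached: keep whatever is already complete.
    finished.extend(item for item in frontier if is_finished(item[0]))
    return finished
```

Passing the selection strategy as a callable keeps the loop identical across selection methods, which makes it easy to compare ETS against a simpler baseline in the same harness.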

Step 6: Collect results

After tree search completes for all problems, the engine writes a JSON output file containing, for each problem: the original question, ground truth answer, all leaf trajectories with their full text, and per-step PRM scores. Statistics (total KV cache size, model calls, tokens generated, wall time) are logged separately.

Output structure:

answers.json: Array of objects, each with question, model_answer (list of trajectories with step_scores), ground_truth_answer, and total_tokens
stats.log: Aggregate efficiency metrics across all problems
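
A downstream script might read answers.json and pick one trajectory per problem from the step scores. The sketch below relies only on the keys named above (model_answer, step_scores, total_tokens). Aggregating step scores with min is one possible choice rather than the method the evaluation code necessarily uses, total_tokens is assumed to be numeric, and the "text" key in the usage note is a hypothetical field name.

```python
import json


def best_trajectory(record, aggregate=min):
    """Return the trajectory with the highest aggregated step score, or None."""
    scored = [
        (aggregate(t["step_scores"]), t)
        for t in record["model_answer"]
        if t["step_scores"]  # skip trajectories without any scored step
    ]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


def summarize(records):
    """Count problems, trajectories, and tokens across the whole output file."""
    return {
        "problems": len(records),
        "trajectories": sum(len(r["model_answer"]) for r in records),
        "total_tokens": sum(r["total_tokens"] for r in records),
    }
```

Typical use is records = json.load(open("answers.json")) followed by best_trajectory(records[0]); passing aggregate=lambda s: s[-1] ranks trajectories by their final step score instead.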
