Principle:SqueezeAILab ETS Policy Model Serving
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Serving |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
A serving pattern that deploys a language model as an HTTP server for batched text generation during tree search.
Description
In the ETS framework, the policy model (also called the generator) is responsible for producing candidate solution steps at each node of the search tree. Rather than loading the model in-process, it is deployed as a standalone HTTP server using the SGLang framework. This client-server architecture enables:
- GPU isolation: The policy model runs on a dedicated GPU, preventing memory contention with the reward model
- Concurrent access: Multiple tree search threads can issue generation requests simultaneously
- State management: SGLang manages KV cache states server-side, enabling efficient state forking for tree expansion
The policy model server accepts text generation requests and returns completions with configurable stopping conditions (e.g., stopping at step-delimiter tokens such as "ки" for Llemma/Mistral models, or "\n\n" for Llama models).
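As a concrete sketch, a tree-search client might assemble its generation request as below. The `/generate` route and the `text`/`sampling_params` body shape follow SGLang's native HTTP API; the `build_generate_payload` helper and the model-family keys are illustrative, not part of ETS:

```python
# Illustrative request builder for an SGLang-served policy model.
# The stop-token mapping mirrors the step delimiters named in the text;
# the model-family labels here are assumptions, not ETS config keys.
STOP_TOKENS = {
    "llemma": ["ки"],    # step delimiter for Llemma-style models
    "mistral": ["ки"],
    "llama": ["\n\n"],   # a blank line delimits steps for Llama models
}

def build_generate_payload(prompt: str, model_family: str,
                           max_new_tokens: int = 256,
                           temperature: float = 0.8) -> dict:
    """Build the JSON body POSTed to the policy server's /generate route."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            # Generation halts at the step delimiter, so the tree search
            # receives exactly one candidate step per request.
            "stop": STOP_TOKENS[model_family],
        },
    }
```

Because stopping is enforced server-side, the client never has to stream and truncate tokens itself; each HTTP round trip corresponds to one node expansion.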
Usage
Deploy a policy model server before running any ETS tree search experiment. The server must be running and accessible via HTTP before invoking rebase.py. Typical models include Llemma-7B, Mistral-7B, or Llama-3.x variants.
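Since rebase.py assumes a live endpoint, a launch script might block until the policy server answers HTTP before starting the search. A minimal stdlib sketch follows; `wait_for_server` and its polling defaults are illustrative (the server itself is typically started separately, e.g. via SGLang's launch entry point):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout_s: float = 120.0,
                    poll_s: float = 2.0) -> bool:
    """Poll base_url until it answers HTTP, or give up after timeout_s.

    Any HTTP response (even an error status) proves the process is up;
    connection failures mean the server is still loading model weights.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(base_url, timeout=poll_s)
            return True
        except urllib.error.HTTPError:
            return True  # server responded, just not with a 2xx status
        except (urllib.error.URLError, OSError):
            time.sleep(poll_s)  # not accepting connections yet; retry
    return False
```

A wrapper script can then refuse to invoke rebase.py when `wait_for_server` returns `False`, turning a confusing mid-search connection error into an early, explicit failure.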
Theoretical Basis
Model serving for inference-time compute follows the disaggregated inference pattern: separating the generation model from the scoring model allows independent scaling and GPU memory management. In tree search specifically, the policy model must support:
- State forking: Duplicating the KV cache to explore multiple branches from the same prefix
- Batched generation: Processing multiple generation requests in parallel for throughput
- Tensor parallelism: Distributing model weights across GPUs for large models
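The first two requirements can be made concrete with a toy model of server-side state: forking shares the already-computed prefix rather than recomputing it (the memory win behind KV-cache reuse), and a batch of forked states is advanced together. The classes below are an illustrative simulation, not SGLang internals:

```python
from dataclasses import dataclass, field

@dataclass
class PrefixState:
    """Stand-in for a server-side KV cache entry: the tokens seen so far."""
    tokens: list = field(default_factory=list)

class ToyPolicyServer:
    """Toy simulation of state forking and batched generation."""

    def __init__(self):
        self._states = {}
        self._next_id = 0

    def prefill(self, tokens):
        """Create a root state from a prompt (one 'prefill' pass)."""
        sid = self._next_id
        self._next_id += 1
        self._states[sid] = PrefixState(list(tokens))
        return sid

    def fork(self, sid, n):
        """Duplicate a state n times without recomputing the prefix."""
        parent = self._states[sid]
        children = []
        for _ in range(n):
            cid = self._next_id
            self._next_id += 1
            # Share the prefix object (read-only here) instead of copying;
            # this is what makes wide tree expansion cheap server-side.
            self._states[cid] = PrefixState(parent.tokens)
            children.append(cid)
        return children

    def generate_batch(self, sids, step_fn):
        """Produce one candidate step per state in a single batch.

        step_fn stands in for the model's forward pass on each prefix.
        """
        return {sid: step_fn(self._states[sid].tokens) for sid in sids}
```

In a real server the shared prefix is a KV-cache block and `step_fn` is a batched forward pass, but the control flow a tree search relies on (prefill once, fork cheaply, decode in batches) is the same.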