Principle:SqueezeAILab ETS Policy Model Serving

From Leeroopedia
Knowledge Sources
Domains Inference, Model_Serving
Last Updated 2026-02-14 02:00 GMT

Overview

A serving pattern that deploys a language model as an HTTP server for batched text generation during tree search.

Description

In the ETS framework, the policy model (also called the generator) is responsible for producing candidate solution steps at each node of the search tree. Rather than loading the model in-process, it is deployed as a standalone HTTP server using the SGLang framework. This client-server architecture enables:

  • GPU isolation: The policy model runs on a dedicated GPU, preventing memory contention with the reward model
  • Concurrent access: Multiple tree search threads can issue generation requests simultaneously
  • State management: SGLang manages KV cache states server-side, enabling efficient state forking for tree expansion

The policy model server accepts text generation requests and returns completions with configurable stopping conditions (e.g., the step-delimiter token "ки" for Llemma/Mistral models, or "\n\n" for Llama models).
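As a sketch, a client-side request to such a server might be constructed as follows. The `/generate` endpoint and the `sampling_params` field follow SGLang's HTTP API; the host, port, prompt, and sampling values are illustrative assumptions. Only the payload is built here so the example is self-contained:

```python
import json

def build_generate_request(prompt, model_family="llemma"):
    """Build a JSON payload for an SGLang-style /generate endpoint.

    The stop strings mirror the step delimiters described above;
    temperature and max_new_tokens are placeholder values.
    """
    # Step delimiter depends on the model family (see Description above).
    stop = ["ки"] if model_family in ("llemma", "mistral") else ["\n\n"]
    payload = {
        "text": prompt,
        "sampling_params": {
            "temperature": 0.8,
            "max_new_tokens": 256,
            "stop": stop,
        },
    }
    # This payload would be POSTed to e.g. http://localhost:30000/generate
    # (address is an assumption); here we only serialize it.
    return json.dumps(payload)

req = build_generate_request("Question: 2+2=?\nStep 1:")
```

The server truncates generation at the first stop string it emits, which is what lets the tree search treat one completion as one candidate solution step.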

Usage

Deploy a policy model server before running any ETS tree search experiment. The server must be running and accessible via HTTP before invoking rebase.py. Typical models include Llemma-7B, Mistral-7B, or Llama-3.x variants.
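A minimal launch sketch, assuming SGLang's `launch_server` entry point; the model path, GPU index, and port are illustrative, not prescribed by the framework:

```shell
# Pin the policy model to its own GPU to avoid memory contention
# with the reward model (GPU index and port are assumptions).
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path EleutherAI/llemma_7b \
  --port 30000
```

Once the server reports ready, `rebase.py` can be pointed at the chosen host and port.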

Theoretical Basis

Model serving for inference-time compute follows the disaggregated inference pattern: separating the generation model from the scoring model allows independent scaling and GPU memory management. In tree search specifically, the policy model must support:

  • State forking: Duplicating the KV cache to explore multiple branches from the same prefix
  • Batched generation: Processing multiple generation requests in parallel for throughput
  • Tensor parallelism: Distributing model weights across GPUs for large models
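The state-forking requirement can be illustrated with a toy tree in which each child reuses its parent's prefix rather than regenerating it. This is a conceptual sketch only: a real serving stack forks KV-cache entries server-side, not prefix strings, and the class and step text here are hypothetical.

```python
class SearchNode:
    """Toy tree node: each child extends the parent's shared prefix,
    standing in for a server-side KV-cache fork."""

    def __init__(self, prefix, parent=None):
        self.prefix = prefix
        self.parent = parent
        self.children = []

    def fork(self, new_step):
        # "Fork" the state by extending the shared prefix; an inference
        # server would instead duplicate (or reference) cached KV states,
        # so sibling branches never recompute the common prefix.
        child = SearchNode(self.prefix + new_step, parent=self)
        self.children.append(child)
        return child

root = SearchNode("Question: 2+2=?\n")
a = root.fork("Step 1: Add the numbers. ки")
b = root.fork("Step 1: Use a number line. ки")
```

Both children share `root`'s prefix, which is exactly the property a KV-cache fork preserves: the cost of the common prefix is paid once, however many branches the search expands.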

Related Pages

Implemented By
