Principle:SqueezeAILab ETS Policy Model Serving
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Serving |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
A serving pattern that deploys a language model as an HTTP server for batched text generation during tree search.
Description
In the ETS framework, the policy model (also called the generator) is responsible for producing candidate solution steps at each node of the search tree. Rather than loading the model in-process, it is deployed as a standalone HTTP server using the SGLang framework. This client-server architecture enables:
- GPU isolation: The policy model runs on a dedicated GPU, preventing memory contention with the reward model
- Concurrent access: Multiple tree search threads can issue generation requests simultaneously
- State management: SGLang manages KV cache states server-side, enabling efficient state forking for tree expansion
The policy model server accepts text generation requests and returns completions with configurable stopping conditions (e.g., stopping at step-delimiter tokens such as "ки" for Llemma/Mistral models, or "\n\n" for Llama models).
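As a concrete sketch, a tree-search client might assemble its generation request as below. The `/generate` route and the `text`/`sampling_params` body shape follow SGLang's native HTTP API; the `build_generate_payload` helper and the model-family keys are illustrative, not part of ETS:

```python
# Illustrative request builder for an SGLang-served policy model.
# The stop-token mapping mirrors the step delimiters named in the text;
# the model-family labels here are assumptions, not ETS config keys.
STOP_TOKENS = {
    "llemma": ["ки"],    # step delimiter for Llemma-style models
    "mistral": ["ки"],
    "llama": ["\n\n"],   # a blank line delimits steps for Llama models
}

def build_generate_payload(prompt: str, model_family: str,
                           max_new_tokens: int = 256,
                           temperature: float = 0.8) -> dict:
    """Build the JSON body POSTed to the policy server's /generate route."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            # Generation halts at the step delimiter, so the tree search
            # receives exactly one candidate step per request.
            "stop": STOP_TOKENS[model_family],
        },
    }
```

Because stopping is enforced server-side, the client never has to stream and truncate tokens itself; each HTTP round trip corresponds to one node expansion.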
Usage
Deploy a policy model server before running any ETS tree search experiment. The server must be running and accessible via HTTP before invoking rebase.py. Typical models include Llemma-7B, Mistral-7B, or Llama-3.x variants.
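Since rebase.py assumes a live endpoint, a launch script might block until the policy server answers HTTP before starting the search. A minimal stdlib sketch follows; `wait_for_server` and its polling defaults are illustrative (the server itself is typically started separately, e.g. via SGLang's launch entry point):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout_s: float = 120.0,
                    poll_s: float = 2.0) -> bool:
    """Poll base_url until it answers HTTP, or give up after timeout_s.

    Any HTTP response (even an error status) proves the process is up;
    connection failures mean the server is still loading model weights.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(base_url, timeout=poll_s)
            return True
        except urllib.error.HTTPError:
            return True  # server responded, just not with a 2xx status
        except (urllib.error.URLError, OSError):
            time.sleep(poll_s)  # not accepting connections yet; retry
    return False
```

A wrapper script can then refuse to invoke rebase.py when `wait_for_server` returns `False`, turning a confusing mid-search connection error into an early, explicit failure.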
Theoretical Basis
Model serving for inference-time compute follows the disaggregated inference pattern: separating the generation model from the scoring model allows independent scaling and GPU memory management. In tree search specifically, the policy model must support:
- State forking: Duplicating the KV cache to explore multiple branches from the same prefix
- Batched generation: Processing multiple generation requests in parallel for throughput
- Tensor parallelism: Distributing model weights across GPUs for large models
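The first two requirements can be made concrete with a toy model of server-side state: forking shares the already-computed prefix rather than recomputing it (the memory win behind KV-cache reuse), and a batch of forked states is advanced together. The classes below are an illustrative simulation, not SGLang internals:

```python
from dataclasses import dataclass, field

@dataclass
class PrefixState:
    """Stand-in for a server-side KV cache entry: the tokens seen so far."""
    tokens: list = field(default_factory=list)

class ToyPolicyServer:
    """Toy simulation of state forking and batched generation."""

    def __init__(self):
        self._states = {}
        self._next_id = 0

    def prefill(self, tokens):
        """Create a root state from a prompt (one 'prefill' pass)."""
        sid = self._next_id
        self._next_id += 1
        self._states[sid] = PrefixState(list(tokens))
        return sid

    def fork(self, sid, n):
        """Duplicate a state n times without recomputing the prefix."""
        parent = self._states[sid]
        children = []
        for _ in range(n):
            cid = self._next_id
            self._next_id += 1
            # Share the prefix object (read-only here) instead of copying;
            # this is what makes wide tree expansion cheap server-side.
            self._states[cid] = PrefixState(parent.tokens)
            children.append(cid)
        return children

    def generate_batch(self, sids, step_fn):
        """Produce one candidate step per state in a single batch.

        step_fn stands in for the model's forward pass on each prefix.
        """
        return {sid: step_fn(self._states[sid].tokens) for sid in sids}
```

In a real server the shared prefix is a KV-cache block and `step_fn` is a batched forward pass, but the control flow a tree search relies on (prefill once, fork cheaply, decode in batches) is the same.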