Principle:SqueezeAILab ETS Reward Model Serving
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Serving, Reward_Modeling |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
A serving pattern that deploys a Process Reward Model (PRM) as an HTTP server for step-level scoring during tree search.
Description
The Process Reward Model (PRM) provides step-level quality scores that guide the tree search. Unlike outcome reward models that only score complete solutions, PRMs evaluate each intermediate reasoning step, enabling more fine-grained search decisions. The PRM is deployed as a separate SGLang server on its own GPU, with a reduced memory fraction (--mem-fraction-static 0.85) to leave room for a collocated sentence embedding model used for diversity scoring.
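The deployment above can be sketched as a small launcher. This is a minimal sketch, not the project's actual launch script: the model path, port, and GPU id are placeholder assumptions; `sglang.launch_server` and the `--mem-fraction-static` flag are standard SGLang, with the 0.85 value taken from the pattern described above.

```python
import os
import subprocess

def prm_launch_cmd(model_path: str, port: int = 30001) -> list[str]:
    """Build the SGLang launch command for the PRM server (sketch;
    model path and port are placeholders)."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        # Reduced memory fraction leaves headroom for the collocated
        # sentence embedding model used for diversity scoring.
        "--mem-fraction-static", "0.85",
    ]

def launch_prm_server(model_path: str, port: int = 30001, gpu_id: int = 1):
    """Start the PRM server pinned to its own GPU via CUDA_VISIBLE_DEVICES."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
    return subprocess.Popen(prm_launch_cmd(model_path, port), env=env)
```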
The PRM scoring is integrated into the generation pipeline via SGLang's set_score_backend mechanism: after each generation step, the reward server scores it by extracting logits at specific token positions (e.g., token ID 8094 for llemma, or token IDs [648, 387, 12902] for mistral).
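The logit-extraction step can be illustrated as follows. This is a simplified sketch: the exact aggregation ETS applies to the extracted logits is an assumption here (a softmax over the configured quality-tag token IDs, reading off the probability of the positive tag); a single-ID model such as llemma may instead use the raw logit or a sigmoid.

```python
import math

def step_score_from_logits(logits: dict, score_token_ids: list[int],
                           positive_index: int = 0) -> float:
    """Toy scoring sketch (assumption: the real aggregation may differ).
    Pulls the logits at the configured quality-tag token IDs,
    softmax-normalizes them, and returns the probability mass on the
    'positive' tag as the step score."""
    tag_logits = [logits[t] for t in score_token_ids]
    m = max(tag_logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in tag_logits]
    return exps[positive_index] / sum(exps)
```

For example, with the mistral tag IDs above and a strongly positive first logit, the score approaches 1; a uniform logit triple yields 1/3.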
Usage
Deploy a reward model server before running any ETS tree search experiment. The server must be running and accessible via HTTP. It is connected to the tree search via RuntimeEndpoint and used by Tree.expand() to score each generated step.
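The wiring described above might look like the following sketch. RuntimeEndpoint is standard SGLang; where set_score_backend is attached is an assumption based on the description above (it is provided by the ETS fork, not vanilla SGLang), and the URLs are placeholders for wherever the two servers are deployed.

```python
def connect_backends(gen_url: str = "http://localhost:30000",
                     prm_url: str = "http://localhost:30001"):
    """Attach the PRM server as the score backend for tree search.
    Requires both servers to already be running and reachable."""
    import sglang as sgl  # assumes the ETS fork of SGLang

    gen = sgl.RuntimeEndpoint(gen_url)   # generation server
    prm = sgl.RuntimeEndpoint(prm_url)   # reward model server
    sgl.set_default_backend(gen)
    # Hypothetical call site: per the description, Tree.expand() then
    # queries the PRM endpoint to score each generated step.
    gen.set_score_backend(prm)
    return gen, prm
```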
Theoretical Basis
Process Reward Models score intermediate reasoning steps rather than final answers. This enables:
- Early pruning: Low-quality reasoning paths can be identified and abandoned before reaching a final answer
- Step-level guidance: The tree search can allocate more compute budget to promising partial solutions
- Fine-grained credit assignment: Each step receives its own score, enabling better node selection in the ILP formulation
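The early-pruning behavior in the list above can be sketched with a simplified frontier filter. Note this greedy top-k selection is a stand-in for illustration only; as stated above, ETS's actual node selection uses an ILP formulation, and the threshold and beam width here are arbitrary placeholders.

```python
def prune_frontier(nodes: list, scores: list, keep: int = 2,
                   threshold: float = 0.1) -> list:
    """Simplified greedy stand-in for ETS's ILP-based node selection:
    drop partial solutions whose PRM step score falls below `threshold`,
    then keep the `keep` best survivors for further expansion."""
    survivors = [(n, s) for n, s in zip(nodes, scores) if s >= threshold]
    survivors.sort(key=lambda ns: ns[1], reverse=True)
    return [n for n, _ in survivors[:keep]]
```

A low-scoring path is abandoned before a final answer is ever produced, and the remaining compute budget concentrates on the highest-scoring partial solutions.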
The PRM scoring mechanism works by performing a forward pass with max_tokens=0 (no generation) and extracting logits at specific token positions that correspond to quality indicators (e.g., positive/negative step tags).
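A scoring-only request under this mechanism might be shaped like the sketch below. The field names are illustrative assumptions, not the exact SGLang request schema; only the max_tokens=0 forward pass and the quality-indicator token IDs come from the description above.

```python
def make_score_request(partial_solution: str, score_token_ids: list[int]) -> dict:
    """Hypothetical PRM scoring payload (illustrative field names).
    max_new_tokens=0 requests a forward pass with no generation; the
    server then returns logits at the listed quality-tag token IDs."""
    return {
        "text": partial_solution,
        "sampling_params": {"max_new_tokens": 0},  # score only, generate nothing
        "return_logprob": True,
        "score_token_ids": score_token_ids,  # e.g. [8094] for llemma
    }
```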