
Principle:SqueezeAILab ETS Reward Model Serving

From Leeroopedia
Domains Inference, Model_Serving, Reward_Modeling
Last Updated 2026-02-14 02:00 GMT

Overview

A serving pattern that deploys a Process Reward Model (PRM) as an HTTP server for step-level scoring during tree search.

Description

The Process Reward Model (PRM) provides step-level quality scores that guide the tree search. Unlike outcome reward models that only score complete solutions, PRMs evaluate each intermediate reasoning step, enabling more fine-grained search decisions. The PRM is deployed as a separate SGLang server on its own GPU, with a reduced memory fraction (--mem-fraction-static 0.85) to leave room for a collocated sentence embedding model used for diversity scoring.
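A launch command for the reward server might look like the sketch below. The `--mem-fraction-static 0.85` flag is from the description above; the model path, port, and GPU index are placeholders, not values from the repository.

```shell
# Sketch: launch the PRM as a standalone SGLang server on its own GPU.
# <prm-model-path>, the GPU index, and the port are placeholders;
# --mem-fraction-static 0.85 leaves headroom for the co-located
# sentence embedding model used for diversity scoring.
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server \
  --model-path <prm-model-path> \
  --port 30010 \
  --mem-fraction-static 0.85
```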

The PRM scoring is integrated into the generation pipeline via SGLang's set_score_backend mechanism: after each generation step, the reward server scores the step by extracting logits at specific token positions (e.g., token ID 8094 for llemma, or token IDs [648, 387, 12902] for mistral).
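The token-position extraction can be sketched as a small helper. The token IDs are from the text above; summing the probabilities over multiple IDs (the mistral case) is an assumption for illustration, not necessarily the repository's exact aggregation rule.

```python
import math

# Illustrative: turn per-token logprobs at the scoring position into a
# step score. llemma uses a single score token (8094); mistral uses
# several ([648, 387, 12902]). Summing their probabilities is an
# assumed aggregation, shown here only to make the mechanism concrete.
def extract_score(logprobs, score_token_ids):
    return sum(math.exp(logprobs[t]) for t in score_token_ids)

llemma_score = extract_score({8094: -0.1, 1: -3.0}, [8094])
mistral_score = extract_score(
    {648: -1.0, 387: -2.0, 12902: -3.0}, [648, 387, 12902]
)
```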

Usage

Deploy a reward model server before running any ETS tree search experiment. The server must be running and accessible via HTTP. It is connected to the tree search via RuntimeEndpoint and used by Tree.expand() to score each generated step.
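A minimal sketch of what the tree search sends to a running reward server is below. The field names are illustrative rather than the exact SGLang request schema, and in the actual pipeline the server URL is wrapped by SGLang's `RuntimeEndpoint` rather than called directly.

```python
# Hypothetical request builder for step scoring over HTTP.
# max_tokens=0 requests a forward pass with no generation, so only
# logprobs at the scoring position come back; other field names here
# are assumptions, not the verified SGLang API.
def build_score_request(partial_solution, tag_token_ids):
    return {
        "text": partial_solution,
        "sampling_params": {"max_tokens": 0},  # score only, no generation
        "return_logprob": True,
        "token_ids_logprob": tag_token_ids,    # e.g. [8094] for llemma
    }

req = build_score_request("Step 1: factor the quadratic.", [8094])
```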

Theoretical Basis

Process Reward Models score intermediate reasoning steps rather than final answers. This enables:

  • Early pruning: Low-quality reasoning paths can be identified and abandoned before reaching a final answer
  • Step-level guidance: The tree search can allocate more compute budget to promising partial solutions
  • Fine-grained credit assignment: Each step receives its own score, enabling better node selection in the ILP formulation

The PRM scoring mechanism works by performing a forward pass with max_tokens=0 (no generation) and extracting logits at specific token positions that correspond to quality indicators (e.g., positive/negative step tags).
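With positive/negative step tags, the extracted logits can be normalized into a step score in [0, 1]. The two-tag softmax below is a common convention for PRM scoring and is assumed here for illustration; the source does not specify the exact normalization.

```python
import math

# Assumed normalization: probability mass on the "good step" tag
# among the two quality-tag tokens (a numerically stable 2-way softmax).
def step_score(pos_logit, neg_logit):
    m = max(pos_logit, neg_logit)
    e_pos = math.exp(pos_logit - m)
    e_neg = math.exp(neg_logit - m)
    return e_pos / (e_pos + e_neg)

confident = step_score(2.0, -1.0)   # positive tag dominates
uncertain = step_score(0.0, 0.0)    # equal logits -> 0.5
```

Scores near 1 mark steps worth expanding further; scores near 0 mark paths the search can prune early.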
