
Principle:SqueezeAILab ETS Reward Model Serving

From Leeroopedia
Domains Inference, Model_Serving, Reward_Modeling
Last Updated 2026-02-14 02:00 GMT

Overview

A serving pattern that deploys a Process Reward Model (PRM) as an HTTP server for step-level scoring during tree search.

Description

The Process Reward Model (PRM) provides step-level quality scores that guide the tree search. Unlike outcome reward models that only score complete solutions, PRMs evaluate each intermediate reasoning step, enabling more fine-grained search decisions. The PRM is deployed as a separate SGLang server on its own GPU, with a reduced memory fraction (--mem-fraction-static 0.85) to leave room for a collocated sentence embedding model used for diversity scoring.
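A launch command for the reward server might look like the sketch below. The `--mem-fraction-static 0.85` flag is from the description above; the model path, port, and GPU index are placeholders, not values from the repository.

```shell
# Sketch: launch the PRM as a standalone SGLang server on its own GPU.
# <prm-model-path>, the GPU index, and the port are placeholders;
# --mem-fraction-static 0.85 leaves headroom for the co-located
# sentence embedding model used for diversity scoring.
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server \
  --model-path <prm-model-path> \
  --port 30010 \
  --mem-fraction-static 0.85
```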

The PRM scoring is integrated into the generation pipeline via SGLang's set_score_backend mechanism: after each generation step, the reward server scores the step by extracting logits at specific token positions (e.g., token ID 8094 for llemma, or token IDs [648, 387, 12902] for mistral).
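The token-position extraction can be sketched as a small helper. The token IDs are from the text above; summing the probabilities over multiple IDs (the mistral case) is an assumption for illustration, not necessarily the repository's exact aggregation rule.

```python
import math

# Illustrative: turn per-token logprobs at the scoring position into a
# step score. llemma uses a single score token (8094); mistral uses
# several ([648, 387, 12902]). Summing their probabilities is an
# assumed aggregation, shown here only to make the mechanism concrete.
def extract_score(logprobs, score_token_ids):
    return sum(math.exp(logprobs[t]) for t in score_token_ids)

llemma_score = extract_score({8094: -0.1, 1: -3.0}, [8094])
mistral_score = extract_score(
    {648: -1.0, 387: -2.0, 12902: -3.0}, [648, 387, 12902]
)
```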

Usage

Deploy a reward model server before running any ETS tree search experiment. The server must be running and accessible via HTTP. It is connected to the tree search via RuntimeEndpoint and used by Tree.expand() to score each generated step.
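A minimal sketch of what the tree search sends to a running reward server is below. The field names are illustrative rather than the exact SGLang request schema, and in the actual pipeline the server URL is wrapped by SGLang's `RuntimeEndpoint` rather than called directly.

```python
# Hypothetical request builder for step scoring over HTTP.
# max_tokens=0 requests a forward pass with no generation, so only
# logprobs at the scoring position come back; other field names here
# are assumptions, not the verified SGLang API.
def build_score_request(partial_solution, tag_token_ids):
    return {
        "text": partial_solution,
        "sampling_params": {"max_tokens": 0},  # score only, no generation
        "return_logprob": True,
        "token_ids_logprob": tag_token_ids,    # e.g. [8094] for llemma
    }

req = build_score_request("Step 1: factor the quadratic.", [8094])
```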

Theoretical Basis

Process Reward Models score intermediate reasoning steps rather than final answers. This enables:

  • Early pruning: Low-quality reasoning paths can be identified and abandoned before reaching a final answer
  • Step-level guidance: The tree search can allocate more compute budget to promising partial solutions
  • Fine-grained credit assignment: Each step receives its own score, enabling better node selection in the ILP formulation

The PRM scoring mechanism works by performing a forward pass with max_tokens=0 (no generation) and extracting logits at specific token positions that correspond to quality indicators (e.g., positive/negative step tags).
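With positive/negative step tags, the extracted logits can be normalized into a step score in [0, 1]. The two-tag softmax below is a common convention for PRM scoring and is assumed here for illustration; the source does not specify the exact normalization.

```python
import math

# Assumed normalization: probability mass on the "good step" tag
# among the two quality-tag tokens (a numerically stable 2-way softmax).
def step_score(pos_logit, neg_logit):
    m = max(pos_logit, neg_logit)
    e_pos = math.exp(pos_logit - m)
    e_neg = math.exp(neg_logit - m)
    return e_pos / (e_pos + e_neg)

confident = step_score(2.0, -1.0)   # positive tag dominates
uncertain = step_score(0.0, 0.0)    # equal logits -> 0.5
```

Scores near 1 mark steps worth expanding further; scores near 0 mark paths the search can prune early.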
