Principle:OpenRLHF OpenRLHF Process Reward Model Training

Knowledge Sources	Let's Verify Step by Step; Solving Math Word Problems
Domains	Reward_Modeling, Reasoning, Training
Last Updated	2026-02-07 10:40 GMT

Overview

Training methodology that teaches a model to evaluate the correctness of each intermediate reasoning step rather than only the final answer.

Description

Process Reward Modeling (PRM) trains a reward model to assign scores at each step of a reasoning chain. This contrasts with Outcome Reward Models (ORMs) that only evaluate the final result. By providing step-level feedback, PRMs enable more precise credit assignment during RL training, helping identify exactly where reasoning goes wrong. The model predicts step-level labels at designated placeholder token positions in the input sequence. Labels can be hard (correct/incorrect tokens) or soft (float reward values).
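As a minimal sketch of how step labels line up with placeholder positions, the snippet below scatters one label per step onto a per-token label list. The token IDs, the placeholder ID `99`, the helper name `align_step_labels`, and the use of `1`/`0` for hard labels are all hypothetical choices for illustration; a real setup would typically use tokenizer-specific IDs. The `-100` sentinel follows the common ignore-index convention for positions that carry no label.

```python
IGNORE_INDEX = -100  # assumed sentinel: positions that carry no step label


def align_step_labels(input_ids, placeholder_id, step_labels):
    """Return a per-token label list with one step label at each placeholder.

    Hypothetical helper for illustration only.
    """
    positions = [i for i, tok in enumerate(input_ids) if tok == placeholder_id]
    if len(positions) != len(step_labels):
        raise ValueError(
            f"found {len(positions)} placeholders but {len(step_labels)} labels"
        )
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, label in zip(positions, step_labels):
        labels[pos] = label
    return labels


# Three reasoning steps, each terminated by the (hypothetical) placeholder 99.
input_ids = [11, 12, 99, 13, 14, 15, 99, 16, 99]

hard = align_step_labels(input_ids, 99, [1, 1, 0])        # correct / incorrect
soft = align_step_labels(input_ids, 99, [0.9, 0.7, 0.1])  # float reward values
```

Every non-placeholder position keeps the sentinel, so only step boundaries contribute to the loss.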

Usage

Use process reward model training when you need fine-grained feedback on multi-step reasoning tasks such as mathematical problem solving, code generation, or logical deduction. PRMs are particularly valuable for training verifiers that guide search-based inference methods like best-of-N sampling or tree search.
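To illustrate how a trained PRM might guide best-of-N sampling, the sketch below reranks candidate solutions by an aggregate of their step-level scores. The aggregation rules (minimum or product of step probabilities), the function names, and the example scores are assumptions made for illustration; this page does not specify which aggregation to use.

```python
import math


def aggregate(step_probs, how="min"):
    """Collapse per-step correctness probabilities into one solution score."""
    if not step_probs:
        raise ValueError("a solution needs at least one scored step")
    if how == "min":
        return min(step_probs)        # the weakest step decides the score
    if how == "prod":
        return math.prod(step_probs)  # treats the steps as independent
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates, how="min"):
    """candidates: list of (text, step_probs). Return the top-scoring text."""
    return max(candidates, key=lambda c: aggregate(c[1], how))[0]


# Made-up PRM scores for three sampled solutions.
candidates = [
    ("solution A", [0.95, 0.90, 0.20]),  # strong start, bad final step
    ("solution B", [0.80, 0.85, 0.75]),  # consistently plausible
    ("solution C", [0.99, 0.40, 0.90]),  # one doubtful middle step
]
best = best_of_n(candidates)
```

Both aggregations penalize a single bad step, which is the behavior an outcome-only score cannot provide.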

Theoretical Basis

At each step boundary (marked by a placeholder token), the PRM predicts the probability that the reasoning so far is correct:

$P(\text{correct} \mid x, s_{1}, \dots, s_{k}) = \sigma(f_{\theta}(x, s_{1}, \dots, s_{k}))$

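Where $x$ is the input problem, $s_{k}$ is the $k$-th reasoning step, $f_{\theta}$ is the model's score at the $k$-th placeholder position, and $\sigma$ is the logistic sigmoid.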

Training objective: $L_{\mathrm{PRM}} = -\sum_{k=1}^{K} \left[ y_{k} \log P_{k} + (1 - y_{k}) \log(1 - P_{k}) \right]$

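Where $y_{k} \in \{0, 1\}$ (or a soft label in $[0, 1]$) indicates the correctness of step $k$, $P_{k}$ is the predicted probability that the reasoning is correct through step $k$, and $K$ is the number of steps.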

Pseudo-code Logic:

# Abstract algorithm (NOT actual implementation)
logits = model(input_with_placeholder_tokens)
loss = 0
num_correct = 0
for each placeholder_position:
    step_pred = logits[placeholder_position]
    step_label = labels[placeholder_position]
    loss += cross_entropy(step_pred, step_label)
    # accuracy is only meaningful for hard labels
    num_correct += (argmax(step_pred) == step_label)
accuracy = num_correct / number_of_placeholder_positions
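The abstract loop above can be turned into a small runnable sketch for the hard-label case. It assumes each placeholder position carries a two-way score vector `[incorrect, correct]` and a class-index label. That layout, the function names, and the choice to report the mean loss over placeholders are illustrative assumptions, not a description of the actual OpenRLHF trainer.

```python
import math


def log_softmax(scores):
    """Stable log-softmax over a short list of class scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def prm_step_metrics(logits, labels, placeholder_positions):
    """Mean cross-entropy and accuracy over the placeholder positions only."""
    if not placeholder_positions:
        raise ValueError("sequence contains no placeholder positions")
    total_loss, hits = 0.0, 0
    for pos in placeholder_positions:
        logp = log_softmax(logits[pos])
        target = labels[pos]
        total_loss += -logp[target]  # cross-entropy for this step
        predicted = max(range(len(logp)), key=logp.__getitem__)
        hits += int(predicted == target)
    n = len(placeholder_positions)
    return total_loss / n, hits / n


# Five token positions; positions 1 and 3 are step boundaries.
logits = [[0.0, 0.0], [0.0, 2.0], [0.0, 0.0], [1.0, -1.0], [0.0, 0.0]]
labels = [-100, 1, -100, 1, -100]  # -100 marks unlabeled positions
loss, accuracy = prm_step_metrics(logits, labels, [1, 3])
```

Here the first step is predicted correctly and the second is not, so `accuracy` is 0.5 and the second step contributes most of the loss.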
