Principle:OpenRLHF OpenRLHF Process Reward Model Training
| Knowledge Sources | |
|---|---|
| Domains | Reward_Modeling, Reasoning, Training |
| Last Updated | 2026-02-07 10:40 GMT |
Overview
Training methodology that teaches a model to evaluate the correctness of each intermediate reasoning step rather than only the final answer.
Description
Process Reward Modeling (PRM) trains a reward model to assign scores at each step of a reasoning chain. This contrasts with Outcome Reward Models (ORMs) that only evaluate the final result. By providing step-level feedback, PRMs enable more precise credit assignment during RL training, helping identify exactly where reasoning goes wrong. The model predicts step-level labels at designated placeholder token positions in the input sequence. Labels can be hard (correct/incorrect tokens) or soft (float reward values).
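To illustrate how step labels line up with placeholder positions, here is a minimal sketch. The `<step>` placeholder string, the `+`/`-` hard-label symbols, and the helper name `align_step_labels` are all hypothetical choices made for this example, not names taken from OpenRLHF.

```python
from typing import List, Optional, Union

PLACEHOLDER = "<step>"  # hypothetical placeholder token, chosen for illustration

Label = Union[str, float]


def align_step_labels(tokens: List[str], step_labels: List[Label]) -> List[Optional[Label]]:
    """Put each step label at its placeholder position; every other position gets None."""
    positions = [i for i, tok in enumerate(tokens) if tok == PLACEHOLDER]
    if len(positions) != len(step_labels):
        raise ValueError("expected exactly one label per placeholder token")
    aligned: List[Optional[Label]] = [None] * len(tokens)
    for pos, label in zip(positions, step_labels):
        aligned[pos] = label
    return aligned


# Two reasoning steps, each closed by a placeholder; the second step is wrong.
tokens = ["2*3", "=", "6", PLACEHOLDER, "6+1", "=", "8", PLACEHOLDER]

hard = align_step_labels(tokens, ["+", "-"])   # hard labels: correct / incorrect symbols
soft = align_step_labels(tokens, [0.9, 0.1])   # soft labels: float reward values
```

Positions that hold `None` carry no supervision in this sketch; only the placeholder positions contribute to the loss.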
Usage
Use process reward model training when you need fine-grained feedback on multi-step reasoning tasks such as mathematical problem solving, code generation, or logical deduction. PRMs are particularly valuable for training verifiers that guide search-based inference methods like best-of-N sampling or tree search.
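As a rough sketch of how a trained PRM might guide best-of-N sampling, the example below ranks candidate solutions by their weakest step. Taking the minimum step score is one common aggregation choice and is an assumption here; the source does not prescribe one. The scores are hand-written illustrative values, not model outputs, and the function names are hypothetical.

```python
from typing import Dict, List


def aggregate_min(step_scores: List[float]) -> float:
    """Score a whole solution by its weakest step (an assumed aggregation rule)."""
    return min(step_scores)


def select_best(candidates: Dict[str, List[float]]) -> str:
    """Return the name of the candidate whose aggregated step score is highest."""
    return max(candidates, key=lambda name: aggregate_min(candidates[name]))


# Per-step correctness scores for three sampled solutions (illustrative values).
candidates = {
    "A": [0.9, 0.8, 0.2],
    "B": [0.7, 0.75, 0.7],
    "C": [0.95, 0.4, 0.9],
}

best = select_best(candidates)
```

Candidate A has the strongest individual steps but one very weak step, so a min-based rule prefers the uniformly plausible candidate B. Other aggregations, such as the product of step scores, would fit the same interface.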
Theoretical Basis
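At each step boundary (marked by a placeholder token), the PRM predicts the probability that the reasoning so far is correct:
$$p_k = P(\text{correct} \mid s_1, \ldots, s_k)$$
Where $s_k$ is the k-th reasoning step.
Training objective:
$$\mathcal{L} = -\sum_{k} \left[ y_k \log p_k + (1 - y_k) \log (1 - p_k) \right]$$
Where $y_k \in \{0, 1\}$ (or a soft label in [0,1]) indicates step correctness.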
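The following sketch shows how hard and soft labels enter a step-level objective. It assumes the objective is a binary cross-entropy over per-step correctness probabilities, which is consistent with the `cross_entropy` call in the pseudo-code but is still an assumption. The probabilities and labels are made-up values, and `step_bce` is a hypothetical helper.

```python
import math


def step_bce(p: float, y: float) -> float:
    """Binary cross-entropy for one step.

    p: predicted probability that the reasoning so far is correct.
    y: label in [0, 1]; exactly 0 or 1 for hard labels, anything between for soft labels.
    """
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


step_probs = [0.9, 0.6, 0.2]     # predicted P(correct) at three step boundaries
hard_labels = [1.0, 1.0, 0.0]    # first two steps correct, third incorrect
soft_labels = [0.95, 0.5, 0.1]   # float reward values for the same steps

hard_loss = sum(step_bce(p, y) for p, y in zip(step_probs, hard_labels))
soft_loss = sum(step_bce(p, y) for p, y in zip(step_probs, soft_labels))
```

With hard labels only one of the two log terms is active per step; with soft labels both terms contribute, weighted by the label value.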
Pseudo-code Logic:
# Abstract algorithm (NOT actual implementation)
logits = model(input_with_placeholder_tokens)
loss = 0
for each placeholder_position:
    step_pred = logits[placeholder_position]
    step_label = labels[placeholder_position]
    loss += cross_entropy(step_pred, step_label)
accuracy = (predicted_labels == true_labels).mean()
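The pseudo-code can be made concrete with a small standard-library sketch. It assumes each position carries two logits, ordered `[incorrect, correct]`, and that hard labels are the integers 0 and 1; both conventions, the function names, and the toy numbers are assumptions for illustration rather than the OpenRLHF implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prm_loss_and_accuracy(
    logits: Sequence[Sequence[float]],
    placeholder_positions: Sequence[int],
    step_labels: Sequence[int],
) -> Tuple[float, float]:
    """Sum cross-entropy over placeholder positions and report step accuracy.

    logits[i] holds [score_incorrect, score_correct] for sequence position i (assumed layout).
    step_labels[j] is 0 or 1 for the j-th placeholder.
    """
    loss = 0.0
    hits = 0
    for pos, label in zip(placeholder_positions, step_labels):
        probs = softmax(logits[pos])
        loss += -math.log(probs[label])  # cross-entropy against the hard label
        predicted = max(range(len(probs)), key=probs.__getitem__)
        hits += int(predicted == label)
    return loss, hits / len(placeholder_positions)


# Five sequence positions; only positions 2 and 4 are placeholders.
logits = [[0.0, 0.0], [0.0, 0.0], [-1.0, 1.0], [0.0, 0.0], [0.5, -0.5]]
loss, accuracy = prm_loss_and_accuracy(logits, [2, 4], [1, 1])
```

As in the pseudo-code, the loss is accumulated as a sum over steps; dividing by the number of placeholders would give a per-step mean instead. Logits at non-placeholder positions are never read.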