Principle:Alibaba ROLL Reference Log Probability
| Knowledge Sources | |
|---|---|
| Domains | Alignment, LLM_Inference |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
An inference principle for computing per-token log probabilities from a frozen reference model, which serve as the baseline in Direct Preference Optimization (DPO).
Description
The DPO loss requires comparing the policy model's log probabilities with those of a fixed reference model. This step computes the reference log probabilities for both chosen and rejected responses in a batch. The reference model is never updated during training, providing a stable baseline.
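The per-token computation described above can be sketched in dependency-free Python. This is only an illustration, not ROLL's implementation: the names `log_softmax` and `reference_logprobs`, the assumption that `labels` are already shifted so that `step_logits[t]` scores `labels[t]`, and the 0/1 `mask` marking response tokens are all hypothetical. A real pipeline would run the same arithmetic on batched tensors.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reference_logprobs(
    step_logits: Sequence[Sequence[float]],
    labels: Sequence[int],
    mask: Sequence[int],
) -> List[float]:
    """Per-token log probabilities of `labels` under the reference model.

    Assumption: step_logits[t] is the reference model's output for predicting
    labels[t]. Positions with mask == 0 (e.g. prompt tokens) contribute 0.0.
    """
    out = []
    for logits, label, keep in zip(step_logits, labels, mask):
        out.append(log_softmax(logits)[label] if keep else 0.0)
    return out


# Toy vocabulary of size 3; the same routine is applied to both responses.
chosen_lp = reference_logprobs(
    step_logits=[[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]],
    labels=[0, 2],
    mask=[1, 1],
)
rejected_lp = reference_logprobs(
    step_logits=[[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]],
    labels=[2, 0],
    mask=[1, 1],
)

# Sequence-level log probability is the sum over response tokens.
chosen_seq_lp = sum(chosen_lp)
rejected_seq_lp = sum(rejected_lp)
```

Masked positions are set to zero so that summing the list gives the log probability of the response alone.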
Usage
Run this step before the DPO loss computation. Because the reference model is frozen, its log probabilities are computed once per batch and cached for use in the loss.
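One way to picture the compute-once-and-cache pattern is the sketch below. It assumes a batch represented as a plain dictionary and a caller-supplied scoring function. The helper `attach_reference_logprobs` and the keys `chosen`, `rejected`, `ref_chosen_logps`, and `ref_rejected_logps` are hypothetical names chosen for illustration, not identifiers taken from ROLL.

```python
from typing import Callable, Dict, List

Batch = Dict[str, list]


def attach_reference_logprobs(
    batch: Batch,
    score_fn: Callable[[List[int]], float],
) -> Batch:
    """Store reference log probabilities on the batch, computing them only once.

    `score_fn` stands in for a forward pass of the frozen reference model that
    returns the sequence-level log probability of one response.
    """
    if "ref_chosen_logps" not in batch:
        batch["ref_chosen_logps"] = [score_fn(seq) for seq in batch["chosen"]]
        batch["ref_rejected_logps"] = [score_fn(seq) for seq in batch["rejected"]]
    return batch


def toy_score(seq: List[int]) -> float:
    """Placeholder scorer: pretends each token has log probability -0.5."""
    return -0.5 * len(seq)


batch = {"chosen": [[11, 12, 13]], "rejected": [[21, 22]]}
attach_reference_logprobs(batch, toy_score)
# A second call finds the cached values and skips the reference forward pass.
attach_reference_logprobs(batch, toy_score)
```

Caching on the batch means the loss step can read the stored values without running the reference model again.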
Theoretical Basis
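For orientation, the DPO objective as it is usually stated is reproduced here, with the reference terms this step supplies. The symbols are the conventional ones and are not taken from this page: $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ the frozen reference model, $x$ the prompt, $y_w$ and $y_l$ the chosen and rejected responses, $\beta$ a temperature, and $\sigma$ the logistic function. Whether ROLL's loss matches this form exactly is an assumption.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right],
\qquad
\log \pi_{\mathrm{ref}}(y \mid x)
  = \sum_{t=1}^{|y|} \log \pi_{\mathrm{ref}}\!\left(y_t \mid x,\, y_{<t}\right).
```

The second identity is why per-token log probabilities are the quantity of interest: summing them over the response tokens gives the sequence-level terms $\log \pi_{\mathrm{ref}}(y_w \mid x)$ and $\log \pi_{\mathrm{ref}}(y_l \mid x)$ that appear in the loss. Since $\pi_{\mathrm{ref}}$ has no trainable parameters, these terms are constants with respect to $\theta$.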
Related Pages
Implemented By
Related Heuristics
The following heuristics inform this principle: