Principle:Alibaba ROLL Reference Log Probability
| Knowledge Sources | |
|---|---|
| Domains | Alignment, LLM_Inference |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
An inference principle for computing per-token log probabilities from a frozen reference model to serve as the DPO baseline.
Description
The DPO loss compares the policy model's log probabilities with those of a fixed reference model. This step computes the reference log probabilities for both the chosen and the rejected responses in a batch. Because the reference model is never updated during training, it needs only a forward pass without gradients and provides a stable baseline.
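A minimal sketch of the computation, in plain Python so the arithmetic is visible: take the reference model's logits at each position, apply a log-softmax, pick out the log probability of the token that actually occurred, and sum over the response positions. The function names, the `response_mask` convention, and the toy logits are assumptions made for illustration and are not taken from the ROLL code base.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_log_prob(logits, token_ids, response_mask):
    """Sum per-token log probs over positions where response_mask is 1.

    Assumes logits[t] is the distribution that predicts token_ids[t],
    i.e. any shift-by-one alignment has already been applied.
    """
    total = 0.0
    for step_logits, tok, keep in zip(logits, token_ids, response_mask):
        if keep:
            total += log_softmax(step_logits)[tok]
    return total

# Toy reference-model outputs (vocabulary of 4 tokens). Position 0 is the
# prompt and is masked out; positions 1-2 belong to the response.
chosen = {
    "logits": [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 3.0, 0.2, 0.2]],
    "token_ids": [0, 2, 1],
    "response_mask": [0, 1, 1],
}
rejected = {
    "logits": [[2.0, 0.5, 0.1, -1.0], [0.3, 0.1, 0.0, 1.2], [0.5, 0.5, 2.0, 0.1]],
    "token_ids": [0, 3, 2],
    "response_mask": [0, 1, 1],
}

ref_chosen_logp = sequence_log_prob(**chosen)
ref_rejected_logp = sequence_log_prob(**rejected)
```

In a real training setup the same gather-and-sum would run on batched tensors with gradient tracking disabled, since the reference model's parameters are frozen.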
Usage
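Run this step before computing the DPO loss. Reference log probabilities are computed once per batch and cached, so the loss computation can reuse them without another forward pass through the reference model.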
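The compute-once-and-cache pattern can be sketched as follows. This is an illustrative assumption about how the cached values might be stored and consumed, not ROLL's actual interface: `attach_reference_log_probs`, the batch keys, and the stand-in `fake_ref_log_prob` scorer are all hypothetical names.

```python
import math

def attach_reference_log_probs(batch, ref_log_prob_fn):
    """Compute reference log probs once and store them on the batch dict."""
    if "ref_chosen_logp" not in batch:
        batch["ref_chosen_logp"] = [ref_log_prob_fn(s) for s in batch["chosen"]]
        batch["ref_rejected_logp"] = [ref_log_prob_fn(s) for s in batch["rejected"]]
    return batch

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean of -log(sigmoid(beta * (policy margin - reference margin)))."""
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        m = beta * ((pc - rc) - (pr - rr))
        # -log(sigmoid(m)) == log(1 + exp(-m)), written in an overflow-safe form.
        losses.append(max(-m, 0.0) + math.log1p(math.exp(-abs(m))))
    return sum(losses) / len(losses)

# Hypothetical stand-in for a frozen reference model: it scores a sequence
# and counts how many times it was invoked.
calls = {"n": 0}
def fake_ref_log_prob(sequence):
    calls["n"] += 1
    return -0.5 * len(sequence)

batch = {"chosen": [[1, 2, 3], [4, 5]], "rejected": [[1, 2], [4, 5, 6, 7]]}
attach_reference_log_probs(batch, fake_ref_log_prob)
attach_reference_log_probs(batch, fake_ref_log_prob)  # second call reuses the cache

# Policy log probs would come from the trainable model; fixed numbers here.
loss = dpo_loss(
    policy_chosen=[-1.0, -0.8],
    policy_rejected=[-1.5, -2.5],
    ref_chosen=batch["ref_chosen_logp"],
    ref_rejected=batch["ref_rejected_logp"],
)
```

Storing the values on the batch keeps them next to the data they describe, so any later step that receives the batch can read them without access to the reference model.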
Theoretical Basis
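For orientation, the DPO objective in its commonly stated form is sketched below, together with the decomposition of a sequence log probability into per-token terms. The notation is assumed here rather than taken from ROLL: $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ the frozen reference model, $\beta$ a scaling coefficient, $x$ the prompt, and $y_w$, $y_l$ the chosen and rejected responses.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\!\left( \beta \left[
      \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right] \right) \right]

\log \pi_{\mathrm{ref}}(y \mid x)
  = \sum_{t=1}^{|y|} \log \pi_{\mathrm{ref}}(y_t \mid x,\, y_{<t})
```

The two $\pi_{\mathrm{ref}}$ terms do not depend on $\theta$, so they are constants with respect to the gradient and can be computed ahead of the loss.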