Principle:Alibaba ROLL Reference Log Probability

Knowledge Sources	DPO Alibaba ROLL
Domains	Alignment, LLM_Inference
Last Updated	2026-02-07 20:00 GMT

Overview

An inference principle for computing per-token log probabilities from a frozen reference model, which serve as the baseline in Direct Preference Optimization (DPO).

Description

The DPO loss compares the policy model's log probabilities against those of a fixed reference model. This step runs the reference model over each batch and computes its log probabilities for both the chosen and the rejected responses. Because the reference model is never updated during training, these values provide a stable baseline.
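To make the role of the reference values concrete, here is a minimal sketch of how they could enter the standard DPO objective. It assumes the sequence-level log probabilities are already available as plain floats; the function name `dpo_loss` and the default `beta=0.1` are illustrative choices, not taken from the ROLL code base.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    All four inputs are sequence-level log probabilities. `beta` is a
    hypothetical default chosen for this example only.
    """
    # How much more the policy prefers "chosen" over "rejected",
    # measured relative to the frozen reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# When the policy matches the reference exactly, the margin is zero
# and the loss equals log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

Since the reference terms only shift the margin, they can be computed without gradients and reused wherever the same batch is scored again.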

Usage

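Run this step before the DPO loss computation. Reference log probabilities are computed once per batch and cached, so the loss step can reuse them without another forward pass through the reference model.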

Theoretical Basis

$\log \pi_{\mathrm{ref}}(y \mid x) = \sum_{t} \log \pi_{\mathrm{ref}}(y_{t} \mid y_{<t}, x)$
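The sum over response tokens can be sketched in a few lines of plain Python. This is an illustrative example, not the ROLL implementation: the helper names `log_softmax` and `sequence_log_prob` are hypothetical, and the logits at position t are assumed to be already aligned with the token y_t they predict. A response mask restricts the sum to response tokens, excluding prompt and padding positions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_log_prob(step_logits, token_ids, response_mask):
    """Sum per-token log probabilities over the masked (response) positions.

    step_logits[t] holds the reference model's logits for predicting
    token_ids[t]; positions with response_mask[t] == 0 are skipped.
    """
    total = 0.0
    for logits, tok, keep in zip(step_logits, token_ids, response_mask):
        if keep:
            total += log_softmax(logits)[tok]
    return total

# Toy example with a vocabulary of 3 tokens and 2 response positions.
step_logits = [
    [0.0, 0.0, 0.0],                                  # uniform: p = 1/3 each
    [math.log(1.0), math.log(2.0), math.log(1.0)],    # p = 1/4, 1/2, 1/4
]
token_ids = [2, 1]
ref_logp = sequence_log_prob(step_logits, token_ids, [1, 1])  # log(1/3) + log(1/2)
```

In a DPO batch the same computation would be applied twice per example, once to the chosen response and once to the rejected one.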

Related Pages

Implemented By

Implementation:Alibaba_ROLL_DPO_ActorWorker_Compute_Log_Probs

Related Heuristics

The following heuristics inform this principle:

Heuristic:Alibaba_ROLL_Numerical_Stability_Epsilon
