Principle:Alibaba ROLL Reference Log Probability

From Leeroopedia


Knowledge Sources
Domains Alignment, LLM_Inference
Last Updated 2026-02-07 20:00 GMT

Overview

An inference-only principle: compute per-token log probabilities under a frozen reference model so that they can serve as the baseline in the DPO (Direct Preference Optimization) loss.

Description

The DPO loss requires comparing the policy model's log probabilities with those of a fixed reference model. This step computes the reference log probabilities for both chosen and rejected responses in a batch. The reference model is never updated during training, providing a stable baseline.
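To make the comparison concrete, the following is a minimal sketch of how the reference log probabilities enter the standard DPO loss for a single preference pair. It is not taken from the ROLL codebase: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the inputs are assumed to be sequence-level log probabilities (already summed over response tokens).

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are sequence log probabilities. `beta` is an
    illustrative default, not a value taken from the source.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen - ref_chosen
    rejected_ratio = policy_rejected - ref_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy matches the reference exactly, the margin is zero
# and the loss equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# When the policy favors the chosen response more than the reference does,
# the loss drops below that baseline.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Because the reference terms are constants with respect to the policy parameters, they can be computed without gradient tracking and reused wherever the loss is evaluated for that batch.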

Usage

Run this step before the DPO loss computation. Reference log probabilities are computed once per batch and cached, so the loss step can reuse them rather than running the reference model again.
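One possible reading of "computed once per batch and cached" is sketched below: the reference log probabilities are attached to the batch the first time they are requested and reused afterwards. The helper `attach_reference_log_probs`, the batch keys (`prompts`, `chosen`, `rejected`, `ref_chosen_logps`, `ref_rejected_logps`), and the `ref_log_prob_fn` callable are all hypothetical names for illustration and are not ROLL's actual API.

```python
from typing import Callable, Dict, List


def attach_reference_log_probs(
    batch: Dict[str, List],
    ref_log_prob_fn: Callable[[str, str], float],
) -> Dict[str, List]:
    """Compute reference log probs for chosen and rejected responses once.

    `ref_log_prob_fn(prompt, response)` stands in for a forward pass of the
    frozen reference model (hypothetical signature). If the batch already
    carries cached values, the reference model is not called again.
    """
    if "ref_chosen_logps" not in batch:
        batch["ref_chosen_logps"] = [
            ref_log_prob_fn(p, y) for p, y in zip(batch["prompts"], batch["chosen"])
        ]
        batch["ref_rejected_logps"] = [
            ref_log_prob_fn(p, y) for p, y in zip(batch["prompts"], batch["rejected"])
        ]
    return batch


# Toy usage: a fake "reference model" that counts how often it is called.
calls = {"n": 0}


def fake_ref(prompt: str, response: str) -> float:
    calls["n"] += 1
    return -float(len(response))  # placeholder score, not a real log prob


batch = {"prompts": ["q1", "q2"], "chosen": ["good", "fine"], "rejected": ["bad", "no"]}
attach_reference_log_probs(batch, fake_ref)
attach_reference_log_probs(batch, fake_ref)  # second call reuses the cache
```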

Theoretical Basis

$$\log \pi_{\text{ref}}(y \mid x) = \sum_{t} \log \pi_{\text{ref}}(y_t \mid y_{<t}, x)$$
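The sum over token positions can be illustrated with a small framework-free sketch. It assumes the reference model's logits have already been aligned so that `step_logits[t]` scores the token at position `t`, and that a 0/1 `response_mask` marks which positions belong to the response rather than the prompt or padding. The function names and this data layout are assumptions for illustration; a real implementation would typically operate on batched tensors with gradient tracking disabled.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_log_prob(
    step_logits: Sequence[Sequence[float]],
    token_ids: Sequence[int],
    response_mask: Sequence[int],
) -> float:
    """Sum log pi_ref(y_t | y_<t, x) over response positions only.

    step_logits[t]   : logits scoring the token at position t (assumed pre-aligned)
    token_ids[t]     : id of the token actually present at position t
    response_mask[t] : 1 for response tokens, 0 for prompt or padding
    """
    total = 0.0
    for logits, tok, keep in zip(step_logits, token_ids, response_mask):
        if keep:
            total += log_softmax(logits)[tok]
    return total


# Toy example: vocabulary of 4 tokens with uniform logits, so each token
# has probability 1/4. Two of the three positions are response tokens.
uniform = [0.0, 0.0, 0.0, 0.0]
seq_lp = sequence_log_prob([uniform, uniform, uniform], [2, 0, 3], [0, 1, 1])
```

In this toy case the result is the sum of two per-token terms, each equal to log(1/4); masked positions contribute nothing.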

Related Pages

Implemented By

Related Heuristics

The following heuristics inform this principle:
