
Principle:Alibaba ROLL MCoreAdapter DPO Training

From Leeroopedia


Knowledge Sources
Domains: Training, DPO, Preference_Optimization
Last Updated: 2026-02-07 20:00 GMT

Overview

A preference-optimization training scheme that computes log-probability ratios between chosen and rejected responses against a frozen reference model, supporting both the DPO sigmoid loss and the ORPO odds-ratio loss in a distributed, pipeline-parallel setting.

Description

Direct Preference Optimization (DPO) is an offline reinforcement learning method for aligning language models with human preferences. Unlike RLHF with PPO, DPO directly optimizes the policy model using pairs of preferred (chosen) and dispreferred (rejected) responses without training a separate reward model. The key insight is that the optimal policy under a KL-constrained reward maximization objective can be expressed as a function of the log-probability ratio between the policy and a reference model.
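The core objective described above can be sketched for a single preference pair. This is a minimal scalar illustration, not the distributed implementation; the function name `dpo_loss` and its argument names are assumptions, and each log-probability argument is the summed sequence log-probability of a response.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO sigmoid loss for one (chosen, rejected) pair.

    beta scales the implicit KL constraint against the reference model.
    """
    # Log-ratio margin between chosen and rejected responses
    logits = (policy_chosen_logp - ref_chosen_logp) \
           - (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid(beta * logits)
    return -math.log(1.0 / (1.0 + math.exp(-beta * logits)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; widening the margin in favor of the chosen response drives the loss toward zero.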

This principle describes a distributed DPO trainer that extends the base distributed trainer with:

  1. Reference Model Management: A frozen copy of the model computes reference log-probabilities for both chosen and rejected responses. The reference model runs in forward-only mode through the same pipeline-parallel scheduling as the policy model, and its outputs are injected into the training data before the policy forward pass.
  2. Two-Phase Training Step: Each training step consists of (a) a forward-only pass through the reference model to gather reference log-probs, followed by (b) a full forward-backward pass through the policy model using the DPO loss. Both phases use the pipeline-parallel scheduling function with appropriate micro-batch sizing (doubled batch size, since chosen and rejected pairs are concatenated).
  3. Vocabulary-Parallel Log-Probabilities: Token-level log-probabilities are computed using the vocabulary-parallel log-prob function that handles tensor-parallel logit sharding, then summed over the sequence dimension with a loss mask and reduced across context-parallel ranks.
  4. Dual Loss Support: The trainer supports both the DPO sigmoid loss (requiring a reference model) and the ORPO odds-ratio loss (reference-free, using the model's own log-probabilities as both policy and reference).
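Step 3 above can be illustrated on a single device, without tensor or context parallelism. This is a hedged sketch under those assumptions; the function name `sequence_logprob` and the list-based layout are illustrative, not the trainer's actual API.

```python
import math

def sequence_logprob(token_logits, target_ids, loss_mask):
    """Masked sum of per-token log-probabilities for one sequence.

    token_logits: list of per-position logit rows (one row per token).
    target_ids:   label token id at each position.
    loss_mask:    1.0 for response tokens, 0.0 for prompt/padding.
    """
    total = 0.0
    for logits, target, mask in zip(token_logits, target_ids, loss_mask):
        # log-softmax: logit of the target minus the log partition function
        log_z = math.log(sum(math.exp(x) for x in logits))
        total += mask * (logits[target] - log_z)
    return total
```

In the distributed setting, the log-softmax denominator is instead reduced across tensor-parallel logit shards, and the masked sum is reduced across context-parallel ranks.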

Usage

Use this principle when:

  • Aligning a language model using preference pairs (chosen/rejected responses) with either DPO or ORPO loss.
  • The model uses multi-dimensional parallelism (tensor, pipeline, expert, context) and the preference training must work correctly across all parallel dimensions.
  • You need reference-model log-probabilities computed efficiently through pipeline-parallel scheduling without duplicating the model's memory footprint for the reference.

Theoretical Basis

DPO Sigmoid Loss:

\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)

where yw is the chosen response, yl is the rejected response, πθ is the policy, πref is the frozen reference, and β controls the strength of the KL constraint.

With label smoothing ϵ:

\mathcal{L} = -(1-\epsilon)\log\sigma(\beta \cdot \text{logits}) - \epsilon\,\log\sigma(-\beta \cdot \text{logits})
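The label-smoothed variant can be sketched in pure Python. This is an illustration, not the trainer's code; `dpo_loss_smoothed` is an assumed name, `logits` is the scalar log-ratio margin from the DPO loss above, and ϵ mixes in the loss for the flipped preference label.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss_smoothed(logits, beta=0.1, eps=0.0):
    """DPO loss with label smoothing eps on the preference label."""
    return -(1 - eps) * log_sigmoid(beta * logits) \
           - eps * log_sigmoid(-beta * logits)
```

At eps = 0 this reduces to the plain sigmoid loss; at eps = 0.5 the loss is symmetric in the sign of the margin, reflecting maximal uncertainty about which response is preferred.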

ORPO Odds-Ratio Loss:

\log\text{odds}(y \mid x) = \log p(y \mid x) - \log\!\left(1 - p(y \mid x)\right)

\mathcal{L}_{\text{ORPO}} = -\frac{\log p_w}{|y_w|} - \beta\,\log\sigma\!\left(\log\text{odds}_w - \log\text{odds}_l\right)

where log-probabilities are normalized by response length to prevent length bias.
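The reference-free ORPO loss can be sketched for one preference pair. This is a scalar illustration under assumed names (`orpo_loss` and its arguments are not the trainer's API); inputs are summed sequence log-probabilities and response lengths in tokens.

```python
import math

def orpo_loss(chosen_logp, rejected_logp, chosen_len, rejected_len, beta=0.1):
    """ORPO odds-ratio loss for one (chosen, rejected) pair, reference-free."""
    # Length-normalize to mean per-token log-probs to avoid length bias
    lp_w = chosen_logp / chosen_len
    lp_l = rejected_logp / rejected_len

    def log_odds(lp):
        # log p - log(1 - p), from a (negative) log-probability
        return lp - math.log1p(-math.exp(lp))

    ratio = log_odds(lp_w) - log_odds(lp_l)
    sft_term = -lp_w                                     # NLL of the chosen response
    or_term = -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)
    return sft_term + beta * or_term
```

The SFT term keeps the chosen response likely while the odds-ratio term pushes the chosen and rejected responses apart, with no reference model involved.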

Two-phase training step:

# Phase 1: Reference log-probs (forward only, no gradients)
ref_logprobs = forward_backward_func(
    model=ref_model, forward_only=True, collect_non_loss_data=True)

# Inject reference log-probs into the training data
for data in micro_batches:
    data["reference_chosen_logps"] = ref_logprobs.chosen
    data["reference_rejected_logps"] = ref_logprobs.rejected

# Phase 2: Policy training (forward + backward)
metrics = forward_backward_func(
    model=policy_model, forward_only=False)
optimizer.step()

Reward computation:

r_w = \beta\left(\log\pi_\theta(y_w) - \log\pi_{\text{ref}}(y_w)\right)

r_l = \beta\left(\log\pi_\theta(y_l) - \log\pi_{\text{ref}}(y_l)\right)

\text{accuracy} = \mathbb{1}[r_w > r_l]
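The implicit rewards and pairwise accuracy above are typically computed for logging only. A minimal sketch, with `dpo_rewards` and its argument names assumed rather than taken from the trainer:

```python
def dpo_rewards(policy_chosen_logp, ref_chosen_logp,
                policy_rejected_logp, ref_rejected_logp, beta=0.1):
    """Implicit DPO rewards and pairwise accuracy for one preference pair."""
    r_w = beta * (policy_chosen_logp - ref_chosen_logp)   # chosen reward
    r_l = beta * (policy_rejected_logp - ref_rejected_logp)  # rejected reward
    accuracy = 1.0 if r_w > r_l else 0.0  # did the policy rank the pair correctly?
    margin = r_w - r_l
    return r_w, r_l, accuracy, margin
```

Averaged over a batch, the accuracy tracks how often the policy prefers the chosen response more strongly than the reference does, and the margin tracks how far the two rewards have separated.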
