Principle:Alibaba ROLL DPO Worker Initialization

From Leeroopedia


Knowledge Sources
Domains: Distributed_Systems, Alignment
Last Updated: 2026-02-07 20:00 GMT

Overview

A distributed initialization principle for setting up the two-cluster architecture (trainable policy + frozen reference) required by DPO training.

Description

DPO requires two model instances: a trainable policy model and a frozen reference model. Both are deployed as distributed clusters, potentially spanning multiple GPUs with tensor/pipeline parallelism. The initialization phase creates both clusters, loads model weights, and ensures the reference model remains frozen during training.

Usage

Use during the initialization phase of DPO training to set up the actor_train and reference clusters.
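
The initialization order described above (create both clusters, load weights, keep the reference frozen) can be sketched in plain Python. This is a minimal illustration, not the ROLL API: `ClusterSpec`, `Cluster`, `init_dpo_clusters`, the checkpoint path, and the device lists are all hypothetical names chosen for this example. Only the cluster names `actor_train` and `reference` come from the text. The sketch also assumes both clusters start from the same checkpoint, which is common in DPO but not stated here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClusterSpec:
    """Hypothetical description of one distributed cluster (not the ROLL API)."""
    name: str
    checkpoint: str
    device_ids: List[int]
    tensor_parallel: int = 1
    pipeline_parallel: int = 1
    trainable: bool = True


@dataclass
class Cluster:
    """Hypothetical stand-in for a group of workers holding one model."""
    spec: ClusterSpec
    weights_loaded: bool = False

    def initialize(self) -> None:
        # The GPU count must be a multiple of the model-parallel group size.
        group = self.spec.tensor_parallel * self.spec.pipeline_parallel
        if len(self.spec.device_ids) % group != 0:
            raise ValueError(
                f"{self.spec.name}: {len(self.spec.device_ids)} GPUs "
                f"not divisible by TP*PP={group}"
            )
        # Placeholder for real checkpoint loading on every worker.
        self.weights_loaded = True

    @property
    def data_parallel(self) -> int:
        # Number of model replicas left after tensor/pipeline sharding.
        group = self.spec.tensor_parallel * self.spec.pipeline_parallel
        return len(self.spec.device_ids) // group


def init_dpo_clusters(checkpoint: str, train_devices: List[int],
                      ref_devices: List[int], tp: int = 1,
                      pp: int = 1) -> Dict[str, Cluster]:
    """Create the trainable policy cluster and the frozen reference cluster."""
    specs = [
        ClusterSpec("actor_train", checkpoint, train_devices, tp, pp, trainable=True),
        # The reference never receives gradient updates.
        ClusterSpec("reference", checkpoint, ref_devices, tp, pp, trainable=False),
    ]
    clusters: Dict[str, Cluster] = {}
    for spec in specs:
        cluster = Cluster(spec)
        cluster.initialize()
        clusters[spec.name] = cluster
    return clusters


# Illustrative layout: 4 GPUs for the policy, 2 for the reference, TP=2.
clusters = init_dpo_clusters("path/to/checkpoint", [0, 1, 2, 3], [4, 5], tp=2)
```

Keeping the frozen flag in the cluster specification, rather than toggling it later, makes it harder to attach an optimizer to the reference model by accident.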

Theoretical Basis

DPO compares policy and reference log probabilities through the log-ratio $\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)$. Evaluating both terms requires two separate model instances.
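
To make the comparison concrete, the following minimal sketch computes the policy-versus-reference log-ratio and the standard DPO loss built from it for one preference pair. The function names and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be per-sequence log probabilities already summed over tokens. The reference values are plain numbers here, reflecting that the frozen model only supplies constants to the loss.

```python
import math


def log_ratio(policy_logp: float, ref_logp: float) -> float:
    """log pi_theta(y|x) - log pi_ref(y|x) for one response."""
    return policy_logp - ref_logp


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair."""
    margin = (log_ratio(policy_chosen, ref_chosen)
              - log_ratio(policy_rejected, ref_rejected))
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# When the policy has moved toward the chosen response, the loss drops below log(2).
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Because all four log probabilities are needed for every pair, both models must run a forward pass on the same batch, which is why the reference cluster has to be available from the start of training.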

Related Pages

Implemented By

Related Heuristics

The following heuristics inform this principle:
