Principle: Alibaba ROLL DPO Worker Initialization
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, Alignment |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A distributed initialization principle for setting up the two-cluster architecture (trainable policy + frozen reference) required by DPO training.
Description
DPO requires two model instances: a trainable policy model and a frozen reference model. Both are deployed as distributed clusters, potentially spanning multiple GPUs with tensor/pipeline parallelism. The initialization phase creates both clusters, loads model weights, and ensures the reference model remains frozen during training.
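The setup above can be sketched in plain PyTorch. This is a minimal single-process illustration, not the ROLL framework's actual cluster API: the helper name `init_dpo_models` and the use of a toy `nn.Linear` in place of a distributed language-model cluster are assumptions for clarity.

```python
import torch
from torch import nn

def init_dpo_models(make_model):
    """Create a trainable policy and a frozen reference with identical weights.

    In a distributed setting each of these would be a cluster of shards
    (tensor/pipeline parallel); the freezing logic is the same per shard.
    """
    policy = make_model()
    reference = make_model()
    # Both instances start from the same checkpoint.
    reference.load_state_dict(policy.state_dict())
    # The reference must never receive gradient updates.
    reference.eval()
    for p in reference.parameters():
        p.requires_grad_(False)
    return policy, reference

# Toy stand-in for a real model cluster.
policy, reference = init_dpo_models(lambda: nn.Linear(4, 2))
```

Freezing via `requires_grad_(False)` (rather than simply omitting the reference from the optimizer) also lets autograd skip building graphs through the reference, saving memory during the forward pass.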
Usage
Use during the initialization phase of DPO training to set up the actor_train and reference clusters.
Theoretical Basis
DPO compares policy and reference log probabilities on preferred and dispreferred responses:

$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$

Evaluating this loss requires two separate model instances: the trainable policy $\pi_\theta$ and the frozen reference $\pi_{\text{ref}}$.
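A minimal sketch of the per-batch loss computation, assuming the summed per-response log probabilities have already been gathered from both model instances. The helper name `dpo_loss` and the example tensor values are illustrative, not part of the ROLL codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss from per-response summed log probabilities."""
    # Implicit rewards are beta-scaled policy/reference log-ratios.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(logits), averaged over the batch.
    return -F.logsigmoid(logits).mean()

# Example: the policy favors the chosen response relative to the reference,
# so the loss falls below log(2) (the value at initialization).
loss = dpo_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
                torch.tensor([-1.5]), torch.tensor([-1.5]))
```

Note that the reference log probabilities enter the loss only as constants; no gradient flows through the frozen cluster, which is why it can be served purely for inference.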
Related Pages
Implemented By
Related Heuristics
The following heuristics inform this principle: