Implementation:NVIDIA NeMo Aligner DPOTrainer Fit
| Implementation Details | |
|---|---|
| Name | DPOTrainer_Fit |
| Type | API Doc |
| Implements Principle | DPO_Training |
| Module | nemo_aligner.algorithms |
| Repository | NeMo Aligner |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool, provided by the NeMo Aligner algorithms module, for executing the DPO training loop used in preference-based alignment.
Description
The DPOTrainer class implements the DPO training loop: iterating over epochs, computing DPO loss (comparing chosen vs rejected log-probability ratios against the reference policy), running validation, and saving checkpoints. It supports multiple loss variants (DPO, IPO, RPO) through the model's configurable loss function. The MegatronGPTDPOModel handles the actual loss computation, reference policy management, and distributed forward-backward passes.
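The core comparison described above can be sketched in plain Python. This is a minimal, illustrative version of the DPO loss on per-example summed log-probabilities, not the NeMo Aligner implementation; all names and the default beta are assumptions for illustration.

```python
import math

def dpo_loss(pi_logps_chosen, pi_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Illustrative DPO loss: -log sigmoid(beta * margin of log-prob ratios).

    Each argument is a list of per-example summed response log-probabilities.
    Names and beta default are illustrative, not the NeMo Aligner API.
    """
    losses, accs = [], []
    for pc, pr, rc, rr in zip(pi_logps_chosen, pi_logps_rejected,
                              ref_logps_chosen, ref_logps_rejected):
        chosen_ratio = pc - rc     # log(pi/ref) on the chosen response
        rejected_ratio = pr - rr   # log(pi/ref) on the rejected response
        margin = beta * (chosen_ratio - rejected_ratio)
        losses.append(-math.log(1.0 / (1.0 + math.exp(-margin))))
        # Implicit reward is beta * log-ratio; accuracy counts pairs where
        # the chosen response earns the higher implicit reward.
        accs.append(1.0 if chosen_ratio > rejected_ratio else 0.0)
    n = len(losses)
    return sum(losses) / n, sum(accs) / n
```

IPO and RPO swap in different loss functions over the same chosen/rejected log-ratio quantities, which is why the trainer can treat the variant as a model-level configuration.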
Usage
Used in train_gpt_dpo.py for DPO, IPO, and RPO training. The model handles reference policy log-probability computation internally.
Code Reference
Source Location
- Repository: NeMo Aligner
- Files:
  - nemo_aligner/algorithms/dpo.py (L119-388: DPOTrainer)
  - nemo_aligner/models/nlp/gpt/megatron_gpt_dpo_model.py (L47-606: MegatronGPTDPOModel)
Signature
class DPOTrainer:
    def __init__(
        self,
        cfg: DictConfig,
        model,  # MegatronGPTDPOModel
        optimizer,
        scheduler,
        train_dataloader,
        val_dataloader,
        test_dataloader,
        collate_fn: DistributedCollateFunction,
        logger,
        ckpt_callback,
        run_timer,
    ):
        ...

    def fit(self) -> None:
        """Main DPO training loop."""

    def run_validation(self) -> Tuple[float, Dict]:
        """Validation with preference accuracy metrics."""
Import
from nemo_aligner.algorithms.dpo import DPOTrainer
from nemo_aligner.models.nlp.gpt.megatron_gpt_dpo_model import MegatronGPTDPOModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictConfig | Yes | DPO config: max_epochs, val_check_interval, save_interval |
| model | MegatronGPTDPOModel | Yes | DPO model with reference policy |
| collate_fn | DistributedCollateFunction | Yes | dpo_custom_collate or DPOPackedDataset.global_collate_fn |
| train_dataloader | DataLoader | Yes | Preference pair DataLoader |
| val_dataloader | DataLoader | Yes | Validation DataLoader |
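The collate_fn's job is to turn a list of preference examples into one padded batch holding both responses of each pair. The sketch below is a toy stand-in to show the shape of that contract; the field names and padding scheme are assumptions, not the actual dpo_custom_collate signature.

```python
def toy_preference_collate(examples, eos_id=0):
    """Toy collate sketch: pad chosen and rejected token lists to a common
    length and stack them into one batch dict. Field names are illustrative,
    not the dpo_custom_collate contract."""
    max_len = max(len(toks) for ex in examples
                  for toks in (ex["chosen"], ex["rejected"]))

    def pad(toks):
        # Pad with eos_id up to the batch-wide maximum length.
        return toks + [eos_id] * (max_len - len(toks))

    return {
        "chosen_tokens": [pad(ex["chosen"]) for ex in examples],
        "rejected_tokens": [pad(ex["rejected"]) for ex in examples],
        "lengths": [(len(ex["chosen"]), len(ex["rejected"]))
                    for ex in examples],
    }
```

Keeping both responses in one batch lets the model compute policy and reference log-probabilities for a pair in the same forward pass.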
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Model weights updated in place; checkpoints saved at save_interval |
| metrics | Dict | Per-step: loss, sft_loss, preference_loss, acc, rewards_chosen_mean, rewards_rejected_mean |
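The per-step metrics above are emitted from a loop driven by the cfg fields (max_epochs, val_check_interval, save_interval). A hedged sketch of that control flow, with dummy callables standing in for the real training, validation, and checkpoint machinery:

```python
def run_training_loop(cfg, train_batches, train_step, validate, save_ckpt):
    """Illustrative fit()-style control flow: iterate epochs, run periodic
    validation and checkpointing. A sketch, not the actual DPOTrainer code."""
    step = 0
    history = []
    for epoch in range(cfg["max_epochs"]):
        for batch in train_batches:
            metrics = train_step(batch)  # loss, acc, reward stats per step
            step += 1
            if step % cfg["val_check_interval"] == 0:
                metrics["val_loss"] = validate()
            if step % cfg["save_interval"] == 0:
                save_ckpt(step)
            history.append((step, metrics))
    return history
```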
Usage Examples
from functools import partial

from nemo_aligner.algorithms.dpo import DPOTrainer, dpo_custom_collate
from nemo_aligner.models.nlp.gpt.megatron_gpt_dpo_model import MegatronGPTDPOModel

# model_cfg, trainer, path, optimizer, scheduler, the dataloaders, logger,
# ckpt_callback, and timer are set up earlier, as in train_gpt_dpo.py.
model = load_from_nemo(MegatronGPTDPOModel, model_cfg, trainer, restore_path=path)

dpo_trainer = DPOTrainer(
    cfg=cfg.trainer.dpo,
    model=model,
    optimizer=optimizer,
    scheduler=scheduler,
    train_dataloader=train_dl,
    val_dataloader=val_dl,
    test_dataloader=test_dl,
    collate_fn=partial(dpo_custom_collate, eos_id=model.tokenizer.eos_id),
    logger=logger,
    ckpt_callback=ckpt_callback,
    run_timer=timer,
)
dpo_trainer.fit()
Related Pages
- Principle:NVIDIA_NeMo_Aligner_DPO_Training
- Environment:NVIDIA_NeMo_Aligner_NeMo_Framework_GPU_Environment
- Heuristic:NVIDIA_NeMo_Aligner_Higher_Stability_Log_Probs
- Heuristic:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing_Tips