
Implementation:NVIDIA NeMo Aligner DPOTrainer Fit

From Leeroopedia


Implementation Details
Name DPOTrainer_Fit
Type API Doc
Implements Principle DPO_Training
Module nemo_aligner.algorithms
Repository NeMo Aligner
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete implementation of the DPO training loop for preference-based alignment, provided by the NeMo Aligner algorithms module.

Description

The DPOTrainer class implements the DPO training loop: iterating over epochs, computing DPO loss (comparing chosen vs rejected log-probability ratios against the reference policy), running validation, and saving checkpoints. It supports multiple loss variants (DPO, IPO, RPO) through the model's configurable loss function. The MegatronGPTDPOModel handles the actual loss computation, reference policy management, and distributed forward-backward passes.
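The DPO objective described above can be sketched in a few lines. This is an illustrative, self-contained version, not the NeMo Aligner implementation: it assumes per-sequence summed log-probabilities are already available, and the function name and signature are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Illustrative per-example DPO loss (not the NeMo Aligner API).

    Each argument is a summed sequence log-probability. The loss pushes the
    policy to prefer the chosen response more strongly than the frozen
    reference policy does, with beta scaling the implicit rewards.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written stably as log1p(exp(-margin))
    loss = math.log1p(math.exp(-margin))
    return loss, chosen_reward, rejected_reward
```

A zero margin over the reference policy gives a loss of log 2; as the policy's preference margin grows, the loss falls toward zero.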

Usage

Used in train_gpt_dpo.py for DPO, IPO, and RPO training. The model handles reference policy log-probability computation internally.
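The overall control flow of fit() can be sketched schematically. This skeleton is an assumption based on the config keys documented below (max_epochs, val_check_interval, save_interval), not the actual NeMo Aligner source; all names here are stand-ins.

```python
def fit_sketch(cfg, steps_per_epoch, train_step, run_validation, save_checkpoint):
    """Illustrative control flow of a DPO-style fit() loop (simplified sketch).

    cfg is a dict with max_epochs, val_check_interval, and save_interval;
    the callables stand in for the trainer's real methods.
    """
    global_step = 0
    for epoch in range(cfg["max_epochs"]):
        for _ in range(steps_per_epoch):
            metrics = train_step()          # forward/backward + optimizer step
            global_step += 1
            if global_step % cfg["val_check_interval"] == 0:
                run_validation()            # preference-accuracy metrics
            if global_step % cfg["save_interval"] == 0:
                save_checkpoint(global_step)
    return global_step
```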

Code Reference

Source Location

  • Repository: NeMo Aligner
  • File: nemo_aligner/algorithms/dpo.py (L119-388 DPOTrainer), nemo_aligner/models/nlp/gpt/megatron_gpt_dpo_model.py (L47-606 MegatronGPTDPOModel)

Signature

class DPOTrainer:
    def __init__(
        self,
        cfg: DictConfig,
        model,                              # MegatronGPTDPOModel
        optimizer,
        scheduler,
        train_dataloader,
        val_dataloader,
        test_dataloader,
        collate_fn: DistributedCollateFunction,
        logger,
        ckpt_callback,
        run_timer,
    ):
        ...

    def fit(self) -> None:
        """Main DPO training loop."""

    def run_validation(self) -> Tuple[float, Dict]:
        """Validation with preference accuracy metrics."""

Import

from nemo_aligner.algorithms.dpo import DPOTrainer
from nemo_aligner.models.nlp.gpt.megatron_gpt_dpo_model import MegatronGPTDPOModel

I/O Contract

Inputs

Name Type Required Description
cfg DictConfig Yes DPO config: max_epochs, val_check_interval, save_interval
model MegatronGPTDPOModel Yes DPO model with reference policy
collate_fn DistributedCollateFunction Yes dpo_custom_collate or DPOPackedDataset.global_collate_fn
train_dataloader DataLoader Yes Preference pair DataLoader
val_dataloader DataLoader Yes Validation DataLoader

Outputs

Name Type Description
(side effect) None Updated model weights, checkpoints
metrics Dict Per-step: loss, sft_loss, preference_loss, acc, rewards_chosen_mean, rewards_rejected_mean
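The acc metric above is conventionally the fraction of preference pairs where the chosen response's implicit reward exceeds the rejected one's. A minimal sketch of that computation (illustrative, not the library's code):

```python
def preference_accuracy(rewards_chosen, rewards_rejected):
    """Fraction of pairs where the chosen response's implicit reward beats
    the rejected one's (illustrative metric, not NeMo Aligner code)."""
    assert len(rewards_chosen) == len(rewards_rejected) and rewards_chosen
    wins = sum(c > r for c, r in zip(rewards_chosen, rewards_rejected))
    return wins / len(rewards_chosen)
```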

Usage Examples

from nemo_aligner.algorithms.dpo import DPOTrainer, dpo_custom_collate
from nemo_aligner.models.nlp.gpt.megatron_gpt_dpo_model import MegatronGPTDPOModel

model = load_from_nemo(MegatronGPTDPOModel, model_cfg, trainer, restore_path=path)

dpo_trainer = DPOTrainer(
    cfg=cfg.trainer.dpo,
    model=model,
    optimizer=optimizer,
    scheduler=scheduler,
    train_dataloader=train_dl,
    val_dataloader=val_dl,
    test_dataloader=test_dl,
    collate_fn=partial(dpo_custom_collate, eos_id=model.tokenizer.eos_id),
    logger=logger,
    ckpt_callback=ckpt_callback,
    run_timer=timer,
)
dpo_trainer.fit()

Related Pages

Knowledge Sources

NLP, Alignment
