Implementation:NVIDIA NeMo Aligner DPOTrainer Fit
| Implementation Details | |
|---|---|
| Name | DPOTrainer_Fit |
| Type | API Doc |
| Implements Principle | DPO_Training |
| Module | nemo_aligner.algorithms |
| Repository | NeMo Aligner |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool, provided by the NeMo Aligner algorithms module, for executing the DPO training loop used in preference-based alignment.
Description
The DPOTrainer class implements the DPO training loop: iterating over epochs, computing DPO loss (comparing chosen vs rejected log-probability ratios against the reference policy), running validation, and saving checkpoints. It supports multiple loss variants (DPO, IPO, RPO) through the model's configurable loss function. The MegatronGPTDPOModel handles the actual loss computation, reference policy management, and distributed forward-backward passes.
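The core comparison described above can be sketched in plain Python. This is a minimal, illustrative version of the DPO loss on per-example summed log-probabilities, not the NeMo Aligner implementation; all names and the default beta are assumptions for illustration.

```python
import math

def dpo_loss(pi_logps_chosen, pi_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Illustrative DPO loss: -log sigmoid(beta * margin of log-prob ratios).

    Each argument is a list of per-example summed response log-probabilities.
    Names and beta default are illustrative, not the NeMo Aligner API.
    """
    losses, accs = [], []
    for pc, pr, rc, rr in zip(pi_logps_chosen, pi_logps_rejected,
                              ref_logps_chosen, ref_logps_rejected):
        chosen_ratio = pc - rc     # log(pi/ref) on the chosen response
        rejected_ratio = pr - rr   # log(pi/ref) on the rejected response
        margin = beta * (chosen_ratio - rejected_ratio)
        losses.append(-math.log(1.0 / (1.0 + math.exp(-margin))))
        # Implicit reward is beta * log-ratio; accuracy counts pairs where
        # the chosen response earns the higher implicit reward.
        accs.append(1.0 if chosen_ratio > rejected_ratio else 0.0)
    n = len(losses)
    return sum(losses) / n, sum(accs) / n
```

IPO and RPO swap in different loss functions over the same chosen/rejected log-ratio quantities, which is why the trainer can treat the variant as a model-level configuration.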
Usage
Used in train_gpt_dpo.py for DPO, IPO, and RPO training. The model handles reference policy log-probability computation internally.
Code Reference
Source Location
- Repository: NeMo Aligner
- Files:
  - nemo_aligner/algorithms/dpo.py (L119-388: DPOTrainer)
  - nemo_aligner/models/nlp/gpt/megatron_gpt_dpo_model.py (L47-606: MegatronGPTDPOModel)
Signature
class DPOTrainer:
    def __init__(
        self,
        cfg: DictConfig,
        model,  # MegatronGPTDPOModel
        optimizer,
        scheduler,
        train_dataloader,
        val_dataloader,
        test_dataloader,
        collate_fn: DistributedCollateFunction,
        logger,
        ckpt_callback,
        run_timer,
    ):
        ...

    def fit(self) -> None:
        """Main DPO training loop."""

    def run_validation(self) -> Tuple[float, Dict]:
        """Validation with preference accuracy metrics."""
Import
from nemo_aligner.algorithms.dpo import DPOTrainer
from nemo_aligner.models.nlp.gpt.megatron_gpt_dpo_model import MegatronGPTDPOModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictConfig | Yes | DPO config: max_epochs, val_check_interval, save_interval |
| model | MegatronGPTDPOModel | Yes | DPO model with reference policy |
| collate_fn | DistributedCollateFunction | Yes | dpo_custom_collate or DPOPackedDataset.global_collate_fn |
| train_dataloader | DataLoader | Yes | Preference pair DataLoader |
| val_dataloader | DataLoader | Yes | Validation DataLoader |
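The collate_fn's job is to turn a list of preference examples into one padded batch holding both responses of each pair. The sketch below is a toy stand-in to show the shape of that contract; the field names and padding scheme are assumptions, not the actual dpo_custom_collate signature.

```python
def toy_preference_collate(examples, eos_id=0):
    """Toy collate sketch: pad chosen and rejected token lists to a common
    length and stack them into one batch dict. Field names are illustrative,
    not the dpo_custom_collate contract."""
    max_len = max(len(toks) for ex in examples
                  for toks in (ex["chosen"], ex["rejected"]))

    def pad(toks):
        # Pad with eos_id up to the batch-wide maximum length.
        return toks + [eos_id] * (max_len - len(toks))

    return {
        "chosen_tokens": [pad(ex["chosen"]) for ex in examples],
        "rejected_tokens": [pad(ex["rejected"]) for ex in examples],
        "lengths": [(len(ex["chosen"]), len(ex["rejected"]))
                    for ex in examples],
    }
```

Keeping both responses in one batch lets the model compute policy and reference log-probabilities for a pair in the same forward pass.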
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Model weights updated in place; checkpoints saved at save_interval |
| metrics | Dict | Per-step: loss, sft_loss, preference_loss, acc, rewards_chosen_mean, rewards_rejected_mean |
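The per-step metrics above are emitted from a loop driven by the cfg fields (max_epochs, val_check_interval, save_interval). A hedged sketch of that control flow, with dummy callables standing in for the real training, validation, and checkpoint machinery:

```python
def run_training_loop(cfg, train_batches, train_step, validate, save_ckpt):
    """Illustrative fit()-style control flow: iterate epochs, run periodic
    validation and checkpointing. A sketch, not the actual DPOTrainer code."""
    step = 0
    history = []
    for epoch in range(cfg["max_epochs"]):
        for batch in train_batches:
            metrics = train_step(batch)  # loss, acc, reward stats per step
            step += 1
            if step % cfg["val_check_interval"] == 0:
                metrics["val_loss"] = validate()
            if step % cfg["save_interval"] == 0:
                save_ckpt(step)
            history.append((step, metrics))
    return history
```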
Usage Examples
from functools import partial

from nemo_aligner.algorithms.dpo import DPOTrainer, dpo_custom_collate
from nemo_aligner.models.nlp.gpt.megatron_gpt_dpo_model import MegatronGPTDPOModel

# model_cfg, trainer, path, optimizer, scheduler, the dataloaders, logger,
# ckpt_callback, and timer are set up earlier, as in train_gpt_dpo.py.
model = load_from_nemo(MegatronGPTDPOModel, model_cfg, trainer, restore_path=path)

dpo_trainer = DPOTrainer(
    cfg=cfg.trainer.dpo,
    model=model,
    optimizer=optimizer,
    scheduler=scheduler,
    train_dataloader=train_dl,
    val_dataloader=val_dl,
    test_dataloader=test_dl,
    collate_fn=partial(dpo_custom_collate, eos_id=model.tokenizer.eos_id),
    logger=logger,
    ckpt_callback=ckpt_callback,
    run_timer=timer,
)
dpo_trainer.fit()
Related Pages
- Principle:NVIDIA_NeMo_Aligner_DPO_Training
- Environment:NVIDIA_NeMo_Aligner_NeMo_Framework_GPU_Environment
- Heuristic:NVIDIA_NeMo_Aligner_Higher_Stability_Log_Probs
- Heuristic:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing_Tips