Implementation:NVIDIA NeMo Aligner RM Get Loss And Metrics
| Implementation Details | |
|---|---|
| Name | RM_Get_Loss_And_Metrics |
| Type | API Doc |
| Implements | Reward_Model_Validation |
| Repository | NeMo Aligner |
| Primary File | nemo_aligner/models/nlp/gpt/megatron_gpt_reward_model.py |
| Domains | NLP, Evaluation |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool, provided by the MegatronGPTRewardModel class, for computing reward model validation metrics, including ranking loss and accuracy.
Description
The get_loss_and_metrics method of MegatronGPTRewardModel runs the forward-backward function on preference pairs, gathers chosen and rejected rewards across distributed ranks, computes the Bradley-Terry ranking loss, and returns comprehensive validation metrics. The method handles pipeline parallelism by broadcasting results from the last pipeline stage. Metrics include loss, ranking accuracy, mean chosen/rejected rewards, and reward distribution statistics.
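For reference, the Bradley-Terry pairwise objective reduces to a logistic loss on the reward margin. The following standalone sketch (illustrative only, not the NeMo Aligner source; tensor names are assumptions) shows the per-microbatch computation performed on the gathered rewards:
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor):
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).

    chosen_rewards / rejected_rewards: shape [microbatch_size] scalar rewards.
    """
    margin = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margin).mean()
    # Ranking accuracy: fraction of pairs where the chosen response scores higher
    acc = (margin > 0).float().mean()
    return loss, acc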
Usage
Called by SupervisedTrainer during validation steps with forward_only=True, and during training steps with forward_only=False to accumulate gradients for backpropagation.
Code Reference
Source Location
- Repository: NeMo Aligner
- File: nemo_aligner/models/nlp/gpt/megatron_gpt_reward_model.py
- Lines: 249-322
Signature
class MegatronGPTRewardModel(MegatronGPTModel, SupervisedInterface, Inferrable):
    def get_loss_and_metrics(
        self,
        batch: dict,
        forward_only: bool = True,
    ) -> Tuple[torch.Tensor, Dict[str, float]]:
        """Compute reward model loss and validation metrics.

        Args:
            batch: Dict with 'chosen', 'rejected', 'chosen_length', 'rejected_length'
            forward_only: If True, no gradient computation (validation mode)

        Returns:
            loss_mean: Scalar loss averaged across microbatches
            metrics: Dict with 'loss', 'acc', 'rewards_chosen_mean',
                'rewards_rejected_mean', 'rewards_all_mean', 'rewards_all_std'
        """
Import
from nemo_aligner.models.nlp.gpt.megatron_gpt_reward_model import MegatronGPTRewardModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| batch | dict | Yes | Contains chosen, rejected (token tensors), chosen_length, rejected_length |
| forward_only | bool | No | Validation mode (no gradients) when True. Default True |
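A minimal sketch of a conforming batch, assuming token IDs are padded LongTensors of shape [microbatch_size, seq_len] and the length entries are per-sample valid-token counts (shapes and the vocabulary size are illustrative assumptions, not taken from the source):
import torch

microbatch_size, seq_len = 4, 512
val_batch = {
    "chosen": torch.randint(0, 32000, (microbatch_size, seq_len)),    # preferred response tokens
    "rejected": torch.randint(0, 32000, (microbatch_size, seq_len)),  # dispreferred response tokens
    "chosen_length": torch.full((microbatch_size,), seq_len),         # valid-token counts per sample
    "rejected_length": torch.full((microbatch_size,), seq_len),
}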
Outputs
| Name | Type | Description |
|---|---|---|
| loss_mean | torch.Tensor | Scalar ranking loss |
| metrics | Dict[str, float] | loss, acc, rewards_chosen_mean, rewards_rejected_mean, rewards_all_mean, rewards_all_std |
Usage Examples
# Called internally by SupervisedTrainer.validation_step
model.prepare_for_validation_step()
loss_mean, metrics = model.get_loss_and_metrics(batch=val_batch, forward_only=True)
model.finish_validation_step()
# metrics contains:
# {"loss": 0.35, "acc": 0.72, "rewards_chosen_mean": 1.2,
# "rewards_rejected_mean": -0.3, "rewards_all_mean": 0.45, "rewards_all_std": 0.8}
Related Pages
- Principle:NVIDIA_NeMo_Aligner_Reward_Model_Validation
- Environment:NVIDIA_NeMo_Aligner_NeMo_Framework_GPU_Environment