Principle:NVIDIA NeMo Aligner Reward Model Data Preparation
| Principle Metadata | |
|---|---|
| Type | Principle |
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
| Related Implementation | Implementation:NVIDIA_NeMo_Aligner_Build_RM_Datasets |
Overview
The process of constructing pairwise comparison datasets from human preference judgments for training reward models.
Description
Reward model training requires paired comparison data where human annotators have ranked two or more responses to the same prompt. The data preparation step converts JSONL files containing chosen/rejected response pairs into tokenized tensors suitable for the Bradley-Terry ranking loss. Each example produces two tokenized sequences (chosen and rejected) padded to the same length, enabling efficient batched comparison during training. The dataset handles both conversation-format and plain-text inputs.
The core data pipeline performs the following steps:
- Reads JSONL files containing chosen and rejected fields
- Tokenizes both response variants with the full context (prompt + response)
- Pads sequences to equal length within each pair
- Returns dictionary batches with both response variants for pairwise comparison
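The steps above can be sketched as follows. This is a minimal illustration with hypothetical helper names (`load_pairs`, `build_example`) and a stand-in tokenizer; the actual NeMo Aligner dataset class differs in detail and uses the model's real tokenizer.

```python
import json


def load_pairs(path):
    """Read a JSONL file where each line has prompt/chosen/rejected fields."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def tokenize(text):
    # Stand-in tokenizer for illustration only; real pipelines use the
    # model's tokenizer so chosen and rejected are tokenized consistently.
    return [ord(c) for c in text]


def build_example(record, pad_id=0):
    """Tokenize prompt + response for both variants and pad to equal length."""
    chosen = tokenize(record["prompt"] + " " + record["chosen"])
    rejected = tokenize(record["prompt"] + " " + record["rejected"])
    max_len = max(len(chosen), len(rejected))
    chosen = chosen + [pad_id] * (max_len - len(chosen))
    rejected = rejected + [pad_id] * (max_len - len(rejected))
    # Dictionary batch with both response variants for pairwise comparison.
    return {"chosen": chosen, "rejected": rejected}
```

Because both sequences in a pair share the same length after padding, they can be stacked into a single tensor during batching.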
Usage
Use when preparing data for reward model training in RLHF pipelines. The input format is JSONL with chosen/rejected fields; the output dataset returns dictionary batches containing both response variants for pairwise comparison.
Input format example:
```json
{
  "prompt": "Explain quantum computing in simple terms.",
  "chosen": "Quantum computing uses qubits that can exist in multiple states...",
  "rejected": "Quantum computing is a type of computing that is very fast..."
}
```
Key considerations:
- Ensure consistent tokenization between chosen and rejected sequences
- Padding to equal length within each pair is critical for batched comparison
- Both conversation-format (multi-turn) and plain-text inputs are supported
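One common way to satisfy the padding requirement at batch time is a collate function that pads every chosen and rejected sequence to a shared length, so the whole batch stacks into rectangular tensors. A minimal sketch (the `collate` name and list-of-lists output are assumptions for illustration; a real implementation would emit framework tensors):

```python
def collate(pairs, pad_id=0):
    """Pad all chosen/rejected token sequences in a batch to one length.

    Each element of `pairs` is a dict with token lists under
    "chosen" and "rejected".
    """
    max_len = max(
        len(seq) for pair in pairs for seq in (pair["chosen"], pair["rejected"])
    )

    def pad(seq):
        return seq + [pad_id] * (max_len - len(seq))

    # Dictionary batch: parallel lists of equal-length sequences.
    return {
        "chosen": [pad(p["chosen"]) for p in pairs],
        "rejected": [pad(p["rejected"]) for p in pairs],
    }
```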
Theoretical Basis
Human preference data follows the Bradley-Terry model assumption:
P(chosen > rejected) = σ(r(chosen) - r(rejected))
where σ is the logistic sigmoid and r(·) is the scalar reward the model assigns to a full prompt + response sequence.
The dataset must present both alternatives together so the model can compute comparative rewards. Tokenization preserves the full context (prompt + response) for each alternative. This formulation ensures that the reward model learns a relative ordering of responses rather than absolute quality scores.
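The training objective implied by this model is the negative log-likelihood of the human preference, which is why the dataset must deliver both alternatives together. A worked sketch of the per-pair loss (pure-Python for clarity; in practice this runs on batched tensors):

```python
import math


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigma(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    sigma = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigma)


# Equal rewards give P = 0.5, so the loss is log(2).
# Widening the margin in favor of the chosen response lowers the loss,
# which is how the model learns a relative ordering, not absolute scores.
```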