Principle: NVIDIA NeMo Aligner RLHF Prompt Data Preparation
| Principle Metadata | |
|---|---|
| Type | Principle |
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
| Related Implementation | Implementation:NVIDIA_NeMo_Aligner_Build_RLHF_Datasets |
Overview
The process of constructing prompt-only datasets for online reinforcement-learning alignment, where responses are generated by the model during training rather than read from the dataset.
Description
Unlike supervised methods (SFT, DPO) which train on pre-existing prompt-response pairs, RLHF methods (PPO, REINFORCE) require prompt-only datasets. The model generates responses online during training, which are then scored by the reward model.
This data preparation step:
- Tokenizes and pads prompts from JSONL files into the RLHFDataset format
- Produces prompt-only tensors with no response component
- Shares the same dataset builder and format between PPO and REINFORCE workflows
The dataset is intentionally simpler than SFT/DPO datasets since it only needs to provide prompts — responses are produced by the actor model during rollouts.
Input format example:

```json
{
  "prompt": "Write a short poem about the ocean."
}
```
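As a sketch of the tokenize-and-pad step, assuming a caller-supplied `tokenize` callable and a placeholder `pad_id` (the real `RLHFDataset` uses the model's tokenizer and NeMo's dataset machinery):

```python
import json

def load_prompts(jsonl_path):
    """Read prompt-only records from a JSONL file (one JSON object per line)."""
    with open(jsonl_path) as f:
        return [json.loads(line)["prompt"] for line in f if line.strip()]

def tokenize_and_pad(prompts, tokenize, pad_id, max_len):
    """Tokenize each prompt and right-pad to max_len, truncating longer ones.

    `tokenize` is any callable mapping text -> list of token ids; `pad_id`
    stands in for the model tokenizer's padding token. The output carries
    no response component -- responses come from rollouts during training.
    """
    batch = []
    for p in prompts:
        ids = tokenize(p)[:max_len]
        batch.append(ids + [pad_id] * (max_len - len(ids)))
    return batch
```

This is a minimal illustration of the format, not NeMo Aligner's implementation; the library handles batching, attention masks, and device placement on top of this.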
Usage
Use when preparing data for PPO or REINFORCE training. The input is JSONL with prompt text only (no responses). The same dataset builder and format are shared between PPO and REINFORCE workflows.
Key differences from other dataset types:
- SFT datasets — Contain prompt + response pairs for supervised training
- DPO datasets — Contain prompt + chosen/rejected pairs for direct preference optimization
- RLHF datasets — Contain prompts only; responses are generated by the policy during training
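The contrast between the three record schemas can be shown with illustrative records; the field names for the SFT and DPO examples are schematic, not NeMo Aligner's exact schema:

```python
# Schematic records for each dataset type (field names are illustrative).
sft_record = {"prompt": "Write a short poem about the ocean.",
              "response": "Waves whisper softly to the shore."}
dpo_record = {"prompt": "Write a short poem about the ocean.",
              "chosen": "Waves whisper softly to the shore.",
              "rejected": "The ocean is wet."}
rlhf_record = {"prompt": "Write a short poem about the ocean."}

def is_prompt_only(record):
    """True when a record carries a prompt and nothing else, as RLHF requires."""
    return set(record) == {"prompt"}
```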
Configuration:

```yaml
model:
  data:
    data_prefix:
      train: ["/path/to/prompts.jsonl"]
      validation: ["/path/to/val_prompts.jsonl"]
```
Theoretical Basis
In online RL for language models, the policy generates responses to prompts, and a reward signal guides optimization. The prompt dataset defines the task distribution — the set of instructions or queries that the model will learn to respond to.
The dataset builder is a partial application of build_train_valid_test_datasets specialized to RLHFDataset. This design enables code reuse across different dataset types while maintaining a clean separation between the prompt-only format required by RLHF and the paired formats used by supervised methods.
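The partial-application design can be sketched with `functools.partial`; the builder signature and the `RLHFDataset` stub below are illustrative stand-ins, not NeMo Aligner's actual definitions:

```python
from functools import partial

def build_train_valid_test_datasets(dataset_cls, data_prefix, seq_length):
    """Generic builder (schematic signature): construct one dataset per split."""
    return {split: dataset_cls(paths, seq_length)
            for split, paths in data_prefix.items()}

class RLHFDataset:
    """Prompt-only dataset placeholder standing in for the real class."""
    def __init__(self, paths, seq_length):
        self.paths, self.seq_length = paths, seq_length

# Partial application: a builder specialized to the prompt-only format,
# reusable as-is by both PPO and REINFORCE workflows.
build_rlhf_datasets = partial(build_train_valid_test_datasets, RLHFDataset)
```

Specializing via `partial` keeps one generic builder for all dataset types while each alignment method only sees a builder pre-bound to its own dataset class.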
Training Loop:
1. Sample prompt from RLHFDataset
2. Actor model generates response (rollout)
3. Reward model scores the response
4. Policy gradient updates the actor
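The four steps above can be sketched as one schematic function; `generate`, `score`, and `update` are stand-ins for the actor model, reward model, and policy-gradient optimizer, not real NeMo Aligner APIs:

```python
def rollout_step(prompt, generate, score, update):
    """One schematic online-RL step over a sampled prompt.

    `generate`: actor model producing a response (step 2, rollout).
    `score`: reward model scoring the prompt/response pair (step 3).
    `update`: policy-gradient update applied to the actor (step 4).
    """
    response = generate(prompt)       # actor rollout
    reward = score(prompt, response)  # reward model scoring
    update(prompt, response, reward)  # policy-gradient update
    return reward
```

In practice each step operates on batches of tokenized prompts from `RLHFDataset`, and PPO additionally involves a critic and a KL penalty against a frozen reference model; this sketch only shows the data flow.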