Principle: NVIDIA NeMo Aligner RLHF Prompt Data Preparation
| Principle Metadata | |
|---|---|
| Type | Principle |
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
| Related Implementation | Implementation:NVIDIA_NeMo_Aligner_Build_RLHF_Datasets |
Overview
The process of constructing prompt-only datasets for online reinforcement-learning alignment, where responses are generated by the model during training rather than read from the dataset.
Description
Unlike supervised methods (SFT, DPO) which train on pre-existing prompt-response pairs, RLHF methods (PPO, REINFORCE) require prompt-only datasets. The model generates responses online during training, which are then scored by the reward model.
This data preparation step:
- Tokenizes and pads prompts from JSONL files into the RLHFDataset format
- Produces prompt-only tensors with no response component
- Shares the same dataset builder and format between PPO and REINFORCE workflows
The dataset is intentionally simpler than SFT/DPO datasets since it only needs to provide prompts — responses are produced by the actor model during rollouts.
Input format example:

```json
{
  "prompt": "Write a short poem about the ocean."
}
```
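As a sketch of the tokenize-and-pad step, assuming a caller-supplied `tokenize` callable and a placeholder `pad_id` (the real `RLHFDataset` uses the model's tokenizer and NeMo's dataset machinery):

```python
import json

def load_prompts(jsonl_path):
    """Read prompt-only records from a JSONL file (one JSON object per line)."""
    with open(jsonl_path) as f:
        return [json.loads(line)["prompt"] for line in f if line.strip()]

def tokenize_and_pad(prompts, tokenize, pad_id, max_len):
    """Tokenize each prompt and right-pad to max_len, truncating longer ones.

    `tokenize` is any callable mapping text -> list of token ids; `pad_id`
    stands in for the model tokenizer's padding token. The output carries
    no response component -- responses come from rollouts during training.
    """
    batch = []
    for p in prompts:
        ids = tokenize(p)[:max_len]
        batch.append(ids + [pad_id] * (max_len - len(ids)))
    return batch
```

This is a minimal illustration of the format, not NeMo Aligner's implementation; the library handles batching, attention masks, and device placement on top of this.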
Usage
Use when preparing data for PPO or REINFORCE training. The input is JSONL with prompt text only (no responses). The same dataset builder and format are shared between PPO and REINFORCE workflows.
Key differences from other dataset types:
- SFT datasets — Contain prompt + response pairs for supervised training
- DPO datasets — Contain prompt + chosen/rejected pairs for direct preference optimization
- RLHF datasets — Contain prompts only; responses are generated by the policy during training
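The contrast between the three record schemas can be shown with illustrative records; the field names for the SFT and DPO examples are schematic, not NeMo Aligner's exact schema:

```python
# Schematic records for each dataset type (field names are illustrative).
sft_record = {"prompt": "Write a short poem about the ocean.",
              "response": "Waves whisper softly to the shore."}
dpo_record = {"prompt": "Write a short poem about the ocean.",
              "chosen": "Waves whisper softly to the shore.",
              "rejected": "The ocean is wet."}
rlhf_record = {"prompt": "Write a short poem about the ocean."}

def is_prompt_only(record):
    """True when a record carries a prompt and nothing else, as RLHF requires."""
    return set(record) == {"prompt"}
```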
Configuration:

```yaml
model:
  data:
    data_prefix:
      train: ["/path/to/prompts.jsonl"]
      validation: ["/path/to/val_prompts.jsonl"]
```
Theoretical Basis
In online RL for language models, the policy generates responses to prompts, and a reward signal guides optimization. The prompt dataset defines the task distribution — the set of instructions or queries that the model will learn to respond to.
The dataset builder is a partial application of build_train_valid_test_datasets specialized to RLHFDataset. This design enables code reuse across different dataset types while maintaining a clean separation between the prompt-only format required by RLHF and the paired formats used by supervised methods.
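The partial-application design can be sketched with `functools.partial`; the builder signature and the `RLHFDataset` stub below are illustrative stand-ins, not NeMo Aligner's actual definitions:

```python
from functools import partial

def build_train_valid_test_datasets(dataset_cls, data_prefix, seq_length):
    """Generic builder (schematic signature): construct one dataset per split."""
    return {split: dataset_cls(paths, seq_length)
            for split, paths in data_prefix.items()}

class RLHFDataset:
    """Prompt-only dataset placeholder standing in for the real class."""
    def __init__(self, paths, seq_length):
        self.paths, self.seq_length = paths, seq_length

# Partial application: a builder specialized to the prompt-only format,
# reusable as-is by both PPO and REINFORCE workflows.
build_rlhf_datasets = partial(build_train_valid_test_datasets, RLHFDataset)
```

Specializing via `partial` keeps one generic builder for all dataset types while each alignment method only sees a builder pre-bound to its own dataset class.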
Training Loop:
1. Sample prompt from RLHFDataset
2. Actor model generates response (rollout)
3. Reward model scores the response
4. Policy gradient updates the actor
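The four steps above can be sketched as one schematic function; `generate`, `score`, and `update` are stand-ins for the actor model, reward model, and policy-gradient optimizer, not real NeMo Aligner APIs:

```python
def rollout_step(prompt, generate, score, update):
    """One schematic online-RL step over a sampled prompt.

    `generate`: actor model producing a response (step 2, rollout).
    `score`: reward model scoring the prompt/response pair (step 3).
    `update`: policy-gradient update applied to the actor (step 4).
    """
    response = generate(prompt)       # actor rollout
    reward = score(prompt, response)  # reward model scoring
    update(prompt, response, reward)  # policy-gradient update
    return reward
```

In practice each step operates on batches of tokenized prompts from `RLHFDataset`, and PPO additionally involves a critic and a KL penalty against a frozen reference model; this sketch only shows the data flow.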