Principle:NVIDIA NeMo Aligner RLHF Prompt Data Preparation

From Leeroopedia


Principle Metadata
Type Principle
Domains NLP, Data_Engineering
Last Updated 2026-02-07 00:00 GMT
Related Implementation Implementation:NVIDIA_NeMo_Aligner_Build_RLHF_Datasets

Overview

The process of constructing prompt-only datasets for online reinforcement-learning alignment, in which responses are generated during training rather than taken from the data.

Description

Unlike supervised methods (SFT, DPO) which train on pre-existing prompt-response pairs, RLHF methods (PPO, REINFORCE) require prompt-only datasets. The model generates responses online during training, which are then scored by the reward model.

This data preparation step:

  • Tokenizes and pads prompts from JSONL files into the RLHFDataset format
  • Produces prompt-only tensors with no response component
  • Shares the same dataset builder and format between PPO and REINFORCE workflows

The dataset is intentionally simpler than SFT/DPO datasets since it only needs to provide prompts — responses are produced by the actor model during rollouts.

Input format example:

{
  "prompt": "Write a short poem about the ocean."
}
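The tokenize-and-pad step described above can be sketched as follows. This is a minimal illustration, not the NeMo Aligner implementation: it uses a toy whitespace tokenizer and hypothetical helper names (`load_prompts`, `tokenize_and_pad`) in place of the real model tokenizer and `RLHFDataset` machinery.

```python
import json

def load_prompts(path):
    """Read prompt-only records from a JSONL file."""
    with open(path) as f:
        return [json.loads(line)["prompt"] for line in f if line.strip()]

def tokenize_and_pad(prompts, vocab, max_len, pad_id=0):
    """Toy whitespace tokenizer with right-padding to a fixed length.
    A real pipeline would use the model's own tokenizer instead."""
    batch = []
    for text in prompts:
        # Assign a fresh id to each unseen token (illustrative only).
        ids = [vocab.setdefault(tok, len(vocab)) for tok in text.split()]
        # Truncate to max_len, then pad on the right with pad_id.
        ids = ids[:max_len] + [pad_id] * max(0, max_len - len(ids))
        batch.append(ids)
    return batch
```

The output is a batch of fixed-length prompt tensors with no response component, matching the prompt-only shape the page describes.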

Usage

Use when preparing data for PPO or REINFORCE training. The input is JSONL containing prompt text only (no responses); both workflows share the same dataset builder and format.

Key differences from other dataset types:

  • SFT datasets — Contain prompt + response pairs for supervised training
  • DPO datasets — Contain prompt + chosen/rejected pairs for direct preference optimization
  • RLHF datasets — Contain prompts only; responses are generated by the policy during training
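The three record shapes above can be illustrated side by side. The field names below are illustrative (exact keys vary by toolkit and configuration); only the prompt-only shape of the RLHF record is taken from this page.

```python
# Illustrative JSONL records for each dataset type; field names
# other than "prompt" are assumptions, not a fixed schema.
sft_record = {
    "prompt": "Write a short poem about the ocean.",
    "response": "Waves whisper softly to the shore...",
}
dpo_record = {
    "prompt": "Write a short poem about the ocean.",
    "chosen": "Waves whisper softly to the shore...",
    "rejected": "The ocean is wet.",
}
rlhf_record = {
    "prompt": "Write a short poem about the ocean.",  # prompt only, no response
}
```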

Configuration:

model:
  data:
    data_prefix:
      train: ["/path/to/prompts.jsonl"]
      validation: ["/path/to/val_prompts.jsonl"]

Theoretical Basis

In online RL for language models, the policy generates responses to prompts, and a reward signal guides optimization. The prompt dataset defines the task distribution — the set of instructions or queries that the model will learn to respond to.

The dataset builder is a partial application of build_train_valid_test_datasets specialized to RLHFDataset. This design enables code reuse across different dataset types while maintaining a clean separation between the prompt-only format required by RLHF and the paired formats used by supervised methods.
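The partial-application pattern described above can be sketched with `functools.partial`. The names `build_train_valid_test_datasets` and `RLHFDataset` come from this page, but the signatures below are simplified stand-ins, not the real NeMo Aligner APIs.

```python
from functools import partial

class RLHFDataset:
    """Stand-in for the prompt-only dataset class."""
    def __init__(self, paths):
        self.paths = paths

def build_train_valid_test_datasets(dataset_cls, data_prefix, splits):
    """Generic builder: construct one dataset of dataset_cls per split.
    Simplified sketch of the shared builder this page describes."""
    return {split: dataset_cls(paths) for split, paths in zip(splits, data_prefix)}

# Specialize the generic builder to the prompt-only dataset type,
# so PPO and REINFORCE workflows can call the same entry point.
build_rlhf_datasets = partial(build_train_valid_test_datasets, RLHFDataset)
```

The design benefit is that supervised and RLHF pipelines share one builder, with the dataset class as the only point of variation.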

Training Loop:
  1. Sample prompt from RLHFDataset
  2. Actor model generates response (rollout)
  3. Reward model scores the response
  4. Policy gradient updates the actor
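The four-step loop above can be expressed as a skeleton. The callables `generate`, `score`, and `update` stand in for the actor model, reward model, and policy-gradient optimizer; none of this is the NeMo Aligner training loop, only its shape.

```python
import random

def train_rlhf(prompts, generate, score, update, steps=4):
    """Skeleton of the online RLHF loop: sample, roll out, score, update.
    generate/score/update are caller-supplied stand-ins for the actor,
    reward model, and policy-gradient step."""
    rewards = []
    for _ in range(steps):
        prompt = random.choice(prompts)    # 1. sample prompt from the dataset
        response = generate(prompt)        # 2. actor rollout
        reward = score(prompt, response)   # 3. reward model scores the response
        update(prompt, response, reward)   # 4. policy-gradient update
        rewards.append(reward)
    return rewards
```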

Related Pages

Knowledge Sources
