Principle:Huggingface Trl PPO Prompt Dataset Preparation

Property	Value
Principle Name	PPO Prompt Dataset Preparation
Technology	Huggingface TRL
Category	Data Preprocessing
Workflow	PPO RLHF Training
Implementation	Implementation:Huggingface_Trl_PPO_Dataset_Tokenization

Overview

Description

Unlike supervised fine-tuning, which trains on complete input-output pairs, PPO RLHF training operates in an online setting where the model generates its own responses during training. The dataset therefore contains only prompts (queries), which are tokenized and fed to the policy model for response generation. The generated responses are then scored by the reward model to produce the training signal.

This prompt-only format is a fundamental difference from reward model training, where both chosen and rejected responses are provided. In PPO, the model must learn to generate high-reward responses through trial and error, guided by the reward signal.

Usage

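The prompt dataset is loaded, split into train and evaluation subsets, and tokenized without padding. The tokenizer is configured for left padding, which takes effect later at batch collation. The resulting datasets are passed to PPOTrainer as train_dataset and eval_dataset.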

Theoretical Basis

Left-Padded Tokenization

A critical detail for PPO training is that the tokenizer must use left-side padding (padding_side="left"). This is because:

The policy model generates text autoregressively from left to right.
With left padding, every prompt in a batch ends at the same position (the right edge), so generated tokens are appended directly after each prompt's last real token.
With right padding, shorter prompts would have pad tokens between the prompt and its generated continuation, so generation would start from different positions across the batch and condition on padding.

The tokenization step converts raw text prompts into token ID sequences without padding (padding is applied later during batch collation).
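
The effect of the padding side can be seen in a minimal pure-Python sketch. The helper name pad_batch, the pad ID 0, and the token IDs are made up for illustration; in practice the tokenizer or data collator performs this step.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int, side: str = "left"
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad variable-length token ID lists to a common length.

    Returns (input_ids, attention_mask). The mask is 1 for real tokens
    and 0 for padding.
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        pad_mask = [0] * len(pad)
        real_mask = [1] * len(seq)
        if side == "left":
            input_ids.append(pad + seq)
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(seq + pad)
            attention_mask.append(real_mask + pad_mask)
    return input_ids, attention_mask


# Two prompts of different lengths (hypothetical token IDs).
prompts = [[11, 12, 13, 14], [21, 22]]
left_ids, left_mask = pad_batch(prompts, pad_id=0, side="left")
right_ids, right_mask = pad_batch(prompts, pad_id=0, side="right")
```

With side="left", the last column of every row holds a real token, so a continuation can be appended at the same index for the whole batch. With side="right", the shorter prompt ends in pad tokens.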

Prompt-Only Format

The dataset must contain a text field (typically "prompt" or "query") that provides the input for response generation. All other columns are removed during the mapping step to produce a clean dataset with only the "input_ids" column.

The prompt is the complete context presented to the model before generation begins. For chat-style interactions, this would include the system message and user query. For instruction-following tasks, this would include the instruction prefix.
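
The mapping step can be sketched in plain Python as follows. The names to_prompt_only and toy_tokenize are hypothetical, and the toy tokenizer stands in for a real one. With the Hugging Face datasets library, the same effect is usually obtained with Dataset.map together with remove_columns=dataset.column_names.

```python
from typing import Any, Callable, Dict, List


def to_prompt_only(
    rows: List[Dict[str, Any]],
    tokenize: Callable[[str], List[int]],
    text_field: str = "prompt",
) -> List[Dict[str, List[int]]]:
    """Tokenize the text field and drop every other column.

    No padding is applied here; sequences keep their natural length.
    """
    return [{"input_ids": tokenize(row[text_field])} for row in rows]


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in for a real tokenizer: one ID per
    # whitespace-separated word, derived from its characters.
    return [sum(ord(ch) for ch in word) % 1000 for word in text.split()]


rows = [
    {"prompt": "Explain PPO briefly", "source": "demo", "id": 1},
    {"prompt": "Hello", "source": "demo", "id": 2},
]
prompt_dataset = to_prompt_only(rows, toy_tokenize)
```

Each resulting record has a single "input_ids" key, and the two records have different lengths because padding is deferred to batch collation.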

Train/Eval Splitting

The dataset is split into training and evaluation subsets:

Evaluation set: A small held-out portion (typically 100 samples) used for periodic generation quality checks during training.
Training set: The remaining samples used for online rollout generation and PPO optimization.
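
A minimal sketch of such a split is shown here with plain lists. Taking the evaluation samples from the end of the dataset is an assumption made for illustration, and split_train_eval is a hypothetical helper. With a datasets.Dataset, the same slicing could be expressed with Dataset.select over index ranges.

```python
from typing import List, Sequence, Tuple


def split_train_eval(
    samples: Sequence[str], eval_samples: int = 100
) -> Tuple[List[str], List[str]]:
    """Hold out the last `eval_samples` items for evaluation."""
    if not 0 < eval_samples < len(samples):
        raise ValueError("eval_samples must be between 1 and len(samples) - 1")
    cut = len(samples) - eval_samples
    return list(samples[:cut]), list(samples[cut:])


# Hypothetical prompt texts standing in for a real dataset.
prompts = [f"prompt {i}" for i in range(1000)]
train_prompts, eval_prompts = split_train_eval(prompts, eval_samples=100)
```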

The evaluation set serves a different purpose than in supervised learning. Rather than measuring loss on held-out data, evaluation in PPO involves:

Generating completions from evaluation prompts using near-greedy decoding (temperature close to 0).
Scoring completions with the reward model.
Logging generation examples for qualitative review.
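
To see why a temperature close to 0 behaves almost like greedy decoding, consider the temperature-scaled softmax over next-token logits. The sketch below uses made-up logits and a hypothetical helper name. It illustrates the sampling distribution only, not the trainer's generation code.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Return softmax(logits / temperature), computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
sampling_probs = softmax_with_temperature(logits, temperature=1.0)
near_greedy_probs = softmax_with_temperature(logits, temperature=0.01)
```

At temperature 1.0 the probability mass is spread across all three tokens. At 0.01 nearly all of it sits on the highest-logit token, so sampling almost always returns the argmax and evaluation generations stay comparable across checkpoints.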

Online vs Offline RL

PPO is an online RL algorithm, meaning the training data is generated by the current policy during training. This contrasts with offline methods (like DPO) where the training data is fixed. The prompt dataset serves as the seed for online data generation:

A batch of prompts is sampled from the dataset.
The policy model generates responses for each prompt.
The reward model scores the generated responses.
PPO uses these scores to update the policy.

This cycle repeats throughout training, so each update uses responses produced by the current policy rather than a fixed set of examples.
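
The data flow of this cycle can be sketched with stand-in components. Everything here is hypothetical: toy_generate and toy_score replace the policy and reward models, and the update callback is a placeholder where the actual PPO step (advantage estimation and the clipped objective) would run. The sketch only shows how the prompt dataset seeds each batch of rollouts.

```python
import random
from typing import Callable, List, Tuple

# (prompt_ids, response_ids, reward)
Rollout = Tuple[List[int], List[int], float]


def run_online_loop(
    prompts: List[List[int]],
    generate: Callable[[List[int]], List[int]],
    score: Callable[[List[int], List[int]], float],
    update: Callable[[List[Rollout]], None],
    num_steps: int,
    batch_size: int,
    seed: int = 0,
) -> None:
    """Sample prompts, generate responses, score them, and hand off rollouts."""
    rng = random.Random(seed)
    for _ in range(num_steps):
        batch = rng.sample(prompts, batch_size)  # 1. sample prompts
        rollouts: List[Rollout] = []
        for prompt_ids in batch:
            response_ids = generate(prompt_ids)       # 2. policy generates
            reward = score(prompt_ids, response_ids)  # 3. reward model scores
            rollouts.append((prompt_ids, response_ids, reward))
        update(rollouts)                              # 4. policy update (placeholder)


def toy_generate(prompt_ids: List[int]) -> List[int]:
    # Stand-in policy: echoes the last two prompt tokens, shifted by one.
    return [t + 1 for t in prompt_ids[-2:]]


def toy_score(prompt_ids: List[int], response_ids: List[int]) -> float:
    # Stand-in reward model: longer responses score higher.
    return float(len(response_ids))


collected: List[List[Rollout]] = []
prompt_dataset = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
run_online_loop(
    prompt_dataset, toy_generate, toy_score,
    update=collected.append, num_steps=3, batch_size=2,
)
```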
