Principle:Huggingface Trl PPO Prompt Dataset Preparation

From Leeroopedia
Revision as of 17:25, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Huggingface_Trl_PPO_Prompt_Dataset_Preparation.md)


Property | Value
Principle Name | PPO Prompt Dataset Preparation
Technology | Huggingface TRL
Category | Data Preprocessing
Workflow | PPO RLHF Training
Implementation | Implementation:Huggingface_Trl_PPO_Dataset_Tokenization

Overview

Description

Unlike supervised fine-tuning, which trains on complete input-output pairs, PPO RLHF training operates in an online setting where the model generates its own responses during training. The dataset therefore contains only prompts (queries), which are tokenized and fed to the policy model for response generation. The generated responses are then scored by the reward model to produce the training signal.

This prompt-only format is a fundamental difference from reward model training, where both chosen and rejected responses are provided. In PPO, the model must learn to generate high-reward responses through trial and error, guided by the reward signal.

Usage

The prompt dataset is loaded, split into train and evaluation subsets, tokenized using the left-padded tokenizer, and passed to PPOTrainer as train_dataset and eval_dataset.

Theoretical Basis

Left-Padded Tokenization

A critical detail for PPO training is that the tokenizer must use left-side padding (padding_side="left"). This is because:

  • The policy model generates text autoregressively from left to right.
  • With left padding, every prompt in a batch ends at the same position (the right edge), so the first generated token directly follows the last prompt token in every sequence.
  • With right padding, shorter prompts would have pad tokens sitting between the prompt and the generated tokens, so generation would start from a different position in each row and behave inconsistently.

The tokenization step converts raw text prompts into token ID sequences without padding (padding is applied later during batch collation).
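The collation-time padding described above can be sketched in plain Python. This is a minimal illustration, not the library's collator: the pad ID of 0 and the function name collate_left_padded are assumptions made for the example.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical pad token ID, chosen only for this illustration


def collate_left_padded(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad unpadded token ID sequences on the left to a common length."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        # Pad tokens go in front, so the prompt always ends at the right edge.
        input_ids.append([pad_id] * n_pad + ids)
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two tokenized prompts of different lengths, stored without padding.
prompts = [[11, 12, 13, 14], [21, 22]]
batch = collate_left_padded(prompts)

# The last column holds a real prompt token in every row,
# so generated tokens follow each prompt directly.
last_tokens = [row[-1] for row in batch["input_ids"]]
```

Here the shorter prompt becomes [0, 0, 21, 22], and the attention mask marks the two leading positions as padding.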

Prompt-Only Format

The dataset must contain a text field (typically "prompt" or "query") that provides the input for response generation. All other columns are removed during the mapping step to produce a clean dataset with only the "input_ids" column.
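As a rough sketch of the mapping step, the example below tokenizes a text field and drops every other column. The tokenizer is a toy stand-in and the names toy_tokenize and to_prompt_only are hypothetical. With the Hugging Face datasets library, the same effect is usually obtained with Dataset.map and its remove_columns argument.

```python
from typing import Callable, Dict, List


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one ID per whitespace-separated word."""
    return [sum(ord(ch) for ch in word) % 1000 + 1 for word in text.split()]


def to_prompt_only(
    rows: List[Dict[str, object]],
    tokenize: Callable[[str], List[int]],
    text_field: str = "prompt",
) -> List[Dict[str, List[int]]]:
    """Tokenize the prompt field and keep only the resulting input_ids column."""
    return [{"input_ids": tokenize(str(row[text_field]))} for row in rows]


raw_rows = [
    {"prompt": "Summarize this post", "label": "tl;dr", "id": 1},
    {"prompt": "Explain PPO", "label": "n/a", "id": 2},
]
prepared = to_prompt_only(raw_rows, toy_tokenize)
# Each prepared row now has exactly one key, "input_ids", and no padding.
```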

The prompt is the complete context presented to the model before generation begins. For chat-style interactions, this would include the system message and user query. For instruction-following tasks, this would include the instruction prefix.

Train/Eval Splitting

The dataset is split into training and evaluation subsets:

  • Evaluation set: A small held-out portion (typically 100 samples) used for periodic generation quality checks during training.
  • Training set: The remaining samples used for online rollout generation and PPO optimization.
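The split can be sketched as follows. Holding out the final samples, rather than a random subset, is an assumption made for this example, and split_train_eval is a hypothetical helper name.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def split_train_eval(samples: List[T], eval_samples: int = 100) -> Tuple[List[T], List[T]]:
    """Hold out the last `eval_samples` items for evaluation; train on the rest."""
    if not 0 < eval_samples < len(samples):
        raise ValueError("eval_samples must be positive and smaller than the dataset")
    cut = len(samples) - eval_samples
    return samples[:cut], samples[cut:]


# 1000 toy tokenized prompts; each holds a single token ID equal to its index.
all_prompts = [{"input_ids": [i]} for i in range(1000)]
train_set, eval_set = split_train_eval(all_prompts, eval_samples=100)
```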

The evaluation set serves a different purpose than in supervised learning. Rather than measuring loss on held-out data, evaluation in PPO involves:

  • Generating completions from evaluation prompts using near-greedy decoding (temperature close to 0).
  • Scoring completions with the reward model.
  • Logging generation examples for qualitative review.
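To see why a temperature close to 0 behaves like greedy decoding, the sketch below applies a temperature-scaled softmax to toy logits. The logits and the temperature value 0.01 are illustrative assumptions, not values taken from the trainer.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # toy next-token scores

# At temperature 1.0 the probability mass is spread across tokens.
probs_t1 = softmax_with_temperature(logits, 1.0)

# At a temperature near 0 almost all mass lands on the highest-scoring token,
# so sampling picks the argmax nearly every time.
probs_near_greedy = softmax_with_temperature(logits, 0.01)
```

Because sampling at this temperature almost always returns the top token, evaluation generations stay nearly deterministic and are easier to compare across checkpoints.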

Online vs Offline RL

PPO is an online RL algorithm, meaning the training data is generated by the current policy during training. This contrasts with offline methods (like DPO) where the training data is fixed. The prompt dataset serves as the seed for online data generation:

  1. A batch of prompts is sampled from the dataset.
  2. The policy model generates responses for each prompt.
  3. The reward model scores the generated responses.
  4. PPO uses these scores to update the policy.

This cycle repeats for the duration of training, with the policy expected to improve its generation quality over time.
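The data flow of the four steps can be sketched with placeholder callables. This shows only how prompts seed each rollout: the generate, score, and update functions below are toy stand-ins, and the update stub is not a PPO optimization step. All names are hypothetical.

```python
import random
from typing import Callable, List

Tokens = List[int]


def online_rollout_loop(
    prompts: List[Tokens],
    generate: Callable[[Tokens], Tokens],
    score: Callable[[Tokens, Tokens], float],
    update: Callable[[List[Tokens], List[Tokens], List[float]], None],
    batch_size: int,
    num_steps: int,
    seed: int = 0,
) -> List[float]:
    """Run the sample -> generate -> score -> update cycle; return mean reward per step."""
    rng = random.Random(seed)
    mean_rewards = []
    for _ in range(num_steps):
        batch = rng.sample(prompts, batch_size)                     # 1. sample prompts
        responses = [generate(p) for p in batch]                    # 2. policy generates
        rewards = [score(p, r) for p, r in zip(batch, responses)]   # 3. reward model scores
        update(batch, responses, rewards)                           # 4. policy update
        mean_rewards.append(sum(rewards) / len(rewards))
    return mean_rewards


# Toy stand-ins: the "policy" appends `length` tokens, the "reward model"
# rewards the number of generated tokens, and the "update" lengthens responses.
state = {"length": 1}


def toy_generate(prompt: Tokens) -> Tokens:
    return prompt + [7] * state["length"]


def toy_score(prompt: Tokens, response: Tokens) -> float:
    return float(len(response) - len(prompt))


def toy_update(batch: List[Tokens], responses: List[Tokens], rewards: List[float]) -> None:
    state["length"] += 1  # placeholder for the real PPO step


history = online_rollout_loop(
    [[1], [2, 3], [4, 5, 6], [7]], toy_generate, toy_score, toy_update,
    batch_size=2, num_steps=3,
)
```

Each step scores responses produced by the policy as it stands at that step, which is what distinguishes this loop from training on a fixed set of responses.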
