
Principle:Microsoft DeepSpeedExamples RLHF Data Preparation

From Leeroopedia



Overview

A data processing methodology that prepares training datasets across three RLHF phases: supervised fine-tuning, reward modeling, and reinforcement learning.

Description

Reinforcement Learning from Human Feedback (RLHF) training requires different data formats for each of its three phases. A single underlying corpus of human-generated text and preference annotations must be transformed into phase-specific structures before it can be consumed by the corresponding training objective.

Phase 1 -- Supervised Fine-Tuning (SFT) needs prompt-response pairs. Each training example is the concatenation of a human prompt and the desired assistant response. The model is trained with a standard language-modeling (next-token prediction) loss over the full sequence. Formally, a sample takes the form (prompt + chosen_response), and a label mask restricts the loss to tokens where the attention mask is active; padding positions are excluded, conventionally by setting their label to -100.

Phase 2 -- Reward Modeling needs chosen/rejected pairs. Each training example contains a prompt together with two candidate responses: one that human annotators preferred (chosen) and one they did not (rejected). The reward model learns to assign a higher scalar score to the chosen response. A sample takes the form (prompt + chosen_response, prompt + rejected_response).

Phase 3 -- RLHF / PPO needs prompts only. During the reinforcement-learning phase the actor model generates its own completions, which are then scored by the reward model. Therefore the dataset only supplies the prompt text. A sample takes the form (prompt). Prompts that exceed the maximum sequence length are filtered out, and token sequences are reversed (flipped) so that left-padding aligns correctly for generation.
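The flip trick can be illustrated with plain Python lists: reversing the prompt lets an ordinary right-padding collator place pad tokens at what becomes the left side once the sequence is flipped back. This is a sketch of the idea only; the function name and pad handling below are illustrative, not the DeepSpeedExamples API.

```python
def flip_for_left_padding(input_ids, pad_id, max_len):
    """Reverse a prompt so that ordinary right-padding, followed by a
    final flip, yields a left-padded sequence ready for generation."""
    reversed_ids = input_ids[::-1]                                    # flip the prompt
    padded = reversed_ids + [pad_id] * (max_len - len(reversed_ids))  # right-pad as usual
    return padded[::-1]                                               # flip back: pads end up on the left

print(flip_for_left_padding([1, 2, 3], pad_id=0, max_len=5))  # -> [0, 0, 1, 2, 3]
```

Left-padding matters here because autoregressive generation continues from the last position of the prompt, which must therefore sit flush against the right edge of the buffer.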

The data preparation step must abstract over 15+ dataset sources (including Dahoas/rm-static, Dahoas/full-hh-rlhf, openai/webgpt_comparisons, stanfordnlp/SHP, and multilingual corpora such as wangrui6/Zhihu-KOL and Cohere/miracl-zh-queries-22-12) and produce a unified interface through a common base class (PromptRawDataset). Every dataset adapter implements the same set of accessor methods -- get_prompt, get_chosen, get_rejected, get_prompt_and_chosen, get_prompt_and_rejected -- so the downstream pipeline can treat all sources identically regardless of their native schema.
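A minimal sketch of this adapter pattern follows. The base class and accessor names come from the description above; the concrete dictionary keys used by the example adapter are an assumption about one source's native schema, not a verified transcription of the DeepSpeedExamples code.

```python
class PromptRawDataset:
    """Common interface that every dataset adapter implements."""
    def get_prompt(self, sample): raise NotImplementedError
    def get_chosen(self, sample): raise NotImplementedError
    def get_rejected(self, sample): raise NotImplementedError
    def get_prompt_and_chosen(self, sample): raise NotImplementedError
    def get_prompt_and_rejected(self, sample): raise NotImplementedError


class ExamplePreferenceDataset(PromptRawDataset):
    """Hypothetical adapter for a source whose native schema exposes
    'prompt', 'chosen', and 'rejected' fields per sample."""
    def get_prompt(self, sample): return sample["prompt"]
    def get_chosen(self, sample): return sample["chosen"]
    def get_rejected(self, sample): return sample["rejected"]
    def get_prompt_and_chosen(self, sample):
        return sample["prompt"] + sample["chosen"]
    def get_prompt_and_rejected(self, sample):
        return sample["prompt"] + sample["rejected"]
```

Because the downstream pipeline calls only these five accessors, adding a new source means writing one small adapter rather than touching any phase-specific code.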

The same underlying dataset is split across the three phases using a configurable ratio string (e.g., "2,4,4" meaning 20% for SFT, 40% for reward modeling, 40% for RLHF). This ensures that each phase trains on non-overlapping portions of the data, which prevents data leakage between stages.
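The split can be sketched as a deterministic, seeded partition of sample indices. The helper name and seed value below are illustrative; the exact DeepSpeedExamples implementation differs, but the invariants (ratio-proportional sizes, non-overlapping indices, reproducibility) are the point.

```python
import random

def split_by_ratio(num_samples, split_string, seed=1234):
    """Partition sample indices across phases per a ratio string like "2,4,4".
    A fixed seed makes the shuffle, and hence the split, reproducible."""
    ratios = [float(p) for p in split_string.split(",")]
    total = sum(ratios)
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)   # deterministic shuffle
    splits, start = [], 0
    for r in ratios:
        end = start + int(num_samples * r / total)
        splits.append(indices[start:end])  # contiguous slice => no overlap
        start = end
    return splits

sft_idx, rm_idx, rl_idx = split_by_ratio(10, "2,4,4")
```

Slicing a single shuffled index list guarantees the phase subsets are disjoint by construction, which is exactly the leakage-prevention property described above.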

Usage

Use this principle when building multi-phase alignment training pipelines where each phase requires different data formatting from the same underlying datasets. It is applicable whenever:

  • A single data corpus must be partitioned and reformatted across multiple training stages.
  • Multiple heterogeneous dataset sources must be unified behind a common access API.
  • Distributed training requires deterministic, cached, and reproducible data splits.

Theoretical Basis

The three-phase data paradigm follows the InstructGPT training procedure (Ouyang et al., 2022):

Phase 1: Supervised Fine-Tuning (SFT)

Uses (prompt, response) pairs for language modeling. The model learns to produce helpful responses by minimizing the negative log-likelihood of the chosen response:

L_SFT = -E_(x,y)~D_sft [ sum_t log P(y_t | x, y_<t) ]

where x is the prompt, y is the chosen response, and the sum runs over all tokens.
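As a toy numeric check of L_SFT, the loss reduces to a masked sum of negative per-token log-probabilities. The values below are hypothetical; a real implementation sums log-probs produced by the model, not hand-written constants.

```python
def sft_loss(token_logprobs, loss_mask):
    """Negative log-likelihood over unmasked tokens, mirroring L_SFT.
    token_logprobs: per-token log P(y_t | x, y_<t); loss_mask: 1 where
    the loss applies, 0 for padding/ignored positions."""
    return -sum(lp for lp, m in zip(token_logprobs, loss_mask) if m)

# Two response tokens counted, one padding position masked out:
loss = sft_loss([-0.1, -0.2, -5.0], [1, 1, 0])  # -> 0.3 (approximately)
```

Note how the mask keeps the padded position's (large) log-probability term from polluting the loss, which is the role the -100 label convention plays in practice.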

Phase 2: Reward Modeling

Uses (prompt, chosen, rejected) triples for preference ranking. The reward model is trained with a pairwise ranking loss:

L_RM = -E_(x,y_w,y_l)~D_rm [ log sigma( r(x, y_w) - r(x, y_l) ) ]

where y_w is the preferred (chosen) response and y_l is the rejected response.
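On two scalar scores the pairwise ranking loss reduces to -log sigma(r_w - r_l), which a few lines make concrete (toy scores standing in for reward-model outputs; this is a sketch, not the batched training code):

```python
import math

def reward_pair_loss(score_chosen, score_rejected):
    """Pairwise ranking loss -log sigmoid(r_w - r_l) on scalar scores."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Equal scores give loss log(2); a correctly ordered pair gives a smaller loss.
tie = reward_pair_loss(1.0, 1.0)        # -> log(2) ~ 0.693
correct = reward_pair_loss(2.0, 0.0)    # smaller: chosen already scored higher
```

Minimizing this loss pushes the margin r_w - r_l upward, which is exactly the "assign a higher scalar score to the chosen response" behavior described above.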

Phase 3: Reinforcement Learning (PPO)

Uses (prompt) only for generation + reward scoring. The actor generates completions that the reward model scores, and the policy is updated via PPO:

L_PPO = E_x~D_rlhf, y~pi_theta(y|x) [ min( r_t(theta) * A(x, y), clip(r_t(theta), 1-eps, 1+eps) * A(x, y) ) ]

where r_t(theta) = pi_theta(y_t | x, y_<t) / pi_old(y_t | x, y_<t) is the probability ratio between the current policy and the policy before the update, and A(x, y) is the advantage estimate derived from the reward model scores.
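The per-token clipped surrogate can be sketched in a few lines (a scalar simplification of the batched tensor computation; the eps value is the conventional default, assumed here rather than taken from the source):

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one token:
    min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# With A > 0, gains from pushing the ratio above 1+eps are clipped away;
# with A < 0, the min picks the more pessimistic (lower) value.
pos = ppo_clipped_term(1.5, 1.0)    # -> 1.2, not 1.5
neg = ppo_clipped_term(0.5, -1.0)   # -> -0.8, not -0.5
```

The clipping is what keeps each policy update close to the policy that generated the rollouts, stabilizing training.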

Data Routing Pseudocode

The following pseudocode illustrates how data is routed to the correct format depending on the active training phase:

def prepare_data(sample, raw_dataset, train_phase, tokenizer, max_seq_len, eos_token):
    """Route a raw sample to the correct format for the given training phase."""

    if train_phase == 1:
        # Phase 1 (SFT): concatenate prompt + chosen response
        text = raw_dataset.get_prompt_and_chosen(sample) + eos_token
        tokens = tokenizer(text, max_length=max_seq_len, padding="max_length", truncation=True)
        # Mask padding positions with -100 so the LM loss ignores them
        labels = [tok if mask else -100
                  for tok, mask in zip(tokens.input_ids, tokens.attention_mask)]
        return {"input_ids": tokens.input_ids, "attention_mask": tokens.attention_mask, "labels": labels}

    elif train_phase == 2:
        # Phase 2 (Reward): tokenize both the chosen and the rejected response
        chosen_text = raw_dataset.get_prompt_and_chosen(sample) + eos_token
        rejected_text = raw_dataset.get_prompt_and_rejected(sample) + eos_token
        chosen_tokens = tokenizer(chosen_text, max_length=max_seq_len, padding="max_length", truncation=True)
        rejected_tokens = tokenizer(rejected_text, max_length=max_seq_len, padding="max_length", truncation=True)
        return {
            "chosen_input_ids": chosen_tokens.input_ids,
            "chosen_attention_mask": chosen_tokens.attention_mask,
            "rejected_input_ids": rejected_tokens.input_ids,
            "rejected_attention_mask": rejected_tokens.attention_mask,
        }

    elif train_phase == 3:
        # Phase 3 (RLHF/PPO): prompt only, reversed so left-padding aligns for generation
        prompt_tokens = tokenizer(raw_dataset.get_prompt(sample))
        if len(prompt_tokens.input_ids) > max_seq_len:
            return None  # filter out prompts that are too long
        return {"input_ids": prompt_tokens.input_ids[::-1],
                "attention_mask": prompt_tokens.attention_mask[::-1]}
