Principle: Microsoft DeepSpeedExamples RLHF Data Preparation
Overview
A data processing methodology that prepares training datasets across three RLHF phases: supervised fine-tuning, reward modeling, and reinforcement learning.
Description
Reinforcement Learning from Human Feedback (RLHF) training requires different data formats for each of its three phases. A single underlying corpus of human-generated text and preference annotations must be transformed into phase-specific structures before it can be consumed by the corresponding training objective.
Phase 1 -- Supervised Fine-Tuning (SFT) needs prompt-response pairs. Each training example is the concatenation of a human prompt and the desired assistant response. The model is trained with a standard language-modeling (next-token prediction) loss over the full sequence. Formally, a sample takes the form (prompt + chosen_response), and a label mask is applied so that loss is computed on every token where the attention mask is active.
Phase 2 -- Reward Modeling needs chosen/rejected pairs. Each training example contains a prompt together with two candidate responses: one that human annotators preferred (chosen) and one they did not (rejected). The reward model learns to assign a higher scalar score to the chosen response. A sample takes the form (prompt + chosen_response, prompt + rejected_response).
Phase 3 -- RLHF / PPO needs prompts only. During the reinforcement-learning phase the actor model generates its own completions, which are then scored by the reward model. Therefore the dataset only supplies the prompt text. A sample takes the form (prompt). Prompts that exceed the maximum sequence length are filtered out, and token sequences are reversed (flipped) so that left-padding aligns correctly for generation.
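The flip-for-left-padding trick can be illustrated with a small sketch (the `flip` and `pad_right` helpers here are illustrative, not part of the source): reversing the prompt before a standard right-padding collator, then reversing again, leaves the pad tokens on the left, where autoregressive generation expects them.

```python
def flip(seq):
    """Reverse a token sequence."""
    return seq[::-1]

def pad_right(seq, length, pad_id=0):
    """Append pad tokens on the right, as a standard batch collator would."""
    return seq + [pad_id] * (length - len(seq))

# A prompt tokenized to [101, 7592, 102]:
prompt = [101, 7592, 102]

# Reverse before padding, right-pad in the collator, then reverse again --
# the pad tokens end up on the LEFT, so the prompt sits flush against the
# tokens the actor model is about to generate.
left_padded = flip(pad_right(flip(prompt), 5))
# -> [0, 0, 101, 7592, 102]
```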
The data preparation step must abstract over 15+ dataset sources (including Dahoas/rm-static, Dahoas/full-hh-rlhf, openai/webgpt_comparisons, stanfordnlp/SHP, and multilingual corpora such as wangrui6/Zhihu-KOL and Cohere/miracl-zh-queries-22-12) and produce a unified interface through a common base class (PromptRawDataset). Every dataset adapter implements the same set of accessor methods -- get_prompt, get_chosen, get_rejected, get_prompt_and_chosen, get_prompt_and_rejected -- so the downstream pipeline can treat all sources identically regardless of their native schema.
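A minimal sketch of this adapter layer (the base class name follows the source; the concrete subclass and its field names mirror the Dahoas/rm-static schema but are illustrative):

```python
class PromptRawDataset:
    """Common interface that every dataset adapter implements."""
    def get_prompt(self, sample): raise NotImplementedError
    def get_chosen(self, sample): raise NotImplementedError
    def get_rejected(self, sample): raise NotImplementedError
    def get_prompt_and_chosen(self, sample): raise NotImplementedError
    def get_prompt_and_rejected(self, sample): raise NotImplementedError

class RmStaticDataset(PromptRawDataset):
    """Adapter for a source whose native schema already exposes
    'prompt'/'chosen'/'rejected' fields."""
    def get_prompt(self, sample): return sample["prompt"]
    def get_chosen(self, sample): return sample["chosen"]
    def get_rejected(self, sample): return sample["rejected"]
    def get_prompt_and_chosen(self, sample):
        return sample["prompt"] + sample["chosen"]
    def get_prompt_and_rejected(self, sample):
        return sample["prompt"] + sample["rejected"]
```

Because the downstream pipeline only calls the five accessors, a new source with a different native schema needs only a new subclass, not any pipeline changes.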
The same underlying dataset is split across the three phases using a configurable ratio string (e.g., "2,4,4" meaning 20% for SFT, 40% for reward modeling, 40% for RLHF). This ensures that each phase trains on non-overlapping portions of the data, which prevents data leakage between stages.
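A sketch of how such a ratio string can be turned into non-overlapping index slices (the function name is illustrative; the real pipeline additionally shuffles with a fixed seed and caches the result for reproducibility):

```python
def get_split_indices(data_size, split_ratio):
    """Partition [0, data_size) into contiguous, non-overlapping phase
    slices according to a ratio string such as "2,4,4"."""
    ratios = [float(r) for r in split_ratio.split(",")]
    total = sum(ratios)
    splits, start = [], 0
    for r in ratios:
        end = start + int(data_size * r / total)
        splits.append(list(range(start, end)))
        start = end
    # Give any rounding remainder to the last phase.
    if splits:
        splits[-1].extend(range(start, data_size))
    return splits

sft_idx, rm_idx, rlhf_idx = get_split_indices(10, "2,4,4")
# -> 2, 4, and 4 indices respectively, with no overlap
```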
Usage
Use this principle when building multi-phase alignment training pipelines where each phase requires different data formatting from the same underlying datasets. It is applicable whenever:
- A single data corpus must be partitioned and reformatted across multiple training stages.
- Multiple heterogeneous dataset sources must be unified behind a common access API.
- Distributed training requires deterministic, cached, and reproducible data splits.
Theoretical Basis
The three-phase data paradigm follows the InstructGPT training procedure (Ouyang et al., 2022):
Phase 1: Supervised Fine-Tuning (SFT)
Uses (prompt, response) pairs for language modeling. The model learns to produce helpful responses by minimizing the negative log-likelihood:
L_SFT = -E_(x,y)~D_sft [ sum_t log P(y_t | x, y_<t) ]
where x is the prompt, y is the chosen response, and the sum runs over all tokens.
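A toy numerical illustration of this loss (pure Python; the per-token probabilities are made up, and the mask plays the role of the attention mask in the pseudocode below):

```python
import math

def sft_loss(token_log_probs, loss_mask):
    """L_SFT = -sum_t log P(y_t | x, y_<t), summed over tokens where
    the mask is active."""
    return -sum(lp for lp, m in zip(token_log_probs, loss_mask) if m)

# Toy example: three response tokens to which the model assigns
# probabilities 0.5, 0.25, and 0.8, all unmasked.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
loss = sft_loss(log_probs, [1, 1, 1])
# Equals -log(0.5 * 0.25 * 0.8) = -log(0.1)
```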
Phase 2: Reward Modeling
Uses (prompt, chosen, rejected) triples for preference ranking. The reward model is trained with a pairwise ranking loss:
L_RM = -E_(x,y_w,y_l)~D_rm [ log sigma( r(x, y_w) - r(x, y_l) ) ]
where y_w is the preferred (chosen) response and y_l is the rejected response.
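This pairwise ranking loss is simple enough to sketch directly (pure Python; scalar scores stand in for the reward model's outputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_pairwise_loss(score_chosen, score_rejected):
    """L_RM = -log sigma(r(x, y_w) - r(x, y_l)) for one preference pair."""
    return -math.log(sigmoid(score_chosen - score_rejected))

# When the reward model already ranks the chosen response higher, the
# loss is small; when it ranks it lower, the loss grows.
good = reward_pairwise_loss(2.0, -1.0)   # small
bad = reward_pairwise_loss(-1.0, 2.0)    # large
```

At equal scores the loss is exactly log 2, and it falls toward zero as the margin between chosen and rejected grows.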
Phase 3: Reinforcement Learning (PPO)
Uses (prompt) only for generation + reward scoring. The actor generates completions that the reward model scores, and the policy is updated via PPO:
L_PPO = E_x~D_rlhf, y~pi_theta(y|x) [ min( r_t * A(x, y), clip(r_t, 1-eps, 1+eps) * A(x, y) ) ]
where r_t = pi_theta(y_t | x, y_<t) / pi_old(y_t | x, y_<t) is the probability ratio between the current and old policies, and A(x, y) is the advantage estimate derived from the reward model scores. Taking the min over the two products (rather than clipping alone) makes the objective pessimistic in both the positive- and negative-advantage cases.
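The per-token clipped term of the standard PPO surrogate can be sketched as follows (pure Python; `eps=0.2` is a common default, not a value from the source):

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Per-token PPO objective: min(r_t * A, clip(r_t, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, gains from pushing the ratio above 1+eps are
# clipped away, so a single update cannot move the policy too far.
ppo_clipped_term(1.5, 1.0)   # -> 1.2 (clipped at 1 + 0.2)
ppo_clipped_term(1.5, -1.0)  # -> -1.5 (the min keeps the worse value)
```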
Data Routing Pseudocode
The following pseudocode illustrates how data is routed to the correct format depending on the active training phase:
def prepare_data(sample, raw_dataset, train_phase, tokenizer, max_seq_len, eos_token):
    """Route a raw sample to the correct format for the given training phase."""
    if train_phase == 1:
        # Phase 1 (SFT): concatenate prompt + chosen response and train
        # with a next-token prediction loss over the full sequence.
        text = raw_dataset.get_prompt_and_chosen(sample) + eos_token
        tokens = tokenizer(text, max_length=max_seq_len,
                           padding="max_length", truncation=True)
        # Loss is computed wherever the attention mask is active;
        # padding positions receive the ignore index -100.
        labels = where(tokens.attention_mask, tokens.input_ids, -100)
        return {"input_ids": tokens.input_ids,
                "attention_mask": tokens.attention_mask,
                "labels": labels}
    elif train_phase == 2:
        # Phase 2 (Reward): tokenize both the chosen and the rejected
        # continuation so the reward model can score the pair.
        chosen_text = raw_dataset.get_prompt_and_chosen(sample) + eos_token
        rejected_text = raw_dataset.get_prompt_and_rejected(sample) + eos_token
        chosen_tokens = tokenizer(chosen_text, max_length=max_seq_len,
                                  padding="max_length", truncation=True)
        rejected_tokens = tokenizer(rejected_text, max_length=max_seq_len,
                                    padding="max_length", truncation=True)
        return {
            "chosen_input_ids": chosen_tokens.input_ids,
            "chosen_attention_mask": chosen_tokens.attention_mask,
            "rejected_input_ids": rejected_tokens.input_ids,
            "rejected_attention_mask": rejected_tokens.attention_mask,
        }
    elif train_phase == 3:
        # Phase 3 (RLHF/PPO): prompt only; the actor generates the response.
        prompt_text = raw_dataset.get_prompt(sample)
        prompt_tokens = tokenizer(prompt_text)
        if len(prompt_tokens.input_ids) > max_seq_len:
            return None  # filter out prompts that are too long
        # Reverse the tokens so that a right-padding collator, followed by
        # a second flip, yields left-padded prompts for generation.
        prompt_tokens.input_ids = flip(prompt_tokens.input_ids)
        prompt_tokens.attention_mask = flip(prompt_tokens.attention_mask)
        return {"input_ids": prompt_tokens.input_ids,
                "attention_mask": prompt_tokens.attention_mask}