# Principle: OpenRLHF Supervised Fine-Tuning (SFT) Training
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
A training methodology that fine-tunes a pretrained language model on instruction-response demonstrations using supervised cross-entropy loss on response tokens.
## Description
Supervised Fine-Tuning (SFT) is typically the first stage of RLHF pipelines. It adapts a pretrained language model to follow instructions by training on curated demonstration data. The model learns to generate appropriate responses to prompts by minimizing the negative log-likelihood of response tokens, with prompt tokens masked from the loss.
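The prompt masking described above is commonly implemented by copying the input IDs into a label tensor and replacing prompt positions with an ignore index (the `-100` convention used by many frameworks). A minimal, framework-agnostic sketch (the function name and list-based representation are illustrative, not OpenRLHF's actual API):

```python
def build_sft_labels(input_ids, prompt_len, ignore_index=-100):
    """Build SFT labels from input token IDs.

    Prompt positions are set to ignore_index so the cross-entropy
    loss is computed only on response tokens.
    """
    return [ignore_index] * prompt_len + input_ids[prompt_len:]

# Example: 4 prompt tokens followed by 3 response tokens.
ids = [11, 12, 13, 14, 21, 22, 23]
labels = build_sft_labels(ids, prompt_len=4)
# labels -> [-100, -100, -100, -100, 21, 22, 23]
```

The loss function then skips every position whose label equals the ignore index, so gradients flow only through response predictions.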
SFT provides the initial policy for subsequent alignment stages (reward model training, PPO/DPO). The quality and diversity of the SFT dataset directly impacts the final aligned model's capabilities.
## Usage
Use SFT as the starting point for any RLHF pipeline, or as a standalone training method when sufficient high-quality demonstration data is available. SFT is also used in iterative training loops (e.g., rejection sampling, iterative DPO) to retrain the model on filtered data.
## Theoretical Basis
The SFT objective minimizes the token-level negative log-likelihood on response tokens:

$$\mathcal{L}_{\text{SFT}} = -\frac{1}{|R|} \sum_{t \in R} \log \pi_\theta\left(y_t \mid x, y_{<t}\right)$$

where $R$ is the set of response token indices and $\pi_\theta$ is the model's output distribution.
OpenRLHF supports two loss computation modes:
- Token-level: average the loss over all unmasked tokens across the batch
- Sequence-level: average the loss within each sequence, then average those per-sequence means over the batch
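The two modes differ whenever sequences have unequal response lengths: token-level weights every response token equally, while sequence-level weights every sequence equally. A small sketch using plain Python lists of per-token NLL values and a response mask (the function names and data layout are illustrative, not OpenRLHF's internal implementation):

```python
def token_level_loss(nll, mask):
    """Average NLL over all unmasked tokens in the batch."""
    total = sum(l for seq_l, seq_m in zip(nll, mask)
                for l, m in zip(seq_l, seq_m) if m)
    count = sum(m for seq_m in mask for m in seq_m)
    return total / count

def sequence_level_loss(nll, mask):
    """Compute each sequence's mean NLL, then average over sequences."""
    per_seq = []
    for seq_l, seq_m in zip(nll, mask):
        vals = [l for l, m in zip(seq_l, seq_m) if m]
        per_seq.append(sum(vals) / len(vals))
    return sum(per_seq) / len(per_seq)

# Two sequences with different response lengths:
nll  = [[0.5, 1.5], [2.0, 1.0, 3.0]]
mask = [[1, 1],     [1, 1, 1]]
# token-level:    (0.5 + 1.5 + 2.0 + 1.0 + 3.0) / 5 = 1.6
# sequence-level: ((0.5 + 1.5)/2 + (2.0 + 1.0 + 3.0)/3) / 2 = 1.5
```

In the example, the longer sequence has a higher mean loss, so token-level averaging (which gives it more weight) yields a larger value than sequence-level averaging.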