Principle:Deepspeedai DeepSpeed Supervised Fine Tuning

Overview

Supervised fine-tuning is the first phase of the RLHF pipeline, in which a pretrained language model is fine-tuned on high-quality demonstration data using standard supervised learning.

Description

Supervised Fine-Tuning (SFT) adapts a pretrained language model to follow instructions by training on curated prompt-response pairs. This creates the base policy model that will be further refined through reinforcement learning. In the DeepSpeed RLHF pipeline, SFT uses standard deepspeed.initialize() with ZeRO Stage 2 or 3 optimization. No hybrid engine is needed at this stage since there is no inference or generation component.
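As a rough illustration of the setup described above, the following sketch builds a ZeRO Stage 2 or 3 configuration and hands it to deepspeed.initialize(). The helper names build_sft_ds_config and init_sft_engine, and all numeric values, are assumptions chosen for illustration; they are not taken from the DeepSpeed-Chat scripts.

```python
def build_sft_ds_config(zero_stage=2, micro_batch_size=4,
                        grad_accum_steps=8, use_bf16=False):
    """Return a DeepSpeed config dict for SFT (illustrative values only)."""
    if zero_stage not in (2, 3):
        raise ValueError("this sketch only covers ZeRO Stage 2 or 3")
    # Mixed precision: enable exactly one of fp16 / bf16.
    precision_key = "bf16" if use_bf16 else "fp16"
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": {"stage": zero_stage},
        precision_key: {"enabled": True},
        "gradient_clipping": 1.0,
        # Deliberately no hybrid engine section: SFT never generates text.
    }


def init_sft_engine(model, optimizer, ds_config):
    """Wrap a PyTorch model and optimizer in a standard DeepSpeedEngine."""
    # Imported lazily so the config helper above works without DeepSpeed.
    import deepspeed

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, optimizer=optimizer, config=ds_config
    )
    return engine, optimizer
```

A script started with the deepspeed launcher would typically call build_sft_ds_config() once and pass the result to init_sft_engine() together with the pretrained model and its optimizer.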

The SFT phase is the foundation of the entire RLHF training pipeline. It transforms a general-purpose pretrained model into one that can produce coherent, instruction-following responses. The quality of the SFT model strongly constrains what the subsequent reinforcement learning phase can achieve, since RL starts from the SFT policy rather than from scratch. Because SFT involves only forward and backward passes without any text generation, it can rely entirely on the standard DeepSpeedEngine without the overhead of inference containers or switching between training and inference modes.

In the DeepSpeed-Chat framework, SFT is referred to as Step 1. The resulting checkpoint initializes the actor model in Step 3 (the RLHF phase). The reward model trained in Step 2 can also start from an SFT checkpoint, as in InstructGPT, although the DeepSpeed-Chat example scripts commonly start it from a separate, smaller pretrained model. The training loop follows the standard pattern: forward pass, loss computation via cross-entropy, backward pass, and optimizer step, all orchestrated by the DeepSpeedEngine with ZeRO memory optimization and mixed-precision training.
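The loop pattern described above can be sketched as follows. The function name sft_train_epoch is hypothetical. The sketch assumes each batch is a dict that already includes labels and sits on the right device, and that the wrapped model returns an object with a loss attribute (the Hugging Face causal LM convention). The only DeepSpeedEngine calls used are the forward call, backward(), and step().

```python
def sft_train_epoch(engine, dataloader):
    """Run one SFT epoch and return the mean loss over batches."""
    total_loss, num_batches = 0.0, 0
    for batch in dataloader:
        # Forward pass; the model computes cross-entropy from batch["labels"].
        outputs = engine(**batch)
        loss = outputs.loss
        # Backward pass through the engine so ZeRO and loss scaling apply.
        engine.backward(loss)
        # Optimizer step; the engine handles gradient accumulation boundaries.
        engine.step()
        total_loss += float(loss)
        num_batches += 1
    # Guard against an empty dataloader.
    return total_loss / max(num_batches, 1)
```

Because the function only relies on those three calls, it can be exercised with a small stub object in unit tests before running on real hardware.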

Theoretical Basis

SFT is grounded in maximum likelihood estimation on demonstration data. The objective is to minimize the negative log-likelihood of target responses given input prompts:

L = -sum_{t=1}^{T} log P(y_t | y_<t, x)

where x is the input prompt, y = (y_1, ..., y_T) is the target response of T tokens, and y_<t denotes the response tokens before position t. This is the standard cross-entropy loss with teacher forcing, where at each step the model conditions on the ground-truth prefix rather than its own previous predictions.
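To make the objective concrete, here is a minimal pure-Python sketch of the negative log-likelihood over response tokens. The function name sft_nll and the response_mask argument are illustrative assumptions. Whether prompt positions are excluded from the loss is an implementation choice; the mask here simply follows the formula, which sums only over the response.

```python
import math


def sft_nll(token_probs, response_mask):
    """Negative log-likelihood summed over response tokens.

    token_probs[t]   probability the model assigns to the ground-truth token
                     at position t, given the ground-truth prefix
                     (teacher forcing).
    response_mask[t] 1 if position t belongs to the response y,
                     0 if it belongs to the prompt x.
    """
    if len(token_probs) != len(response_mask):
        raise ValueError("token_probs and response_mask must have equal length")
    # Only response positions contribute to the sum.
    return -sum(math.log(p) for p, m in zip(token_probs, response_mask) if m)


# One prompt position followed by two response positions.
loss = sft_nll([0.9, 0.5, 0.25], [0, 1, 1])
```

In this example the loss is -(log 0.5 + log 0.25) = 3 log 2, roughly 2.079; the prompt position with probability 0.9 does not contribute.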

The SFT loss encourages the model to assign high probability to the exact tokens in the demonstration data. While this does not directly optimize for response quality or alignment with human preferences, it provides a strong initialization point. The model learns the basic structure of instruction-following responses, which is essential for the subsequent reward modeling and PPO stages to be effective.

References

InstructGPT: Training language models to follow instructions with human feedback — https://arxiv.org/abs/2203.02155

Related Pages

Implementation:Deepspeedai_DeepSpeed_Initialize_For_SFT

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
