Principle: CarperAI trlx Supervised Fine-Tuning
| Knowledge Sources | Details |
|---|---|
| Domains | Supervised_Learning, NLP, Training |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
A training principle for fine-tuning language models on curated text or instruction-following datasets using the standard next-token prediction objective.
Description
Supervised Fine-Tuning (SFT) adapts a pre-trained language model to a specific task or style by training on demonstration data with cross-entropy loss. In the RLHF pipeline, SFT is the first stage that teaches the model the basic format and quality of desired outputs before RL optimization refines it further. SFT can also be used standalone for instruction tuning, domain adaptation, or style transfer.
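A minimal sketch of standalone SFT with trlx. This assumes the `trlx.train` entry point and the `default_sft_config` helper from recent trlx releases; exact helper names and config fields may differ across versions, and the toy samples are made up:

```python
import trlx
from trlx.data.default_configs import default_sft_config

# Hypothetical toy demonstration data; in practice, load a curated dataset.
samples = [
    "Question: What is the capital of France?\nAnswer: Paris.",
    "Question: What is 2 + 2?\nAnswer: 4.",
]

config = default_sft_config()
config.model.model_path = "gpt2"           # base model to fine-tune
config.tokenizer.tokenizer_path = "gpt2"
config.train.total_steps = 100             # short run for illustration

# Plain-text samples: every token in each string is a training target.
trainer = trlx.train(samples=samples, config=config)
```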
trlx supports two SFT data formats: plain text strings (where the entire sequence is used as training target) and prompt-completion pairs (where loss is masked to only compute on completion tokens). The latter uses a DialogStore that tracks which tokens are prompt vs. output via a dialogue tokenization scheme.
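To make the two formats concrete, here is what each looks like as Python data (the sample content itself is invented for illustration):

```python
# Plain text: the full sequence is the training target.
samples_plain = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
]

# Prompt-completion pairs: loss is computed only on the output halves.
# Each inner list alternates [prompt, output, prompt, output, ...].
samples_dialogue = [
    ["Translate to French: cat", " chat"],
    ["Translate to French: dog", " chien"],
]
```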
Usage
Use SFT when you have high-quality demonstration data and want the model to learn to produce similar outputs. SFT is appropriate for: (1) the first stage of an RLHF pipeline, (2) standalone instruction tuning, (3) domain adaptation with in-domain text. Prefer SFT over RL when you have enough demonstration data and do not need to optimize against a specific reward signal.
Theoretical Basis
SFT minimizes the masked negative log-likelihood of target tokens:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} m_t \log p_\theta(x_t \mid x_{<t})$$

where $m_t = 1$ for completion tokens and $m_t = 0$ (label = -100) for prompt tokens in dialogue format; in plain-text mode, $m_t = 1$ for all tokens.
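The label = -100 convention matches PyTorch's `cross_entropy(ignore_index=-100)`, which the Hugging Face causal-LM loss also uses internally. A small sketch of how prompt masking affects the loss (all tensor values are toy data):

```python
import torch
import torch.nn.functional as F

# Toy logits for a 6-token sequence over a 10-token vocabulary.
logits = torch.randn(1, 6, 10)
labels = torch.tensor([[3, 7, 2, 5, 1, 4]])

# Mask the first three positions (the prompt) with -100 so they
# contribute nothing to the loss; only completion tokens are scored.
masked_labels = labels.clone()
masked_labels[:, :3] = -100

# Shift so the logits at position t predict the token at t+1,
# as in standard causal LM training.
shift_logits = logits[:, :-1, :].reshape(-1, 10)
shift_labels = masked_labels[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(loss)
```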
Two data modes in trlx (a masking sketch follows the list):
- Plain text: List[str] → all tokens used as targets via PromptPipeline
- Dialogue pairs: List[List[str]] → alternating [prompt, output, prompt, output, ...] → loss masked on prompt tokens via DialogStore
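A minimal sketch of dialogue-style label masking in the spirit of DialogStore. This reimplements the masking logic for illustration only and is not trlx's actual code; the helper name `build_masked_labels` is hypothetical:

```python
from typing import Dict, List

from transformers import AutoTokenizer


def build_masked_labels(dialogue: List[str], tokenizer) -> Dict[str, List[int]]:
    """Tokenize alternating [prompt, output, ...] segments and build
    labels in which prompt tokens are masked with -100."""
    input_ids: List[int] = []
    labels: List[int] = []
    for i, segment in enumerate(dialogue):
        ids = tokenizer(segment, add_special_tokens=False).input_ids
        input_ids.extend(ids)
        # Even-indexed segments are prompts (masked out of the loss);
        # odd-indexed segments are outputs (kept as targets).
        labels.extend([-100] * len(ids) if i % 2 == 0 else ids)
    return {"input_ids": input_ids, "labels": labels}


tok = AutoTokenizer.from_pretrained("gpt2")
example = build_masked_labels(["Q: 2+2=", " 4"], tok)
print(example["labels"])  # -100s for the prompt, token ids for " 4"
```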