Principle:CarperAI Trlx Supervised Fine Tuning

From Leeroopedia


Knowledge Sources
Domains Supervised_Learning, NLP, Training
Last Updated 2026-02-07 16:00 GMT

Overview

A training principle for fine-tuning language models on curated text or instruction-following datasets using the standard next-token prediction objective.

Description

Supervised Fine-Tuning (SFT) adapts a pre-trained language model to a specific task or style by training on demonstration data with cross-entropy loss. In the RLHF pipeline, SFT is the first stage that teaches the model the basic format and quality of desired outputs before RL optimization refines it further. SFT can also be used standalone for instruction tuning, domain adaptation, or style transfer.

trlx supports two SFT data formats: plain text strings, where the entire sequence is the training target, and prompt-completion pairs, where the loss is computed only on completion tokens. The latter uses a DialogStore, which tracks which tokens belong to the prompt and which to the output via a dialogue tokenization scheme.

Usage

Use SFT when you have high-quality demonstration data and want the model to learn to produce similar outputs. SFT is appropriate as (1) the first stage of an RLHF pipeline, (2) a standalone instruction-tuning method, or (3) a domain-adaptation step using in-domain text. Prefer SFT over RL when you have enough demonstration data and do not need to optimize for a specific reward signal.

Theoretical Basis

SFT minimizes the negative log-likelihood of target tokens:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} w_t \log p_\theta(x_t \mid x_{<t})$$

where w_t = 1 for completion tokens and w_t = 0 (label = -100) for prompt tokens in dialogue format.
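To make the weighting concrete, here is a minimal sketch in plain Python of the masked sum above. The function name masked_nll and the toy probabilities are illustrative assumptions, not part of trlx. It treats a label of -100 as w_t = 0 and any other label as w_t = 1.

```python
import math

IGNORE_INDEX = -100  # label value marking a position as excluded from the loss


def masked_nll(token_log_probs, labels):
    """Return the sum over t of -w_t * log p(x_t | x_<t).

    w_t is 0 where the label equals IGNORE_INDEX and 1 elsewhere.
    `token_log_probs[t]` is the model's log-probability of the true token t.
    """
    total = 0.0
    for log_p, label in zip(token_log_probs, labels):
        w = 0.0 if label == IGNORE_INDEX else 1.0
        total -= w * log_p
    return total


# Toy sequence of four tokens; the probabilities are made up for illustration.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]

# Dialogue format: the first two tokens are prompt tokens and are masked out.
dialogue_labels = [IGNORE_INDEX, IGNORE_INDEX, 7, 9]
dialogue_loss = masked_nll(log_probs, dialogue_labels)

# Plain-text format: every token is a target, so no label is masked.
plain_labels = [3, 5, 7, 9]
plain_loss = masked_nll(log_probs, plain_labels)
```

The sketch sums the per-token terms exactly as the formula does. Library loss functions often average over the unmasked positions instead (PyTorch's cross-entropy with ignore_index=-100 does so by default), which rescales the loss but does not change which tokens contribute.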

Two data modes in trlx:

  • Plain text: List[str] → All tokens used as targets via PromptPipeline
  • Dialogue pairs: List[List[str]] → Alternating [prompt, output, prompt, output, ...] → Loss masked on prompts via DialogStore
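The two modes can be sketched as label construction. The code below is a simplified illustration and not trlx's actual implementation: toy_tokenize is a hypothetical whitespace tokenizer, and build_dialogue_labels and build_plain_labels are made-up helper names. It omits truncation, special tokens, and padding, which a real pipeline would need.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value for positions excluded from the loss


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one non-negative id per word."""
    return [sum(ord(c) for c in word) % 1000 for word in text.split()]


def build_plain_labels(text: str) -> Tuple[List[int], List[int]]:
    """Plain-text mode: every token is a training target."""
    ids = toy_tokenize(text)
    return ids, list(ids)


def build_dialogue_labels(dialogue: List[str]) -> Tuple[List[int], List[int]]:
    """Dialogue mode: turns alternate [prompt, output, prompt, output, ...].

    Prompt turns (even positions) get IGNORE_INDEX labels; output turns
    (odd positions) keep their token ids as labels.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for turn_index, text in enumerate(dialogue):
        ids = toy_tokenize(text)
        is_output = turn_index % 2 == 1
        input_ids.extend(ids)
        labels.extend(ids if is_output else [IGNORE_INDEX] * len(ids))
    return input_ids, labels


plain_ids, plain_labels = build_plain_labels("the cat sat")
dialogue_ids, dialogue_labels = build_dialogue_labels(
    ["Translate to French: cat", "chat", "And dog ?", "chien"]
)
```

In the dialogue example, the four tokens of the first prompt and the three tokens of the second prompt are masked, so only the tokens of "chat" and "chien" contribute to the loss.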
