Principle: CarperAI Trlx SFT Configuration
| Knowledge Sources | |
|---|---|
| Domains | Supervised_Learning, NLP, Configuration |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
A configuration principle that defines the hyperparameters for supervised fine-tuning of language models on text or instruction-following datasets.
Description
Supervised Fine-Tuning (SFT) is the process of training a pre-trained language model on curated text data using the standard next-token prediction objective (cross-entropy loss). In the RLHF pipeline, SFT is typically the first stage: the base model is fine-tuned on demonstration data before reward model training and RL optimization. SFT configuration is simpler than PPO or ILQL since it does not require RL-specific parameters, but generation kwargs are still needed for periodic evaluation during training.
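The shape of an SFT configuration can be sketched as a plain dictionary mirroring the section layout of trlx's `TRLConfig` (train / model / tokenizer / optimizer / method). The specific values below are illustrative assumptions, not recommended defaults, and the trainer and method names are taken from trlx's conventions as best understood here.

```python
# Illustrative SFT configuration sketch, mirroring the section layout of
# trlx's TRLConfig. All values are example assumptions, not tuned defaults.
sft_config = {
    "train": {
        "seq_length": 1024,        # maximum sequence length for truncation
        "batch_size": 8,           # training batch size
        "epochs": 3,
        "trainer": "AccelerateSFTTrainer",
    },
    "model": {
        "model_path": "gpt2",      # base pre-trained model (example)
        "num_layers_unfrozen": -1, # -1 leaves all layers trainable
    },
    "tokenizer": {
        "tokenizer_path": "gpt2",
        "truncation_side": "right",
    },
    "optimizer": {
        "name": "adamw",
        "kwargs": {"lr": 1e-5},    # typical SFT range: 1e-5 to 1e-4
    },
    "method": {
        "name": "sftconfig",
        "gen_kwargs": {            # used for periodic evaluation
            "max_new_tokens": 128,
            "do_sample": True,
            "top_p": 0.95,
        },
    },
}
```

Note that, unlike PPO or ILQL configurations, the `method` section carries no RL-specific parameters; only the generation kwargs for evaluation remain.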
Usage
Use SFT configuration when you want to fine-tune a language model on a dataset of text samples or instruction-response pairs. SFT is appropriate when you have high-quality demonstration data and want the model to learn to produce similar outputs. It serves as the foundation stage in RLHF pipelines and as a standalone method for instruction tuning.
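As a concrete sketch of preparing instruction-response pairs for SFT, each pair can be flattened into a single training string. The instruction/response template below is a hypothetical formatting choice for illustration, not a trlx requirement.

```python
# Flatten instruction-response pairs into plain training samples.
# The "### Instruction / ### Response" template is a hypothetical example format.
def format_sample(instruction: str, response: str) -> str:
    return f"### Instruction:\n{instruction}\n### Response:\n{response}"

pairs = [
    ("Translate 'bonjour' to English.", "Hello."),
    ("Name the first stage of the RLHF pipeline.", "Supervised fine-tuning."),
]
samples = [format_sample(i, r) for i, r in pairs]
```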
Theoretical Basis
SFT minimizes the standard autoregressive cross-entropy loss over each training sequence $x = (x_1, \dots, x_T)$:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
For dialogue-format data with prompt-completion pairs, the loss is masked so that it is computed only on completion tokens:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t \in \mathcal{C}} \log p_\theta(x_t \mid x_{<t})$$

where $\mathcal{C}$ is the set of completion-token positions; prompt tokens contribute no gradient.
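The masking can be illustrated in plain Python: given per-token log-probabilities and a 0/1 mask marking completion positions, only masked positions contribute to the mean loss. This is a minimal sketch; real implementations operate on logits tensors in the training framework.

```python
# Minimal sketch of a completion-masked cross-entropy (negative log-likelihood).
# token_logprobs[t] = log p(x_t | x_<t); mask[t] = 1 for completion tokens.
def masked_nll(token_logprobs, mask):
    assert len(token_logprobs) == len(mask)
    total = -sum(lp for lp, m in zip(token_logprobs, mask) if m)
    n = sum(mask)
    return total / n if n else 0.0

# Two prompt tokens (mask 0) are excluded; only the completion tokens count.
loss = masked_nll([-0.1, -0.2, -0.5, -0.7], [0, 0, 1, 1])  # (0.5 + 0.7) / 2
```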
Key configuration concerns:
- seq_length → Maximum sequence length for truncation
- batch_size → Training batch size (affects gradient noise)
- learning_rate → Controls step size (typically 1e-5 to 1e-4)
- num_layers_unfrozen → Controls partial freezing (-1 for all layers)
- gen_kwargs → Generation parameters for periodic evaluation
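The semantics of `num_layers_unfrozen` can be sketched as follows: -1 leaves every layer trainable, while a non-negative N freezes all but the last N layers. The helper name is hypothetical; trlx applies this freezing internally when building the model.

```python
# Sketch of partial freezing controlled by num_layers_unfrozen.
# -1 trains every layer; N >= 0 trains only the last N layers.
def trainable_layer_indices(total_layers: int, num_layers_unfrozen: int) -> list:
    if num_layers_unfrozen < 0:
        return list(range(total_layers))
    return list(range(total_layers - num_layers_unfrozen, total_layers))

# With 12 transformer layers and 2 unfrozen, only layers 10 and 11 train.
last_two = trainable_layer_indices(12, 2)
```

Freezing earlier layers reduces memory and compute at the cost of adaptation capacity, which is why -1 (full fine-tuning) is the common SFT default.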