Principle:CarperAI Trlx SFT Configuration

From Leeroopedia


Knowledge Sources
Domains Supervised_Learning, NLP, Configuration
Last Updated 2026-02-07 16:00 GMT

Overview

A configuration principle that defines the hyperparameters for supervised fine-tuning of language models on text or instruction-following datasets.

Description

Supervised Fine-Tuning (SFT) is the process of training a pre-trained language model on curated text data using the standard next-token prediction objective (cross-entropy loss). In the RLHF pipeline, SFT is typically the first stage: the base model is fine-tuned on demonstration data before reward model training and RL optimization. SFT configuration is simpler than PPO or ILQL since it does not require RL-specific parameters, but generation kwargs are still needed for periodic evaluation during training.

Usage

Use SFT configuration when you want to fine-tune a language model on a dataset of text samples or instruction-response pairs. SFT is appropriate when you have high-quality demonstration data and want the model to learn to produce similar outputs. It serves as the foundation stage in RLHF pipelines and as a standalone method for instruction tuning.

Theoretical Basis

SFT minimizes the standard autoregressive cross-entropy loss:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

For dialogue-format data with prompt-completion pairs, the loss is masked to only compute on completion tokens:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t \in \text{completion}} \log p_\theta(x_t \mid x_{<t})$$
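The two losses above can be illustrated with a small, dependency-free sketch. It is not the trlx implementation: the function names `log_softmax` and `sft_loss`, the toy logits, and the 0/1 `loss_mask` convention are all assumptions made for illustration. The sketch sums the negative log-probabilities exactly as the formulas do; practical trainers often average over the counted tokens instead.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sft_loss(logits_per_step, targets, loss_mask=None):
    """Summed negative log-likelihood of the target tokens.

    logits_per_step[t] holds the model's scores for position t given x_<t.
    loss_mask[t] is 1 if position t counts toward the loss (a completion
    token) and 0 if it is skipped (a prompt token). With no mask, every
    position counts, which matches the unmasked formula.
    """
    if loss_mask is None:
        loss_mask = [1] * len(targets)
    total = 0.0
    for logits, target, keep in zip(logits_per_step, targets, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
    return total


# Toy example (hypothetical values): vocabulary of 3 tokens, 4 positions.
logits = [[0.0, 0.0, 0.0]] * 4      # uniform predictions at every step
targets = [2, 0, 1, 1]              # token ids x_1..x_4
prompt_mask = [0, 0, 1, 1]          # first two tokens are the prompt

full_loss = sft_loss(logits, targets)                 # all 4 tokens
masked_loss = sft_loss(logits, targets, prompt_mask)  # completion only
```

With uniform predictions each counted token contributes log(3), so the masked loss is exactly half of the full loss here, which makes the effect of the mask easy to see.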

Key configuration concerns:

  • seq_length → Maximum sequence length for truncation
  • batch_size → Training batch size (affects gradient noise)
  • learning_rate → Controls step size (typically 1e-5 to 1e-4)
  • num_layers_unfrozen → Number of layers left trainable, for partial freezing (-1 leaves all layers unfrozen)
  • gen_kwargs → Generation parameters for periodic evaluation
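The concerns listed above could be grouped as in the following sketch. `SFTSettings` is a hypothetical container written for this page, not the trlx configuration API, and every default value is illustrative only. The interpretation that a positive `num_layers_unfrozen` keeps the top-most layers trainable is an assumption; the source states only that -1 covers all layers.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SFTSettings:
    """Hypothetical grouping of the SFT knobs; not the trlx API."""

    seq_length: int = 1024            # tokens kept per sample after truncation
    batch_size: int = 8               # samples per optimizer step
    learning_rate: float = 1e-5       # within the typical 1e-5 to 1e-4 range
    num_layers_unfrozen: int = -1     # -1 means every layer is trainable
    # Illustrative generation parameters used only for periodic evaluation.
    gen_kwargs: Dict[str, Any] = field(
        default_factory=lambda: {"max_new_tokens": 40, "do_sample": True}
    )

    def truncate(self, token_ids: List[int]) -> List[int]:
        """Cut a tokenized sample down to at most seq_length tokens."""
        return token_ids[: self.seq_length]

    def trainable_layers(self, total_layers: int) -> List[int]:
        """Indices of layers that would receive gradient updates.

        Assumption: a positive value keeps the top-most layers trainable
        and freezes the rest; 0 freezes every layer.
        """
        n = self.num_layers_unfrozen
        if n < 0:
            return list(range(total_layers))
        if n == 0:
            return []
        return list(range(max(total_layers - n, 0), total_layers))


settings = SFTSettings(seq_length=4, num_layers_unfrozen=2)
kept_tokens = settings.truncate([11, 12, 13, 14, 15, 16])
unfrozen = settings.trainable_layers(total_layers=12)
```

Keeping `gen_kwargs` separate from the training fields reflects the point made in the description: generation parameters do not affect the loss and matter only when samples are produced for evaluation.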

Related Pages

Implemented By
