Workflow:LLMBook zh LLMBook zh github io Supervised Finetuning

From Leeroopedia


Knowledge Sources
Domains: LLMs, Fine_Tuning, Instruction_Tuning
Last Updated: 2026-02-08 04:30 GMT

Overview

End-to-end supervised fine-tuning (SFT) workflow for adapting a pre-trained language model to follow instructions, using template-formatted data with selective loss masking on response tokens only.

Description

This workflow covers the process of fine-tuning a pre-trained causal language model on instruction-following data. The key distinction from pre-training is that loss is computed only on the response portion of each training example, not on the instruction prompt. Training data is formatted using a standardized instruction template (instruction/input/output format), tokenized with proper boundary detection between prompt and response, and the prompt tokens are masked with an ignore index so they do not contribute to the training loss. This selective masking teaches the model to generate appropriate responses while conditioning on instructions, without wasting gradient signal on predicting instruction tokens.
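To make the selective loss concrete, here is a minimal pure-Python sketch of a mean cross-entropy that skips positions labeled with the ignore index. The function name `masked_cross_entropy` and the toy logits are illustrative assumptions rather than code from the workflow; an actual run would rely on the framework's loss instead of this loop.

```python
import math

IGNORE_INDEX = -100  # label value marking positions excluded from the loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: one list of vocabulary scores per position.
    labels: one target token id (or ignore_index) per position.
    """
    losses = []
    for scores, target in zip(logits, labels):
        if target == ignore_index:
            continue  # prompt or padding position: contributes nothing
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[target])  # -log softmax(scores)[target]
    if not losses:
        raise ValueError("every position is masked")
    return sum(losses) / len(losses)


# Two masked prompt positions followed by two response positions (vocabulary of 3).
logits = [[9.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 1, 2]
loss = masked_cross_entropy(logits, labels)
```

The sketch assumes logits and labels are already aligned position by position. Hugging Face causal language models apply the one-token shift internally when labels are passed, so the prepared labels can simply mirror the input IDs.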

Usage

Use this workflow when you have a pre-trained base language model and an instruction-following dataset (e.g., Alpaca-style JSON with instruction/input/output fields) and want the model to follow instructions and produce helpful responses. It is a standard approach for turning a base model into a chat-capable one.

Execution Steps

Step 1: Data Formatting

Transform raw instruction-following data into structured prompt-response pairs using a standardized template. Each example is formatted with an instruction header, optional input context, and a response section. Two template variants are used: one for examples with additional input context and one for instruction-only examples. The templates use a consistent structure with clear markers between the instruction and expected output.

Key considerations:

  • The template format uses "Instruction" and "Output" section markers
  • Examples with an "input" field use a context-aware template variant
  • The response field is stripped of leading/trailing whitespace for consistency
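The formatting step can be sketched as follows. The exact template wording (`### Instruction:`, `### Input:`, `### Output:`) and the names `PROMPT_WITH_INPUT`, `PROMPT_NO_INPUT`, and `format_example` are assumptions made for illustration; only the "Instruction" and "Output" markers and the two-variant design come from the description above.

```python
# Hypothetical template wording; the workflow's real strings may differ.
PROMPT_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Output:\n"
)
PROMPT_NO_INPUT = "### Instruction:\n{instruction}\n\n### Output:\n"


def format_example(example):
    """Return (prompt, response) for one Alpaca-style record."""
    # Treat a missing, None, or blank "input" field as "no input context".
    if (example.get("input") or "").strip():
        prompt = PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=example["instruction"])
    # Strip surrounding whitespace from the response for consistency.
    return prompt, example["output"].strip()


prompt, response = format_example(
    {"instruction": "Add the numbers.", "input": "1 2", "output": " 3 \n"}
)
```

Keeping the prompt and response as separate strings is what later allows the prompt length to be measured on its own.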

Step 2: Tokenization with Loss Masking

Tokenize each formatted example and create label masks that exclude prompt tokens from the loss computation. The prompt portion is tokenized separately to determine its length. The full prompt-plus-response sequence is tokenized with an end-of-sequence token appended. Labels are cloned from input IDs, then all token positions corresponding to the prompt are set to the ignore index (-100), ensuring that only response tokens contribute to the training loss.

Key considerations:

  • The EOS token is appended only to the full sequence; the prompt-only tokenization omits it so that the measured prompt length does not also cover the first response token
  • The IGNORE_INDEX value (-100) matches the default ignore_index of PyTorch's CrossEntropyLoss
  • Sequences exceeding the maximum length are truncated
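The masking logic above might look like the sketch below. `ToyTokenizer` is a hypothetical whitespace tokenizer used only so the example runs without external libraries, and `build_example` is an illustrative name; a real pipeline would call the model's own tokenizer and work with tensors.

```python
IGNORE_INDEX = -100
EOS_ID = 0  # hypothetical end-of-sequence id for the toy tokenizer


class ToyTokenizer:
    """Whitespace tokenizer that assigns ids on first sight (illustration only)."""

    def __init__(self):
        self.vocab = {"<eos>": EOS_ID}

    def encode(self, text):
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


def build_example(tokenizer, prompt, response, max_length):
    # Prompt-only pass, without EOS, just to measure the prompt length.
    prompt_len = len(tokenizer.encode(prompt))
    # Full pass: prompt + response + EOS, truncated to max_length.
    input_ids = (tokenizer.encode(prompt + response) + [EOS_ID])[:max_length]
    # Labels start as a copy of the inputs, then the prompt span is masked.
    labels = list(input_ids)
    masked = min(prompt_len, len(labels))
    labels[:masked] = [IGNORE_INDEX] * masked
    return {"input_ids": input_ids, "labels": labels}


tok = ToyTokenizer()
example = build_example(tok, "Q: hi A: ", "hello there", max_length=16)
```

With a real subword tokenizer, tokenizing the prompt alone is not guaranteed to yield an exact prefix of the full tokenization at the prompt/response boundary, so it is worth spot-checking that boundary on a few examples.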

Step 3: Data Collation

Batch multiple tokenized examples together using a custom data collator. Because examples vary in length, input IDs are padded to the longest sequence in the batch using the tokenizer's pad token. Labels are padded with the ignore index so that padding positions do not affect the loss. The collator returns a dictionary containing the padded input_ids and labels tensors.

Key considerations:

  • PyTorch's pad_sequence utility handles variable-length sequence padding
  • Padding tokens in labels use IGNORE_INDEX, not the pad token ID
  • Batch-first ordering is used, so padded tensors have shape (batch_size, sequence_length), as the model and the Trainer expect
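The padding behavior can be illustrated with plain lists. This sketch mimics what `pad_sequence` with `batch_first=True` does but deliberately avoids PyTorch so that it runs anywhere; `pad_to_longest` and `collate` are hypothetical names, and the real collator would return tensors rather than nested lists.

```python
IGNORE_INDEX = -100


def pad_to_longest(sequences, padding_value):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (width - len(seq)) for seq in sequences]


def collate(instances, pad_token_id):
    """Batch examples: inputs padded with the pad token, labels with IGNORE_INDEX."""
    input_ids = pad_to_longest([ex["input_ids"] for ex in instances], pad_token_id)
    labels = pad_to_longest([ex["labels"] for ex in instances], IGNORE_INDEX)
    return {"input_ids": input_ids, "labels": labels}


batch = collate(
    [
        {"input_ids": [1, 2, 3], "labels": [IGNORE_INDEX, 2, 3]},
        {"input_ids": [4], "labels": [4]},
    ],
    pad_token_id=0,
)
```

Using different padding values for the two fields matters: if labels were padded with the pad token ID, the model would be trained to predict padding.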

Step 4: Model Loading

Load the pre-trained base model and tokenizer from a checkpoint. The model is loaded with FlashAttention enabled for efficient training. The tokenizer is configured with right-side padding and no automatic EOS token addition (EOS is handled explicitly during data preparation).

Key considerations:

  • FlashAttention requires compatible GPU hardware (Ampere or newer for FlashAttention-2) and FP16 or BF16 computation
  • The tokenizer's max length is set to match the model's context window
  • Padding side is set to "right", the usual choice when training a causal language model; left padding is mainly needed for batched generation
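A loading step along these lines could be written with the Hugging Face `transformers` library as sketched below. The helper names, the `checkpoint` and `max_length` parameters, and the choice of `attn_implementation="flash_attention_2"` are assumptions for illustration; `add_eos_token` is honored by Llama-style tokenizers and may have no effect on others.

```python
def tokenizer_kwargs(max_length):
    """Tokenizer settings described above, as keyword arguments."""
    return {
        "model_max_length": max_length,  # match the model's context window
        "padding_side": "right",         # right padding for training
        "add_eos_token": False,          # EOS is appended during data preparation
    }


def load_model_and_tokenizer(checkpoint, max_length):
    """Load a causal LM and its tokenizer from a local path or hub id."""
    # Imported lazily so the sketch can be read and imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, **tokenizer_kwargs(max_length))
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        attn_implementation="flash_attention_2",  # assumes flash-attn is installed
    )
    return model, tokenizer


settings = tokenizer_kwargs(2048)
```

If the tokenizer defines no pad token, one has to be assigned before collation; how the workflow handles that case is not stated here.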

Step 5: SFT Training

Initialize the HuggingFace Trainer with the model, dataset, data collator, and training arguments, then execute the training loop. The Trainer manages gradient computation, optimizer steps, learning rate scheduling, and checkpointing. BF16 mixed precision is enabled for memory efficiency. After training completes, the final model and trainer state are saved.

Key considerations:

  • The data collator is passed explicitly to handle the custom padding logic
  • Only model parameters are saved (not optimizer state) to reduce checkpoint size
  • The training uses the standard cross-entropy loss with ignore_index=-100 masking
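The training step might be wired up roughly as follows. All hyperparameter values and the names `training_kwargs` and `run_sft` are hypothetical, and using `save_only_model=True` is one assumed way to leave optimizer state out of checkpoints (it requires a reasonably recent `transformers` release); the workflow's actual arguments are not given here.

```python
def training_kwargs(output_dir):
    """Keyword arguments for transformers.TrainingArguments (illustrative values)."""
    return {
        "output_dir": output_dir,
        "bf16": True,                      # BF16 mixed precision
        "save_only_model": True,           # omit optimizer/scheduler state
        "num_train_epochs": 3,             # hypothetical
        "per_device_train_batch_size": 4,  # hypothetical
        "learning_rate": 2e-5,             # hypothetical
    }


def run_sft(model, train_dataset, data_collator, output_dir):
    """Train with the Hugging Face Trainer, then save the model and trainer state."""
    # Imported lazily so the sketch can be read and imported without transformers.
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(**training_kwargs(output_dir))
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=data_collator,  # custom padding logic from the collation step
    )
    trainer.train()
    trainer.save_model(output_dir)  # final model weights
    trainer.save_state()            # trainer_state.json (logs, step counters)
    return trainer


kwargs = training_kwargs("sft-output")
```

No custom loss is needed: when labels are supplied, the model computes cross-entropy itself and positions labeled -100 are skipped.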
