Principle: Allenai open-instruct SFT Data Processing
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
SFT data processing is the technique of applying chat templates to instruction-following conversations and masking non-assistant tokens so that the training loss is computed only on the model's target outputs.
Description
Supervised fine-tuning trains a language model to generate helpful, accurate responses given an instruction. The data processing step bridges the gap between raw conversation data (lists of user/assistant message pairs) and the tokenized tensors needed for training. Two critical transformations occur:
Chat template application: Each conversation is formatted using a chat template that inserts role-specific markers (e.g., <|user|>, <|assistant|>) and special tokens. The template is applied twice: once to all messages (to get the full input_ids) and once to only the prompt messages (excluding the final assistant response) to determine the prompt boundary.
Label masking: The labels tensor is a copy of input_ids, but with all tokens belonging to the user prompt masked to -100 (the PyTorch ignore index). This ensures the cross-entropy loss is computed only on assistant-generated tokens, preventing the model from being penalized for "generating" the user's input. When train_only_on_prompt is False (the default for SFT), the prompt tokens up to (but not including) the final assistant turn are masked.
Filtering: After tokenization, examples that exceed maximum length thresholds or contain no valid labels are filtered out. This prevents wasted computation on examples that would be truncated beyond usefulness or contribute zero gradient.
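The three transformations above can be sketched as a single encoding function. This is a simplified sketch, not the open-instruct implementation: `apply_chat_template` stands in for the Hugging Face `tokenizer.apply_chat_template` method and is assumed to return a list of token ids.

```python
def encode_sft_example(messages, apply_chat_template, max_seq_length):
    """Tokenize one conversation and build prompt-masked labels.

    `apply_chat_template(messages, add_generation_prompt=...)` is a
    stand-in for the Hugging Face tokenizer method, assumed to return
    a list of token ids. Returns None for examples to filter out.
    """
    # Template applied twice: once to the full conversation, once to the
    # prompt only (everything before the final assistant response).
    input_ids = apply_chat_template(messages, add_generation_prompt=False)
    prompt_ids = apply_chat_template(messages[:-1], add_generation_prompt=True)

    # Labels start as a copy of input_ids; prompt positions become -100.
    labels = list(input_ids)
    for t in range(min(len(prompt_ids), len(labels))):
        labels[t] = -100

    # Filtering: drop over-long examples and examples with no valid labels.
    if len(input_ids) > max_seq_length:
        return None
    if all(label == -100 for label in labels):
        return None
    return {"input_ids": input_ids, "labels": labels}
```

A dataset map step would call this per example and drop the `None` results before batching.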
Usage
Use SFT data processing when preparing instruction-following datasets for supervised fine-tuning. This is applicable to any chat-format dataset where the model should learn to produce assistant responses given user prompts.
Theoretical Basis
The SFT loss function with label masking is:
L_SFT = - (1 / |A|) * sum_{t in A} log P(x_t | x_{<t})
Where:
- A is the set of token positions belonging to assistant responses
- x_t is the token at position t
- P(x_t | x_{<t}) is the model's predicted probability of token x_t given all preceding tokens
- |A| is the number of assistant tokens
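As a worked instance of this formula (with made-up probabilities, not model outputs):

```python
import math

# Worked example of L_SFT: 4 token positions, of which the last two
# belong to the assistant response, so A = {2, 3}.
# probs[t] stands for P(x_t | x_{<t}) evaluated at the target token.
probs = [0.9, 0.8, 0.5, 0.25]
A = {2, 3}

# Average negative log-likelihood over assistant positions only; the
# prompt positions (0 and 1) never enter the sum.
loss = -sum(math.log(probs[t]) for t in A) / len(A)
# loss == -(log 0.5 + log 0.25) / 2 ≈ 1.0397
```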
The masking is achieved by setting labels to -100 for non-assistant positions:
labels[t] = input_ids[t] if t in A (assistant tokens)
labels[t] = -100 if t not in A (prompt/system tokens)
PyTorch's CrossEntropyLoss automatically ignores positions where the label is -100.
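A quick check of this behavior (assuming PyTorch is installed): the loss over a labels tensor containing -100 entries equals the loss computed over only the unmasked positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)                    # 5 positions, vocab size 10
labels = torch.tensor([-100, -100, 3, 7, 2])   # first two positions masked

# F.cross_entropy defaults to ignore_index=-100 and averages over the
# remaining positions, so masked positions add no loss and no gradient.
loss_masked = F.cross_entropy(logits, labels)
loss_assistant_only = F.cross_entropy(logits[2:], labels[2:])
assert torch.allclose(loss_masked, loss_assistant_only)
```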
Prompt boundary detection: The boundary between prompt and response is found by tokenizing the conversation without the final assistant message (using add_generation_prompt=True). The length of this prompt tokenization gives the index at which assistant tokens begin:
prompt_ids = apply_chat_template(messages[:-1], add_generation_prompt=True)
full_ids = apply_chat_template(messages)
labels = full_ids.copy()
labels[:len(prompt_ids)] = [-100] * len(prompt_ids)  # mask the prompt
# positions from len(prompt_ids) onward keep their ids (assistant tokens)
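To make the boundary concrete, here is a toy string-level stand-in for the chat template, using the <|user|>/<|assistant|> markers from above (a real tokenizer returns token ids, not characters). Note that this scheme relies on the prompt rendering being an exact prefix of the full rendering, which must hold for the chat template in use.

```python
def toy_apply_chat_template(messages, add_generation_prompt=False):
    """Toy stand-in for tokenizer.apply_chat_template that renders a string.

    Each message becomes "<|role|>content\n"; the generation prompt is the
    opening "<|assistant|>" marker that precedes the assistant's tokens.
    """
    text = ""
    for m in messages:
        text += f"<|{m['role']}|>{m['content']}\n"
    if add_generation_prompt:
        text += "<|assistant|>"
    return text

messages = [
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "4"},
]
full = toy_apply_chat_template(messages)
prompt = toy_apply_chat_template(messages[:-1], add_generation_prompt=True)

# The prompt rendering is a prefix of the full rendering, so its length
# marks where the assistant's tokens (here, characters) begin.
assert full.startswith(prompt)
assistant_part = full[len(prompt):]
# assistant_part == "4\n"
```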