Principle: Allenai open-instruct SFT Data Processing
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
SFT data processing is the technique of applying chat templates to instruction-following conversations and masking non-assistant tokens so that the training loss is computed only on the model's target outputs.
Description
Supervised fine-tuning trains a language model to generate helpful, accurate responses given an instruction. The data processing step bridges the gap between raw conversation data (lists of user/assistant message pairs) and the tokenized tensors needed for training. Two critical transformations occur:
Chat template application: Each conversation is formatted using a chat template that inserts role-specific markers (e.g., <|user|>, <|assistant|>) and special tokens. The template is applied twice: once to all messages (to get the full input_ids) and once to only the prompt messages (excluding the final assistant response) to determine the prompt boundary.
Label masking: The labels tensor is a copy of input_ids, but with all tokens belonging to the user prompt masked to -100 (the PyTorch ignore index). This ensures the cross-entropy loss is computed only on assistant-generated tokens, preventing the model from being penalized for "generating" the user's input. When train_only_on_prompt is False (the default for SFT), the prompt tokens up to (but not including) the final assistant turn are masked.
Filtering: After tokenization, examples that exceed maximum length thresholds or contain no valid labels are filtered out. This prevents wasted computation on examples that would be truncated beyond usefulness or contribute zero gradient.
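The three transformations above can be sketched as a single encoding function. This is a simplified sketch, not the open-instruct implementation: `apply_chat_template` stands in for the Hugging Face `tokenizer.apply_chat_template` method and is assumed to return a list of token ids.

```python
def encode_sft_example(messages, apply_chat_template, max_seq_length):
    """Tokenize one conversation and build prompt-masked labels.

    `apply_chat_template(messages, add_generation_prompt=...)` is a
    stand-in for the Hugging Face tokenizer method, assumed to return
    a list of token ids. Returns None for examples to filter out.
    """
    # Template applied twice: once to the full conversation, once to the
    # prompt only (everything before the final assistant response).
    input_ids = apply_chat_template(messages, add_generation_prompt=False)
    prompt_ids = apply_chat_template(messages[:-1], add_generation_prompt=True)

    # Labels start as a copy of input_ids; prompt positions become -100.
    labels = list(input_ids)
    for t in range(min(len(prompt_ids), len(labels))):
        labels[t] = -100

    # Filtering: drop over-long examples and examples with no valid labels.
    if len(input_ids) > max_seq_length:
        return None
    if all(label == -100 for label in labels):
        return None
    return {"input_ids": input_ids, "labels": labels}
```

A dataset map step would call this per example and drop the `None` results before batching.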
Usage
Use SFT data processing when preparing instruction-following datasets for supervised fine-tuning. This is applicable to any chat-format dataset where the model should learn to produce assistant responses given user prompts.
Theoretical Basis
The SFT loss function with label masking is:
L_SFT = - (1 / |A|) * sum_{t in A} log P(x_t | x_{<t})
Where:
- A is the set of token positions belonging to assistant responses
- x_t is the token at position t
- P(x_t | x_{<t}) is the model's predicted probability of token x_t given all preceding tokens
- |A| is the number of assistant tokens
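As a worked instance of this formula (with made-up probabilities, not model outputs):

```python
import math

# Worked example of L_SFT: 4 token positions, of which the last two
# belong to the assistant response, so A = {2, 3}.
# probs[t] stands for P(x_t | x_{<t}) evaluated at the target token.
probs = [0.9, 0.8, 0.5, 0.25]
A = {2, 3}

# Average negative log-likelihood over assistant positions only; the
# prompt positions (0 and 1) never enter the sum.
loss = -sum(math.log(probs[t]) for t in A) / len(A)
# loss == -(log 0.5 + log 0.25) / 2 ≈ 1.0397
```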
The masking is achieved by setting labels to -100 for non-assistant positions:
labels[t] = input_ids[t] if t in A (assistant tokens)
labels[t] = -100 if t not in A (prompt/system tokens)
PyTorch's CrossEntropyLoss automatically ignores positions where the label is -100.
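A quick check of this behavior (assuming PyTorch is installed): the loss over a labels tensor containing -100 entries equals the loss computed over only the unmasked positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)                    # 5 positions, vocab size 10
labels = torch.tensor([-100, -100, 3, 7, 2])   # first two positions masked

# F.cross_entropy defaults to ignore_index=-100 and averages over the
# remaining positions, so masked positions add no loss and no gradient.
loss_masked = F.cross_entropy(logits, labels)
loss_assistant_only = F.cross_entropy(logits[2:], labels[2:])
assert torch.allclose(loss_masked, loss_assistant_only)
```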
Prompt boundary detection: The boundary between prompt and response is found by tokenizing the conversation without the final assistant message (using add_generation_prompt=True). The length of this prompt tokenization gives the index at which assistant tokens begin:
prompt_ids = apply_chat_template(messages[:-1], add_generation_prompt=True)
full_ids = apply_chat_template(messages)
labels = full_ids.copy()
labels[:len(prompt_ids)] = [-100] * len(prompt_ids)  # mask the prompt
# positions from len(prompt_ids) onward keep their ids (assistant tokens)
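To make the boundary concrete, here is a toy string-level stand-in for the chat template, using the <|user|>/<|assistant|> markers from above (a real tokenizer returns token ids, not characters). Note that this scheme relies on the prompt rendering being an exact prefix of the full rendering, which must hold for the chat template in use.

```python
def toy_apply_chat_template(messages, add_generation_prompt=False):
    """Toy stand-in for tokenizer.apply_chat_template that renders a string.

    Each message becomes "<|role|>content\n"; the generation prompt is the
    opening "<|assistant|>" marker that precedes the assistant's tokens.
    """
    text = ""
    for m in messages:
        text += f"<|{m['role']}|>{m['content']}\n"
    if add_generation_prompt:
        text += "<|assistant|>"
    return text

messages = [
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "4"},
]
full = toy_apply_chat_template(messages)
prompt = toy_apply_chat_template(messages[:-1], add_generation_prompt=True)

# The prompt rendering is a prefix of the full rendering, so its length
# marks where the assistant's tokens (here, characters) begin.
assert full.startswith(prompt)
assistant_part = full[len(prompt):]
# assistant_part == "4\n"
```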