Principle:Alibaba ROLL SFT Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Supervised_Learning |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A data preprocessing principle for converting instruction-response datasets into label-masked, shifted sequences for causal language model fine-tuning.
Description
SFT Dataset Preparation tokenizes instruction-response pairs using the model's chat template, masks prompt tokens with IGNORE_INDEX (-100) so they are excluded from the loss, and shifts labels left by one position so that each position predicts the following token. The DataCollatorForSFT handles padding and label shifting during batching.
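A minimal sketch of the masking step, assuming the prompt and response have already been tokenized into ID lists. The helper name `mask_prompt_labels` is illustrative and not part of ROLL's API:

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores positions with this label

def mask_prompt_labels(prompt_ids, response_ids):
    """Concatenate prompt and response tokens; mask prompt positions
    so only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy example: 3 prompt tokens, 2 response tokens.
input_ids, labels = mask_prompt_labels([10, 11, 12], [20, 21])
print(labels)  # [-100, -100, -100, 20, 21]
```

Only the last two positions carry real labels, so the loss reflects the response alone.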
Usage
Use when preparing data for supervised fine-tuning of causal language models.
Theoretical Basis
Label masking ensures only response tokens contribute to the loss:
- Prompt tokens: label = -100 (ignored)
- Response tokens: label = next token ID (standard causal LM objective)
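The left shift described above can be sketched as follows; `shift_labels` is a hypothetical helper (ROLL performs this step inside DataCollatorForSFT), and the input is the masked label list from the preparation step:

```python
IGNORE_INDEX = -100

def shift_labels(labels):
    """Shift labels left by one so position t is supervised to predict
    token t+1; the final position has no target and is masked."""
    return labels[1:] + [IGNORE_INDEX]

# Masked labels for input [10, 11, 12, 20, 21] with a 3-token prompt:
labels = [-100, -100, -100, 20, 21]
print(shift_labels(labels))  # [-100, -100, 20, 21, -100]
```

After shifting, the last prompt position predicts the first response token, matching the standard causal LM objective.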
Related Pages
Implemented By
Related Heuristics
No specific heuristics inform this principle.