Principle:Alibaba ROLL SFT Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Supervised_Learning |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A data preprocessing principle for converting instruction-response datasets into label-masked, shifted sequences for causal language model fine-tuning.
Description
SFT Dataset Preparation tokenizes instruction-response pairs using the model's chat template, masks prompt tokens with IGNORE_INDEX (-100) so they are excluded from the loss, and shifts labels left by one position so that each position predicts the following token. The DataCollatorForSFT handles padding and label shifting during batching.
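A minimal sketch of the masking step, assuming the prompt and response have already been tokenized into ID lists. The helper name `mask_prompt_labels` is illustrative and not part of ROLL's API:

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores positions with this label

def mask_prompt_labels(prompt_ids, response_ids):
    """Concatenate prompt and response tokens; mask prompt positions
    so only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy example: 3 prompt tokens, 2 response tokens.
input_ids, labels = mask_prompt_labels([10, 11, 12], [20, 21])
print(labels)  # [-100, -100, -100, 20, 21]
```

Only the last two positions carry real labels, so the loss reflects the response alone.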
Usage
Use when preparing data for supervised fine-tuning of causal language models.
Theoretical Basis
Label masking ensures only response tokens contribute to the loss:
- Prompt tokens: label = -100 (ignored)
- Response tokens: label = next token ID (standard causal LM objective)
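The left shift described above can be sketched as follows; `shift_labels` is a hypothetical helper (ROLL performs this step inside DataCollatorForSFT), and the input is the masked label list from the preparation step:

```python
IGNORE_INDEX = -100

def shift_labels(labels):
    """Shift labels left by one so position t is supervised to predict
    token t+1; the final position has no target and is masked."""
    return labels[1:] + [IGNORE_INDEX]

# Masked labels for input [10, 11, 12, 20, 21] with a 3-token prompt:
labels = [-100, -100, -100, 20, 21]
print(shift_labels(labels))  # [-100, -100, 20, 21, -100]
```

After shifting, the last prompt position predicts the first response token, matching the standard causal LM objective.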
Related Pages
Implemented By
Related Heuristics
No specific heuristics inform this principle.