Principle:Alibaba ROLL Distillation Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Knowledge_Distillation |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A data-preprocessing principle for preparing instruction-response data for knowledge distillation, with an optional mode that includes prompt tokens in the distillation loss.
Description
Distillation Dataset Preparation follows the SFT data pipeline with one additional option: distill_on_prompt. When enabled, prompt tokens are included in the distillation loss computation rather than masked out. This can improve distillation quality by teaching the student to match the teacher's token-level output distributions on the prompt as well as on the response.
Usage
Use when preparing data for knowledge distillation training.
Theoretical Basis
When distill_on_prompt is True, every token (prompt and response) contributes to the distillation loss. When False, only response tokens contribute; prompt positions are masked with the label -100 so they are excluded from the loss.
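The masking rule above can be sketched in PyTorch. This is a minimal illustration, not ROLL's actual implementation: the helper names (build_distill_labels, distill_loss) and the KL-divergence formulation are assumptions; only the distill_on_prompt flag and the -100 ignore index come from the source.

```python
import torch
import torch.nn.functional as F

# Standard PyTorch/Hugging Face convention: positions labeled -100
# are excluded from loss computation.
IGNORE_INDEX = -100

def build_distill_labels(input_ids: torch.Tensor,
                         prompt_len: int,
                         distill_on_prompt: bool) -> torch.Tensor:
    """Hypothetical helper: copy input_ids and mask the prompt span
    with IGNORE_INDEX unless distill_on_prompt is enabled."""
    labels = input_ids.clone()
    if not distill_on_prompt:
        labels[:prompt_len] = IGNORE_INDEX  # prompt excluded from loss
    return labels

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 labels: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Token-level KL(teacher || student) over positions not masked
    with IGNORE_INDEX (one assumed choice of distillation objective)."""
    keep = labels != IGNORE_INDEX
    s = F.log_softmax(student_logits[keep] / temperature, dim=-1)
    t = F.softmax(teacher_logits[keep] / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```

With distill_on_prompt=False only the response positions survive the `keep` mask, so the student is trained to match the teacher on responses alone; with True, the prompt positions are matched as well.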
Related Pages
Implemented By
Related Heuristics
No specific heuristics inform this principle.