Principle:Alibaba ROLL Distillation Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Knowledge_Distillation |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A data-preprocessing principle for preparing instruction-response data for knowledge distillation, with an optional mode that includes prompt tokens in the distillation loss.
Description
Distillation Dataset Preparation follows the SFT data pipeline with one additional option: distill_on_prompt. When enabled, prompt tokens are included in the distillation loss computation rather than masked out. This can improve distillation quality by teaching the student to match the teacher's token-level output distributions on the prompt as well as on the response.
Usage
Use when preparing data for knowledge distillation training.
Theoretical Basis
When distill_on_prompt is True, every token (prompt and response) contributes to the distillation loss. When False, only response tokens contribute; prompt positions are masked with the label -100 so they are excluded from the loss.
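The masking rule above can be sketched in PyTorch. This is a minimal illustration, not ROLL's actual implementation: the helper names (build_distill_labels, distill_loss) and the KL-divergence formulation are assumptions; only the distill_on_prompt flag and the -100 ignore index come from the source.

```python
import torch
import torch.nn.functional as F

# Standard PyTorch/Hugging Face convention: positions labeled -100
# are excluded from loss computation.
IGNORE_INDEX = -100

def build_distill_labels(input_ids: torch.Tensor,
                         prompt_len: int,
                         distill_on_prompt: bool) -> torch.Tensor:
    """Hypothetical helper: copy input_ids and mask the prompt span
    with IGNORE_INDEX unless distill_on_prompt is enabled."""
    labels = input_ids.clone()
    if not distill_on_prompt:
        labels[:prompt_len] = IGNORE_INDEX  # prompt excluded from loss
    return labels

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 labels: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Token-level KL(teacher || student) over positions not masked
    with IGNORE_INDEX (one assumed choice of distillation objective)."""
    keep = labels != IGNORE_INDEX
    s = F.log_softmax(student_logits[keep] / temperature, dim=-1)
    t = F.softmax(teacher_logits[keep] / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```

With distill_on_prompt=False only the response positions survive the `keep` mask, so the student is trained to match the teacher on responses alone; with True, the prompt positions are matched as well.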
Related Pages
Implemented By
Related Heuristics
No specific heuristics inform this principle.