# Principle: Alibaba ROLL Knowledge Distillation Loss
| Knowledge Sources | |
|---|---|
| Domains | Knowledge_Distillation, Optimization |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A loss-computation principle that supports six KL-based divergence objectives for knowledge distillation between a teacher LLM and a student LLM.
## Description
Knowledge Distillation Loss measures the divergence between the teacher's and the student's output distributions. Six objectives are supported:

- Forward KL: standard KL(teacher || student); mode-covering
- Reverse KL: KL(student || teacher); mode-seeking (as in MiniLLM)
- Adaptive KL: weighted combination of forward and reverse KL
- Skewed Forward KL: forward KL against a mixture skewed toward the teacher
- Skewed Reverse KL: reverse KL against a mixture skewed toward the student
- Jensen-Shannon: symmetric divergence built from two KL terms against the mean distribution

The total loss blends the SFT loss with the distillation loss: L = (1 - w) * L_SFT + w * L_KD, where w is the distillation weight.
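The six objectives and the blended total loss can be sketched in plain Python over discrete probability vectors. This is an illustrative sketch, not ROLL's actual implementation; the function names and the skew/adaptive weighting conventions (`alpha`, `lam`) are assumptions.

```python
import math

def forward_kl(p, q):
    # KL(p || q): p = teacher, q = student; mode-covering
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(p, q):
    # KL(q || p): mode-seeking (the MiniLLM objective)
    return forward_kl(q, p)

def adaptive_kl(p, q, alpha=0.5):
    # Weighted combination of forward and reverse KL (alpha is an assumed knob)
    return alpha * forward_kl(p, q) + (1 - alpha) * reverse_kl(p, q)

def skewed_forward_kl(p, q, lam=0.1):
    # Forward KL against a mixture skewed toward the teacher
    mix = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
    return forward_kl(p, mix)

def skewed_reverse_kl(p, q, lam=0.1):
    # Reverse KL against a mixture skewed toward the student
    mix = [lam * qi + (1 - lam) * pi for pi, qi in zip(p, q)]
    return forward_kl(q, mix)

def js_divergence(p, q):
    # Symmetric: two KL terms against the mean distribution m
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * forward_kl(p, m) + 0.5 * forward_kl(q, m)

def total_loss(l_sft, l_kd, w):
    # L = (1 - w) * L_SFT + w * L_KD
    return (1 - w) * l_sft + w * l_kd
```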
## Usage
Apply during the student's training step in knowledge distillation pipelines, once teacher and student logits are available for the same batch.
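A single-token training-step sketch, assuming forward KL as the distillation term. The function `distillation_step` and its signature are hypothetical; a real pipeline would operate on logit tensors per token and backpropagate through the student only.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_step(teacher_logits, student_logits, target_idx, w=0.5):
    # Hypothetical single-token step blending SFT cross-entropy with forward KL
    p = softmax(teacher_logits)   # teacher distribution
    q = softmax(student_logits)   # student distribution
    l_sft = -math.log(q[target_idx])  # cross-entropy on the gold token
    l_kd = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (1 - w) * l_sft + w * l_kd  # L = (1 - w) * L_SFT + w * L_KD
```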
## Theoretical Basis

For teacher distribution p and student distribution q over the vocabulary:

- Forward KL: KL(p || q) = Σ_i p_i log(p_i / q_i)
- Reverse KL: KL(q || p) = Σ_i q_i log(q_i / p_i)
- Jensen-Shannon: JS(p, q) = ½ KL(p || m) + ½ KL(q || m), where m = (p + q) / 2
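A small numeric check of the key properties: forward and reverse KL disagree for a peaked versus a flat distribution, while Jensen-Shannon is symmetric and bounded by log 2. The example distributions are illustrative only.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions, skipping zero-probability terms
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.05, 0.05]          # peaked "teacher"
q = [1/3, 1/3, 1/3]            # flat "student"

fkl = kl(p, q)                 # forward KL (mode-covering)
rkl = kl(q, p)                 # reverse KL (mode-seeking)
m = [(a + b) / 2 for a, b in zip(p, q)]
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)   # Jensen-Shannon
```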