# Principle: Alibaba ROLL Knowledge Distillation Loss
| Knowledge Sources | |
|---|---|
| Domains | Knowledge_Distillation, Optimization |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A loss-computation principle that supports six KL-based divergence objectives for knowledge distillation between a teacher LLM and a student LLM.
## Description
Knowledge Distillation Loss measures the divergence between the teacher's and the student's output distributions. Six objectives are supported:

- Forward KL: standard KL(teacher || student); mode-covering
- Reverse KL: KL(student || teacher); mode-seeking (as in MiniLLM)
- Adaptive KL: weighted combination of forward and reverse KL
- Skewed Forward KL: forward KL against a mixture skewed toward the teacher
- Skewed Reverse KL: reverse KL against a mixture skewed toward the student
- Jensen-Shannon: symmetric divergence built from two KL terms against the mean distribution

The total loss blends the SFT loss with the distillation loss: L = (1 - w) * L_SFT + w * L_KD, where w is the distillation weight.
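The six objectives and the blended total loss can be sketched in plain Python over discrete probability vectors. This is an illustrative sketch, not ROLL's actual implementation; the function names and the skew/adaptive weighting conventions (`alpha`, `lam`) are assumptions.

```python
import math

def forward_kl(p, q):
    # KL(p || q): p = teacher, q = student; mode-covering
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(p, q):
    # KL(q || p): mode-seeking (the MiniLLM objective)
    return forward_kl(q, p)

def adaptive_kl(p, q, alpha=0.5):
    # Weighted combination of forward and reverse KL (alpha is an assumed knob)
    return alpha * forward_kl(p, q) + (1 - alpha) * reverse_kl(p, q)

def skewed_forward_kl(p, q, lam=0.1):
    # Forward KL against a mixture skewed toward the teacher
    mix = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
    return forward_kl(p, mix)

def skewed_reverse_kl(p, q, lam=0.1):
    # Reverse KL against a mixture skewed toward the student
    mix = [lam * qi + (1 - lam) * pi for pi, qi in zip(p, q)]
    return forward_kl(q, mix)

def js_divergence(p, q):
    # Symmetric: two KL terms against the mean distribution m
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * forward_kl(p, m) + 0.5 * forward_kl(q, m)

def total_loss(l_sft, l_kd, w):
    # L = (1 - w) * L_SFT + w * L_KD
    return (1 - w) * l_sft + w * l_kd
```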
## Usage
Apply during the student's training step in knowledge distillation pipelines, once teacher and student logits are available for the same batch.
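A single-token training-step sketch, assuming forward KL as the distillation term. The function `distillation_step` and its signature are hypothetical; a real pipeline would operate on logit tensors per token and backpropagate through the student only.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_step(teacher_logits, student_logits, target_idx, w=0.5):
    # Hypothetical single-token step blending SFT cross-entropy with forward KL
    p = softmax(teacher_logits)   # teacher distribution
    q = softmax(student_logits)   # student distribution
    l_sft = -math.log(q[target_idx])  # cross-entropy on the gold token
    l_kd = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (1 - w) * l_sft + w * l_kd  # L = (1 - w) * L_SFT + w * L_KD
```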
## Theoretical Basis

For teacher distribution p and student distribution q over the vocabulary:

- Forward KL: KL(p || q) = Σ_i p_i log(p_i / q_i)
- Reverse KL: KL(q || p) = Σ_i q_i log(q_i / p_i)
- Jensen-Shannon: JS(p, q) = ½ KL(p || m) + ½ KL(q || m), where m = (p + q) / 2
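A small numeric check of the key properties: forward and reverse KL disagree for a peaked versus a flat distribution, while Jensen-Shannon is symmetric and bounded by log 2. The example distributions are illustrative only.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions, skipping zero-probability terms
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.05, 0.05]          # peaked "teacher"
q = [1/3, 1/3, 1/3]            # flat "student"

fkl = kl(p, q)                 # forward KL (mode-covering)
rkl = kl(q, p)                 # reverse KL (mode-seeking)
m = [(a + b) / 2 for a, b in zip(p, q)]
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)   # Jensen-Shannon
```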