
Principle:Alibaba ROLL Knowledge Distillation Loss

From Leeroopedia


Knowledge Sources
Domains: Knowledge_Distillation, Optimization
Last Updated: 2026-02-07 20:00 GMT

Overview

A loss computation principle implementing six KL divergence objectives for knowledge distillation between teacher and student LLMs.

Description

Knowledge Distillation Loss computes the divergence between the teacher's and the student's output distributions. Six objectives are supported:

  • Forward KL: KL(teacher || student); mode-covering, the standard distillation objective
  • Reverse KL: KL(student || teacher); mode-seeking, as used in MiniLLM
  • Adaptive KL: weighted combination of forward and reverse KL
  • Skewed Forward KL: forward KL against a teacher-student mixture skewed toward the teacher
  • Skewed Reverse KL: reverse KL against a teacher-student mixture skewed toward the student
  • Jensen-Shannon: symmetric divergence between both distributions and their mean
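The six objectives can be sketched in pure Python over per-token probability vectors. This is an illustrative sketch, not ROLL's code: the function names, the `eps` smoothing, and the fixed weight in `adaptive_kl` are assumptions; the actual implementation operates on PyTorch logits and may use a different adaptive weighting scheme.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) over discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def forward_kl(p_t, p_s):
    """KL(teacher || student): mode-covering."""
    return kl(p_t, p_s)

def reverse_kl(p_t, p_s):
    """KL(student || teacher): mode-seeking (MiniLLM-style)."""
    return kl(p_s, p_t)

def adaptive_kl(p_t, p_s, mix_w=0.5):
    """Weighted blend of forward and reverse KL (fixed weight here for illustration)."""
    return mix_w * forward_kl(p_t, p_s) + (1 - mix_w) * reverse_kl(p_t, p_s)

def skewed_forward_kl(p_t, p_s, alpha=0.1):
    """KL(p_T || alpha*p_T + (1-alpha)*p_S): target mixture skewed toward the teacher."""
    m = [alpha * t + (1 - alpha) * s for t, s in zip(p_t, p_s)]
    return kl(p_t, m)

def skewed_reverse_kl(p_t, p_s, alpha=0.1):
    """KL(p_S || alpha*p_S + (1-alpha)*p_T): target mixture skewed toward the student."""
    m = [alpha * s + (1 - alpha) * t for t, s in zip(p_t, p_s)]
    return kl(p_s, m)

def js_divergence(p_t, p_s):
    """Symmetric: 0.5*KL(p_T || M) + 0.5*KL(p_S || M), with M the mean distribution."""
    m = [(t + s) / 2 for t, s in zip(p_t, p_s)]
    return 0.5 * kl(p_t, m) + 0.5 * kl(p_s, m)
```

Note that all six reduce to zero when teacher and student agree, and that only the Jensen-Shannon variant is symmetric in its arguments.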

The total loss blends the SFT loss with the distillation loss: L = (1 - w) * L_SFT + w * L_KD, where w is the distillation weight.
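The blend is a convex combination, sketched below (`total_loss` is a hypothetical helper name, not ROLL's API): w = 0 recovers pure supervised fine-tuning, w = 1 pure distillation.

```python
def total_loss(sft_loss, kd_loss, w=0.5):
    """Convex blend of SFT and distillation losses: L = (1 - w) * L_SFT + w * L_KD."""
    return (1 - w) * sft_loss + w * kd_loss
```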

Usage

Apply during the student's training step in knowledge distillation pipelines.

Theoretical Basis

Forward KL

D_{KL}(p_T \,\|\, p_S) = \sum_i p_T(i) \log \frac{p_T(i)}{p_S(i)}

Reverse KL

D_{KL}(p_S \,\|\, p_T) = \sum_i p_S(i) \log \frac{p_S(i)}{p_T(i)}

Jensen-Shannon

D_{JS} = \tfrac{1}{2} D_{KL}(p_T \,\|\, M) + \tfrac{1}{2} D_{KL}(p_S \,\|\, M), \qquad M = \frac{p_T + p_S}{2}
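A quick numeric sanity check of the Jensen-Shannon definition on toy distributions (illustrative values): it is symmetric in its two arguments and, with natural logarithms, bounded above by log 2.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) over discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy teacher/student distributions over a 3-token vocabulary
p_t, p_s = [0.7, 0.2, 0.1], [0.4, 0.4, 0.2]
m = [(a + b) / 2 for a, b in zip(p_t, p_s)]

d_js = 0.5 * kl(p_t, m) + 0.5 * kl(p_s, m)
d_js_swapped = 0.5 * kl(p_s, m) + 0.5 * kl(p_t, m)

assert abs(d_js - d_js_swapped) < 1e-12   # symmetric
assert 0 <= d_js <= math.log(2)           # bounded by log 2 (natural log)
```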

Related Pages

Implemented By

Related Heuristics

