Principle: Alibaba ROLL Knowledge Distillation Configuration
| Knowledge Sources | |
|---|---|
| Domains | Knowledge_Distillation, Configuration |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A configuration principle for setting up knowledge distillation from a large teacher LLM to a smaller student LLM with configurable KL divergence objectives.
Description
Knowledge Distillation Configuration manages the parameters for teacher-student training, including:
- KD objective selection: Six KL divergence variants (forward_kl, reverse_kl, adaptive_kl, skewed_forward_kl, skewed_reverse_kl, JS divergence)
- Temperature parameters: Separate temperatures for teacher and student softmax distributions
- Loss blending: Configurable weight between SFT loss and distillation loss
- Logits transfer backend: Three communication modes for cross-cluster logit transfer (IPC+NCCL, NCCL-only, Ray)
- Top-k logits: Number of teacher logits to transfer for memory efficiency
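As a sketch, the parameters above could be grouped into a single config object. The field names and defaults below are illustrative, not ROLL's actual schema:

```python
from dataclasses import dataclass

# Illustrative option sets mirroring the bullets above (names are assumptions)
KD_OBJECTIVES = {
    "forward_kl", "reverse_kl", "adaptive_kl",
    "skewed_forward_kl", "skewed_reverse_kl", "js_divergence",
}
LOGITS_BACKENDS = {"ipc_nccl", "nccl", "ray"}

@dataclass
class DistillConfig:
    kd_objective: str = "forward_kl"   # one of the six KL divergence variants
    teacher_temperature: float = 1.0   # softens the teacher softmax
    student_temperature: float = 1.0   # softens the student softmax
    kd_loss_weight: float = 0.5        # blend: (1 - w) * SFT loss + w * KD loss
    logits_backend: str = "ipc_nccl"   # cross-cluster logit transfer mode
    top_k_logits: int = 64             # teacher logits transferred per token

    def __post_init__(self):
        # Validate enum-like fields and the blend weight up front
        if self.kd_objective not in KD_OBJECTIVES:
            raise ValueError(f"unknown KD objective: {self.kd_objective}")
        if self.logits_backend not in LOGITS_BACKENDS:
            raise ValueError(f"unknown logits backend: {self.logits_backend}")
        if not 0.0 <= self.kd_loss_weight <= 1.0:
            raise ValueError("kd_loss_weight must be in [0, 1]")
```

Validating at construction time keeps a typo in the objective name from surfacing only mid-training.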
Usage
Use when setting up a distillation pipeline to compress a large teacher model into a smaller student model.
Theoretical Basis
Knowledge distillation minimizes a divergence between the temperature-softened teacher and student distributions, blended with the supervised loss:

$$\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{\text{SFT}} + \alpha\, D\!\left(\mathrm{softmax}(z_T/\tau_T)\,\big\|\,\mathrm{softmax}(z_S/\tau_S)\right)$$

where $D$ is one of the six KL divergence objectives, $z_T$ and $z_S$ are the teacher and student logits, $\tau_T$ and $\tau_S$ their respective temperatures, and $\alpha$ the distillation loss weight.
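A minimal numeric sketch of two of the objectives, forward and reverse KL, applied to temperature-softened distributions (pure Python, illustrative only; `softmax` and the loss functions here are stand-ins, not ROLL's API):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(p_teacher, q_student):
    """KL(teacher || student): mass-covering; penalizes the student for
    missing tokens the teacher assigns probability to."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_teacher, q_student) if p > 0)

def reverse_kl(p_teacher, q_student):
    """KL(student || teacher): mode-seeking; penalizes the student for
    placing mass where the teacher does not."""
    return forward_kl(q_student, p_teacher)

# Softened next-token distributions for a toy 3-token vocabulary
teacher = softmax([3.0, 1.0, 0.2], temperature=2.0)
student = softmax([2.0, 2.0, 0.5], temperature=2.0)
# forward_kl and reverse_kl generally disagree whenever teacher != student,
# which is why the objective choice is a first-class config parameter.
```

The asymmetry between the two directions is the practical reason the framework exposes the objective as a choice rather than fixing one.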
Related Pages
Implemented By
Related Heuristics
No specific heuristics inform this principle.