Principle: Alibaba ROLL Knowledge Distillation Configuration
| Knowledge Sources | |
|---|---|
| Domains | Knowledge_Distillation, Configuration |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A configuration principle for setting up knowledge distillation from a large teacher LLM to a smaller student LLM with configurable KL divergence objectives.
Description
Knowledge Distillation Configuration manages the parameters for teacher-student training, including:
- KD objective selection: Six KL divergence variants (forward_kl, reverse_kl, adaptive_kl, skewed_forward_kl, skewed_reverse_kl, JS divergence)
- Temperature parameters: Separate temperatures for teacher and student softmax distributions
- Loss blending: Configurable weight between SFT loss and distillation loss
- Logits transfer backend: Three communication modes for cross-cluster logit transfer (IPC+NCCL, NCCL-only, Ray)
- Top-k logits: Number of teacher logits to transfer for memory efficiency
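As a sketch, the parameters above could be grouped into a single config object. The field names and defaults below are illustrative, not ROLL's actual schema:

```python
from dataclasses import dataclass

# Illustrative option sets mirroring the bullets above (names are assumptions)
KD_OBJECTIVES = {
    "forward_kl", "reverse_kl", "adaptive_kl",
    "skewed_forward_kl", "skewed_reverse_kl", "js_divergence",
}
LOGITS_BACKENDS = {"ipc_nccl", "nccl", "ray"}

@dataclass
class DistillConfig:
    kd_objective: str = "forward_kl"   # one of the six KL divergence variants
    teacher_temperature: float = 1.0   # softens the teacher softmax
    student_temperature: float = 1.0   # softens the student softmax
    kd_loss_weight: float = 0.5        # blend: (1 - w) * SFT loss + w * KD loss
    logits_backend: str = "ipc_nccl"   # cross-cluster logit transfer mode
    top_k_logits: int = 64             # teacher logits transferred per token

    def __post_init__(self):
        # Validate enum-like fields and the blend weight up front
        if self.kd_objective not in KD_OBJECTIVES:
            raise ValueError(f"unknown KD objective: {self.kd_objective}")
        if self.logits_backend not in LOGITS_BACKENDS:
            raise ValueError(f"unknown logits backend: {self.logits_backend}")
        if not 0.0 <= self.kd_loss_weight <= 1.0:
            raise ValueError("kd_loss_weight must be in [0, 1]")
```

Validating at construction time keeps a typo in the objective name from surfacing only mid-training.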
Usage
Use when setting up a distillation pipeline to compress a large teacher model into a smaller student model.
Theoretical Basis
Knowledge distillation minimizes a divergence between the temperature-softened teacher and student distributions, blended with the supervised loss:

$$\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{\text{SFT}} + \alpha\, D\!\left(\mathrm{softmax}(z_T/\tau_T)\,\big\|\,\mathrm{softmax}(z_S/\tau_S)\right)$$

where $D$ is one of the six KL divergence objectives, $z_T$ and $z_S$ are the teacher and student logits, $\tau_T$ and $\tau_S$ their respective temperatures, and $\alpha$ the distillation loss weight.
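A minimal numeric sketch of two of the objectives, forward and reverse KL, applied to temperature-softened distributions (pure Python, illustrative only; `softmax` and the loss functions here are stand-ins, not ROLL's API):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(p_teacher, q_student):
    """KL(teacher || student): mass-covering; penalizes the student for
    missing tokens the teacher assigns probability to."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_teacher, q_student) if p > 0)

def reverse_kl(p_teacher, q_student):
    """KL(student || teacher): mode-seeking; penalizes the student for
    placing mass where the teacher does not."""
    return forward_kl(q_student, p_teacher)

# Softened next-token distributions for a toy 3-token vocabulary
teacher = softmax([3.0, 1.0, 0.2], temperature=2.0)
student = softmax([2.0, 2.0, 0.5], temperature=2.0)
# forward_kl and reverse_kl generally disagree whenever teacher != student,
# which is why the objective choice is a first-class config parameter.
```

The asymmetry between the two directions is the practical reason the framework exposes the objective as a choice rather than fixing one.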
Related Pages
Implemented By
Related Heuristics
No specific heuristics inform this principle.