Principle:Alibaba ROLL Knowledge Distillation Configuration

From Leeroopedia


Knowledge Sources
Domains Knowledge_Distillation, Configuration
Last Updated 2026-02-07 20:00 GMT

Overview

A configuration principle for setting up knowledge distillation from a large teacher LLM to a smaller student LLM with configurable KL divergence objectives.

Description

Knowledge Distillation Configuration manages the parameters for teacher-student training, including:

  • KD objective selection: Six KL divergence variants (forward_kl, reverse_kl, adaptive_kl, skewed_forward_kl, skewed_reverse_kl, JS divergence)
  • Temperature parameters: Separate temperatures for teacher and student softmax distributions
  • Loss blending: Configurable weight between SFT loss and distillation loss
  • Logits transfer backend: Three communication modes for cross-cluster logit transfer (IPC+NCCL, NCCL-only, Ray)
  • Top-k logits: Number of teacher logits to transfer for memory efficiency
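The parameters above can be grouped into a single configuration object. ROLL's actual schema and field names are not shown on this page, so the following is only a minimal sketch with hypothetical names, illustrating the kinds of fields and defaults such a configuration carries:

```python
from dataclasses import dataclass

@dataclass
class DistillConfig:
    """Hypothetical sketch of a teacher-student distillation config."""
    # KD objective: one of forward_kl, reverse_kl, adaptive_kl,
    # skewed_forward_kl, skewed_reverse_kl, or JS divergence
    kd_objective: str = "forward_kl"
    # Separate temperatures for teacher and student softmax distributions
    teacher_temperature: float = 2.0
    student_temperature: float = 2.0
    # Blending weight between SFT loss and distillation loss
    kd_loss_weight: float = 0.5
    # Logits transfer backend: e.g. "ipc_nccl", "nccl", or "ray"
    logits_backend: str = "nccl"
    # Number of teacher logits transferred per token, for memory efficiency
    top_k_logits: int = 64
```

Transferring only the top-k teacher logits trades a small approximation error for a large reduction in cross-cluster communication volume, which is why it appears alongside the backend choice.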

Usage

Use when setting up a distillation pipeline to compress a large teacher model into a smaller student model.

Theoretical Basis

Knowledge distillation minimizes the divergence between teacher and student distributions:

L = (1 − α) · L_SFT + α · L_KD

where L_KD is one of the six KL divergence objectives applied to the temperature-softened distributions, and α is the configurable blending weight.
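The blended loss can be sketched in a few lines. This is not ROLL's implementation; it is a dependency-free illustration, assuming forward KL as the KD objective and a single token position, with the temperature and weight parameters described above:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, label,
                 alpha=0.5, t_student=2.0, t_teacher=2.0):
    """L = (1 - alpha) * L_SFT + alpha * L_KD with forward KL as L_KD."""
    # SFT term: negative log-likelihood of the ground-truth label
    p_student = softmax(student_logits)
    sft = -math.log(p_student[label])
    # KD term: forward KL(teacher || student) on softened distributions
    q = softmax(student_logits, t_student)
    p = softmax(teacher_logits, t_teacher)
    kd = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    # Blend the two terms with the configurable weight alpha
    return (1 - alpha) * sft + alpha * kd
```

At alpha = 0 this reduces to plain SFT; at alpha = 1 the student trains purely on the teacher's softened distribution. The other five variants differ only in the divergence used for the KD term.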

Related Pages

Implemented By

Related Heuristics

No specific heuristics inform this principle.
