
Workflow: OpenRLHF Knowledge Distillation

From Leeroopedia


Knowledge Sources
Domains: LLMs, Knowledge_Distillation, Model_Compression
Last Updated: 2026-02-07 10:00 GMT

Overview

End-to-end process for compressing a large teacher language model into a smaller student model using token-level knowledge distillation with KL divergence.

Description

This workflow trains a smaller student model to mimic the output distribution of a larger teacher model. For each training example, the teacher model produces a probability distribution over tokens, and the student is trained to match this distribution using KL divergence loss combined with standard cross-entropy loss on the ground truth. The teacher model can be offloaded to CPU memory to enable distillation from very large models (e.g., 70B teacher to 8B student). The KD coefficient controls the balance between learning from the teacher and learning from the ground truth labels.
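Written out, the objective described above can be sketched as follows. The notation is introduced here for illustration and is not taken from the source: p_T and p_S are the teacher and student next-token distributions, R is the set of response positions, V is the shared vocabulary, y_t is the ground-truth token, and λ is the KD coefficient. The KL direction (teacher relative to student) is an assumption; the workflow only states that a KL divergence is used.

```latex
\mathcal{L} = (1-\lambda)\,\mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{KL}},
\qquad
\mathcal{L}_{\mathrm{CE}} = -\frac{1}{|R|}\sum_{t\in R}\log p_S(y_t \mid x_{<t}),
\qquad
\mathcal{L}_{\mathrm{KL}} = \frac{1}{|R|}\sum_{t\in R}\sum_{v\in V}
  p_T(v \mid x_{<t})\,\log\frac{p_T(v \mid x_{<t})}{p_S(v \mid x_{<t})}
```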

Usage

Execute this workflow when you want to transfer knowledge from a large, high-quality model into a smaller, deployable model. This is useful when the teacher model is too large for production inference but you want to retain as much of its quality as possible. A distilled student typically outperforms the same student trained on the same data with cross-entropy alone, because the teacher's full token distribution carries more signal per example than a single ground-truth label.

Execution Steps

Step 1: Configure distributed strategy

Initialize the DeepSpeed training strategy and size it for the combined memory footprint of the teacher and student models. The teacher can be offloaded to CPU to reduce GPU memory pressure.

Key considerations:

  • ZeRO-3 is recommended to shard the student model across GPUs
  • CPU offloading of the teacher model enables distillation from very large teachers
  • bf16 mixed precision for both models
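The considerations above could translate into DeepSpeed configuration dictionaries along these lines. This is a minimal sketch, not the configuration OpenRLHF generates: the helper name build_ds_configs and its parameters are hypothetical, and the values are placeholders. The dictionary keys themselves are standard DeepSpeed config keys.

```python
def build_ds_configs(micro_batch_size=1, offload_teacher=True):
    """Return (student_config, teacher_config) as plain DeepSpeed-style dicts.

    Hypothetical helper for illustration only.
    """
    # Student: trained, so it is sharded with ZeRO-3 and uses bf16.
    student = {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 3},
        "gradient_clipping": 1.0,  # placeholder value
    }

    # Teacher: inference only; optionally keep its parameters in CPU memory.
    teacher_zero = {"stage": 3}
    if offload_teacher:
        teacher_zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    teacher = {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "bf16": {"enabled": True},
        "zero_optimization": teacher_zero,
    }
    return student, teacher
```

Keeping the two configurations separate makes it explicit that only the student carries training state, while the teacher only needs its parameters to be reachable for forward passes.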

Step 2: Load teacher and student models

Load the large teacher model in evaluation mode (frozen, no gradients). Load the smaller student model in training mode. The teacher is typically a high-quality instruction-tuned model, and the student is a smaller base or instruction-tuned model of the same family.

Key considerations:

  • The teacher and student must share the same tokenizer vocabulary, because the KL term compares their output distributions entry by entry over that vocabulary
  • CPU offloading flag controls whether the teacher resides in CPU or GPU memory
  • The teacher model is never updated during training
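The shared-vocabulary requirement can be checked before any model weights are loaded. The following is a minimal sketch; vocab_mismatches is a hypothetical helper that compares two token-to-id mappings and is not part of OpenRLHF.

```python
def vocab_mismatches(teacher_vocab, student_vocab, limit=5):
    """Return up to `limit` (token, teacher_id, student_id) tuples that disagree.

    A token missing on one side is reported with None for that side.
    An empty list means the two mappings are identical.
    """
    problems = []
    for token in sorted(set(teacher_vocab) | set(student_vocab)):
        t_id = teacher_vocab.get(token)
        s_id = student_vocab.get(token)
        if t_id != s_id:
            problems.append((token, t_id, s_id))
            if len(problems) >= limit:
                break
    return problems
```

With Hugging Face tokenizers, the two mappings would typically come from each tokenizer's get_vocab() method; an empty result suggests the teacher's and student's logits index the same tokens.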

Step 3: Prepare training dataset

Load the instruction-response dataset and tokenize with the shared tokenizer. Create the SFT dataset with appropriate chat templates and loss masks for computing loss only on response tokens.

Key considerations:

  • The dataset format is the same as standard SFT training
  • Chat templates must match both teacher and student model families
  • Sample packing can improve throughput
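Restricting the loss to response tokens is commonly done by replacing prompt positions in the label sequence with an ignore value. The sketch below assumes already tokenized prompt and response ids; build_labels is a hypothetical helper, and -100 is used because it is the default ignore_index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt out of the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt positions get IGNORE_INDEX so they contribute no loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def loss_mask(labels):
    """Boolean mask that is True exactly on response positions."""
    return [label != IGNORE_INDEX for label in labels]
```

For example, build_labels([5, 6, 7], [8, 9]) keeps all five ids as input but leaves only the last two positions active in the labels.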

Step 4: Setup optimizer and scheduler

Configure the optimizer for the student model only. Set learning rate and scheduling appropriate for knowledge distillation.

Key considerations:

  • Learning rates are similar to SFT (e.g., 5e-6)
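  • The KD coefficient (e.g., 0.4) weights the teacher KL loss against the ground-truth cross-entropy; a value of 0 reduces the objective to plain SFT, and a value of 1 trains on the teacher signal only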
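The workflow does not specify a particular schedule, so the following is only one plausible choice: linear warmup followed by cosine decay to a floor. The function name lr_at_step and the warmup_ratio and min_lr_ratio defaults are assumptions made for illustration; only the 5e-6 peak learning rate comes from the text above.

```python
import math


def lr_at_step(step, total_steps, base_lr=5e-6, warmup_ratio=0.03, min_lr_ratio=0.1):
    """Learning rate at a given optimizer step (hypothetical schedule)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to base_lr * min_lr_ratio.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return base_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)
```

Only the student's parameters would be passed to the optimizer that consumes this schedule, since the teacher is frozen.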

Step 5: Train with distillation objective

Execute the training loop with the combined distillation objective. For each batch, run a forward pass through both teacher and student. Compute the KL divergence between teacher and student output distributions, and the standard cross-entropy loss on ground truth. The total loss is a weighted combination: loss = (1 - kd_coef) * ce_loss + kd_coef * kl_loss.

Key considerations:

  • The teacher forward pass runs in no-gradient mode for efficiency
  • KL divergence is computed token-by-token on the response tokens
  • The KD coefficient controls the balance between imitation and ground truth learning
  • Monitor both the KL loss (teacher matching) and CE loss (ground truth accuracy)
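The combined objective can be illustrated with a small, dependency-free sketch. It assumes that logits are given as one list of floats per position, already aligned with the labels, and that prompt positions are marked with -100; the function names and the KL direction (teacher relative to student) are assumptions rather than OpenRLHF's actual implementation, which operates on tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits, teacher_logits, labels,
                      kd_coef=0.4, ignore_index=-100):
    """Return (loss, ce_loss, kl_loss), averaged over response positions."""
    ce_sum = 0.0
    kl_sum = 0.0
    n = 0
    for s_row, t_row, label in zip(student_logits, teacher_logits, labels):
        if label == ignore_index:
            continue  # prompt or padding position: no loss
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # Cross-entropy against the ground-truth token.
        ce_sum += -s_logp[label]
        # KL(teacher || student) summed over the vocabulary.
        kl_sum += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        n += 1
    if n == 0:
        return 0.0, 0.0, 0.0
    ce_loss = ce_sum / n
    kl_loss = kl_sum / n
    loss = (1 - kd_coef) * ce_loss + kd_coef * kl_loss
    return loss, ce_loss, kl_loss
```

Returning the two components alongside the total makes it straightforward to log the KL and CE terms separately, as the last consideration above suggests.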

Step 6: Save student model

Save the trained student model weights and tokenizer. The student model is now ready for deployment or further fine-tuning.

Key considerations:

  • Only the student model is saved
  • The student can be further aligned with DPO or PPO if needed
