Workflow:OpenRLHF OpenRLHF Knowledge Distillation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Knowledge_Distillation, Model_Compression |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
End-to-end process for compressing a large teacher language model into a smaller student model using token-level knowledge distillation with KL divergence.
Description
This workflow trains a smaller student model to mimic the output distribution of a larger teacher model. For each training example, the teacher model produces a probability distribution over tokens, and the student is trained to match this distribution using KL divergence loss combined with standard cross-entropy loss on the ground truth. The teacher model can be offloaded to CPU memory to enable distillation from very large models (e.g., 70B teacher to 8B student). The KD coefficient controls the balance between learning from the teacher and learning from the ground truth labels.
Usage
Execute this workflow when you want to transfer knowledge from a large, high-quality model into a smaller, deployable model. This is useful when the teacher model is too large for production inference but its quality needs to be preserved. Distillation often yields a better small model than fine-tuning the same student on the ground-truth labels alone, because the teacher's full token distribution carries more information per token than a single label.
Execution Steps
Step 1: Configure distributed strategy
Initialize the DeepSpeed training strategy. Size the configuration for the combined memory footprint of the teacher and the student, since both models are held during training. The teacher model can be offloaded to CPU to reduce GPU memory pressure.
Key considerations:
- ZeRO-3 is recommended to shard the student model across GPUs
- CPU offloading of the teacher model enables distillation from very large teachers
- bf16 mixed precision for both models
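The considerations above can be sketched as two DeepSpeed configuration dictionaries, one for the trainable student and one for the frozen teacher. This is a minimal illustration, not the configuration OpenRLHF builds internally: the helper names `build_student_ds_config` and `build_teacher_ds_config` are hypothetical, and the batch-size and clipping values are placeholders. The dictionary keys themselves are standard DeepSpeed config keys.

```python
# Hypothetical helpers: sketch of DeepSpeed config dicts for student and teacher.
# Values such as gradient_clipping=1.0 are illustrative assumptions.

def build_student_ds_config(micro_batch_size: int, grad_accum_steps: int,
                            zero_stage: int = 3) -> dict:
    """Config for the trainable student: ZeRO sharding plus bf16."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": zero_stage, "overlap_comm": True},
        "gradient_clipping": 1.0,
    }


def build_teacher_ds_config(micro_batch_size: int, offload_to_cpu: bool) -> dict:
    """Config for the frozen teacher: ZeRO-3 parameters, optionally kept on CPU."""
    zero = {"stage": 3}
    if offload_to_cpu:
        # Parameters live in pinned CPU memory and are fetched to GPU on demand.
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "bf16": {"enabled": True},
        "zero_optimization": zero,
    }


student_cfg = build_student_ds_config(micro_batch_size=2, grad_accum_steps=4)
teacher_cfg = build_teacher_ds_config(micro_batch_size=2, offload_to_cpu=True)
```

The teacher config carries no optimizer or gradient settings because the teacher only runs forward passes.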
Step 2: Load teacher and student models
Load the large teacher model in evaluation mode (frozen, no gradients). Load the smaller student model in training mode. The teacher is typically a high-quality instruction-tuned model, and the student is a smaller base or instruction-tuned model of the same family.
Key considerations:
- The teacher and student must share the same tokenizer and vocabulary, because the token-level KL term compares their output distributions index by index
- The CPU offloading flag controls whether the teacher resides in CPU or GPU memory
- The teacher model is never updated during training
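A minimal PyTorch sketch of the frozen-teacher and trainable-student setup follows. The two small `nn.Sequential` modules are stand-ins for real language models, and `freeze_teacher` is a hypothetical helper name. In practice the models would be loaded from checkpoints through the training framework.

```python
import torch
from torch import nn

VOCAB_SIZE = 32  # shared vocabulary size (toy value for illustration)


def freeze_teacher(teacher: nn.Module) -> nn.Module:
    """Put the teacher in eval mode and disable gradients for all parameters."""
    teacher.eval()
    teacher.requires_grad_(False)
    return teacher


# Stand-in models: a wider "teacher" and a narrower "student" over the same vocab.
teacher = freeze_teacher(
    nn.Sequential(nn.Embedding(VOCAB_SIZE, 64), nn.Linear(64, VOCAB_SIZE))
)
student = nn.Sequential(nn.Embedding(VOCAB_SIZE, 16), nn.Linear(16, VOCAB_SIZE))
student.train()

input_ids = torch.randint(0, VOCAB_SIZE, (2, 5))  # [batch, seq]

# Teacher forward pass without building an autograd graph.
with torch.no_grad():
    teacher_logits = teacher(input_ids)

# Student forward pass keeps gradients for the optimizer step.
student_logits = student(input_ids)
```

Because both models emit logits over the same vocabulary, the two logit tensors have identical shapes and can be compared position by position.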
Step 3: Prepare training dataset
Load the instruction-response dataset and tokenize with the shared tokenizer. Create the SFT dataset with appropriate chat templates and loss masks for computing loss only on response tokens.
Key considerations:
- The dataset format is the same as standard SFT training
- Chat templates must match both teacher and student model families
- Sample packing can improve throughput
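The response-only loss mask can be sketched as below. This is an illustration under stated assumptions, not the dataset class OpenRLHF uses: `build_labels` is a hypothetical helper, the token IDs are made up, and masking is expressed through the label value -100, which is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def count_loss_tokens(labels):
    """Number of positions that contribute to the loss."""
    return sum(1 for label in labels if label != IGNORE_INDEX)


# Toy example with made-up token IDs: 3 prompt tokens, 2 response tokens.
input_ids, labels = build_labels(prompt_ids=[11, 12, 13], response_ids=[21, 22])
```

The same mask applies to both loss terms, so neither the cross-entropy nor the KL term is computed on prompt tokens.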
Step 4: Setup optimizer and scheduler
Configure the optimizer for the student model only; the frozen teacher needs no optimizer state. Set a learning rate and schedule appropriate for knowledge distillation.
Key considerations:
- Learning rates are similar to SFT (e.g., 5e-6)
- The KD coefficient (e.g., 0.4) balances teacher KL loss vs. ground truth cross-entropy
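As an illustration of a learning-rate schedule around the 5e-6 peak mentioned above, the sketch below implements linear warmup followed by cosine decay. The schedule shape, `warmup_ratio`, and `min_lr_ratio` are assumptions chosen for the example; the workflow itself only specifies the order of magnitude of the learning rate.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 5e-6,
               warmup_ratio: float = 0.03, min_lr_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then cosine decay to min_lr_ratio * base_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return base_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# Sample the schedule at a few points of a 1000-step run.
schedule = [lr_at_step(s, total_steps=1000) for s in (0, 29, 30, 500, 1000)]
```

Only the student's parameters would be passed to the optimizer that consumes this schedule.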
Step 5: Train with distillation objective
Execute the training loop with the combined distillation objective. For each batch, run a forward pass through both teacher and student. Compute the KL divergence between teacher and student output distributions, and the standard cross-entropy loss on ground truth. The total loss is a weighted combination: loss = (1 - kd_coef) * ce_loss + kd_coef * kl_loss.
Key considerations:
- The teacher forward pass runs in no-gradient mode for efficiency
- KL divergence is computed token-by-token on the response tokens
- With kd_coef = 0 the objective reduces to plain SFT on the ground truth; with kd_coef = 1 the student learns only from the teacher distribution
- Monitor both the KL loss (teacher matching) and CE loss (ground truth accuracy)
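The combined objective can be sketched in PyTorch as follows. This is a minimal illustration, not the library's implementation: `distillation_loss` is a hypothetical function name, the KL direction KL(teacher || student) is an assumption since the text above does not state it, and the labels are assumed to be already aligned with the logits and to use -100 on non-response positions.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value marking positions excluded from the loss


def distillation_loss(student_logits, teacher_logits, labels, kd_coef=0.4):
    """Return (total, ce, kl) with total = (1 - kd_coef) * ce + kd_coef * kl.

    student_logits, teacher_logits: [batch, seq, vocab]
    labels: [batch, seq], aligned with the logits, IGNORE_INDEX on masked tokens
    """
    vocab = student_logits.size(-1)
    s = student_logits.float().reshape(-1, vocab)
    t = teacher_logits.float().reshape(-1, vocab)
    y = labels.reshape(-1)
    mask = (y != IGNORE_INDEX).float()
    num_tokens = mask.sum().clamp(min=1.0)

    # Cross-entropy against ground truth, averaged over response tokens.
    ce_loss = F.cross_entropy(s, y, ignore_index=IGNORE_INDEX)

    # Token-level KL(teacher || student), averaged over response tokens.
    s_logp = F.log_softmax(s, dim=-1)
    t_logp = F.log_softmax(t, dim=-1)
    kl_per_token = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)
    kl_loss = (kl_per_token * mask).sum() / num_tokens

    loss = (1.0 - kd_coef) * ce_loss + kd_coef * kl_loss
    return loss, ce_loss, kl_loss


# Toy usage: random logits, first two positions of each sequence masked out.
torch.manual_seed(0)
student_logits = torch.randn(2, 4, 8, requires_grad=True)
teacher_logits = torch.randn(2, 4, 8)
labels = torch.randint(0, 8, (2, 4))
labels[:, :2] = IGNORE_INDEX
loss, ce_loss, kl_loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

Returning the two terms separately makes it straightforward to log the KL and CE losses side by side. Some implementations drop the teacher-entropy term and minimize the cross-entropy against the teacher probabilities instead; since that term does not depend on the student, the student's gradients are the same.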
Step 6: Save student model
Save the trained student model weights and tokenizer. The student model is now ready for deployment or further fine-tuning.
Key considerations:
- Only the student model is saved
- The student can be further aligned with DPO or PPO if needed