Principle: OpenRLHF Knowledge Distillation Training
| Field | Value |
|---|---|
| Domains | NLP, Training, Model_Compression |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A training technique that transfers knowledge from a large teacher model to a smaller student model by matching token-level probability distributions.
Description
Knowledge Distillation (KD) trains a student model to mimic a teacher model's output distribution. The student's loss combines a standard language modeling loss (cross-entropy with ground truth labels) and a distillation loss (KL divergence between teacher and student distributions). This allows the student to learn from both the explicit labels and the teacher's "dark knowledge" encoded in its soft probability distributions.
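The combined loss described above can be sketched in a few lines. This is a minimal single-token illustration, not OpenRLHF's actual implementation (which operates on full logit tensors with PyTorch autograd); the function name `kd_loss` and the default coefficient are illustrative choices.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, target, kd_coef=0.4):
    """Combined KD loss for one token position.

    L = (1 - kd_coef) * CE(student, target) + kd_coef * KL(p_teacher || p_student)

    `kd_coef` mirrors the coefficient named in this article; 0.4 is an
    illustrative default, not a value prescribed by OpenRLHF.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -np.log(p_s[target])                        # cross-entropy with the hard label
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))   # forward KL, teacher as reference
    return (1 - kd_coef) * ce + kd_coef * kl
```

Setting `kd_coef=0` recovers plain language-model training on the labels, while `kd_coef=1` trains the student purely against the teacher's soft distribution.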
Usage
Use when you need to compress a large teacher model into a smaller student model while retaining as much capability as possible. The teacher model is frozen and used only for inference.
Theoretical Basis
The combined loss for knowledge distillation is:

$$\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{\text{CE}} + \alpha\,\mathcal{L}_{\text{KD}}$$

where:
- $\mathcal{L}_{\text{CE}}$ is the standard cross-entropy loss with the ground-truth labels
- $\mathcal{L}_{\text{KD}} = \mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\text{student}}\big)$ is the forward KL divergence between the token-level distributions
- $\alpha$ (kd_coef) controls the balance between the CE and KD losses
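Two properties of the forward KL term are worth checking numerically: it is zero exactly when the student reproduces the teacher's distribution, and it is asymmetric, so the direction (teacher as the reference distribution) matters. A minimal sketch, assuming a small toy vocabulary:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward_kl(teacher_logits, student_logits):
    """KL(p_T || p_S): an expectation under the teacher's distribution."""
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
```

Because the expectation is taken under the teacher, forward KL heavily penalizes the student for assigning low probability to tokens the teacher considers likely ("mode covering"), which is the behavior wanted when matching a teacher's full distribution.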