Principle: Dropout Disabling for Direct Preference Optimization (DPO)
| Metadata | Value |
|---|---|
| Domains | Regularization, Training_Stability, Deep_Learning |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A training configuration technique that disables all dropout layers in a model to ensure deterministic and stable preference optimization.
Description
Dropout disabling sets all dropout probabilities to zero in a pre-trained model before DPO or SFT training. While dropout is useful during pre-training as a regularization technique, it introduces stochasticity that can be harmful during preference optimization:
- In DPO, the loss compares log probabilities from the policy and reference models. Dropout noise could cause inconsistent probability estimates between the two models, destabilizing training.
- During SFT on small preference datasets, the regularization benefit of dropout is minimal compared to the noise it introduces.
- Disabling dropout ensures reproducible forward passes, which is important for the comparison between policy and reference model outputs.
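The inconsistency described above can be seen in a minimal, framework-free sketch (pure Python, with inverted dropout implemented by hand for illustration): with p > 0 two forward passes over the same input disagree, while p = 0 makes every pass identical.

```python
import random

def dropout_forward(x, p, rng):
    """Inverted dropout on a list of floats: each element is zeroed with
    probability p; survivors are scaled by 1/(1-p) to preserve the mean."""
    if p == 0.0:
        return list(x)          # p = 0: the pass is fully deterministic
    scale = 1.0 / (1.0 - p)
    return [xi * scale if rng.random() >= p else 0.0 for xi in x]

x = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)

# With p > 0, two passes over the *same* input disagree, so log probabilities
# computed by the policy and reference models are not directly comparable.
noisy_a = dropout_forward(x, 0.5, rng)
noisy_b = dropout_forward(x, 0.5, rng)

# With p = 0, every pass returns identical activations.
clean_a = dropout_forward(x, 0.0, rng)
clean_b = dropout_forward(x, 0.0, rng)
```

This is why the policy/reference log-ratio in DPO only behaves as intended when both models run deterministic forward passes.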
Usage
Apply this technique immediately after loading any model that will be used in DPO or SFT training. Both the policy and reference models should have dropout disabled.
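As a sketch of that ordering (load, then disable, then train), here is a toy version using hypothetical stand-in layer classes; a real framework such as PyTorch provides the equivalent Dropout/Linear modules and a module-tree iterator:

```python
# Hypothetical stand-in layer classes for illustration only.
class Dropout:
    def __init__(self, p):
        self.p = p

class Linear:
    pass

def zero_dropout_layers(model):
    # `model` is a flat list of layers here; in practice you would walk
    # the module tree of the loaded network.
    for layer in model:
        if isinstance(layer, Dropout):
            layer.p = 0.0

# Disable dropout on BOTH models right after loading, before any training step.
policy = [Linear(), Dropout(0.1), Linear(), Dropout(0.1)]
reference = [Linear(), Dropout(0.1), Linear(), Dropout(0.1)]
zero_dropout_layers(policy)
zero_dropout_layers(reference)
```

Disabling dropout on the reference model matters as much as on the policy: both sides of the DPO log-ratio must be deterministic.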
Theoretical Basis
Dropout randomly zeros elements of the input tensor with probability p during training, and scales the remaining elements by 1/(1 - p). Setting p = 0 makes the forward pass deterministic.
For DPO specifically, the loss depends on the difference of log-ratios between policy and reference models. Dropout noise on either side would add variance to the gradient estimates without providing useful regularization signal, since the models are already pre-trained and the fine-tuning dataset is relatively small.
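Concretely, the DPO loss takes the standard form (Rafailov et al., 2023), where σ is the logistic sigmoid and β scales the implicit reward:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)\right]
```

Each of the four log probabilities is produced by a forward pass; with active dropout, each pass would carry independent noise, adding variance to the gradient of this loss.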
Pseudo-code (shown here as a concrete PyTorch version of the abstract algorithm):
```python
import torch.nn as nn

def disable_dropout(model: nn.Module) -> None:
    # Zero the drop probability of every Dropout submodule in place.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = 0.0
```