Heuristic: Hugging Face Alignment Handbook DDP Bias Buffer Ignore
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Debugging |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Workaround for PyTorch DDP: ignore boolean bias buffers to prevent synchronization errors in distributed preference training.
Description
The DPO and ORPO training scripts contain a torch distributed hack that tells PyTorch's DistributedDataParallel (DDP) to ignore boolean-typed buffers during gradient synchronization. Some model architectures store boolean attention masks as named buffers, which causes DDP to fail during all-reduce operations because boolean tensors are not supported by NCCL.
Usage
Apply this when running DPO or ORPO training in distributed mode (multi-GPU) and encountering NCCL errors related to boolean buffers. This is controlled by the `ignore_bias_buffers` flag in ScriptArguments.
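The flag lives in the script's argument dataclass. As a minimal sketch of how such a flag is typically declared for `HfArgumentParser`-style parsing (the real `ScriptArguments` definition may differ in fields and help text):

```python
from dataclasses import dataclass, field


@dataclass
class ScriptArguments:
    # Hypothetical sketch; only the flag name comes from the source.
    ignore_bias_buffers: bool = field(
        default=False,
        metadata={"help": "Exclude boolean-typed buffers from DDP synchronization."},
    )


args = ScriptArguments(ignore_bias_buffers=True)
```

Passing `--ignore_bias_buffers true` on the command line (or setting it in a recipe file) would then toggle the workaround.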
The Insight (Rule of Thumb)
- Action: Set `ignore_bias_buffers: true` in the script arguments, or ensure the code filters out boolean buffers from DDP synchronization.
- Value: Prevents NCCL RuntimeErrors during distributed training.
- Trade-off: The ignored buffers are not synchronized across processes, but boolean masks are typically identical across GPUs so this is safe.
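The filtering step can be sketched on a toy module. `TinyModel` below is hypothetical, standing in for architectures that register a boolean attention mask as a named buffer; the attribute name `_ddp_params_and_buffers_to_ignore` is the one PyTorch's DDP wrapper inspects:

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    # Hypothetical toy model mimicking architectures that store a
    # boolean mask as a named buffer alongside non-boolean buffers.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.register_buffer("bias_mask", torch.ones(4, 4, dtype=torch.bool))
        self.register_buffer("position_ids", torch.arange(4))


model = TinyModel()

# Collect the names of boolean buffers so DDP skips them during sync;
# non-boolean buffers like position_ids are still synchronized.
ignored = [name for name, buf in model.named_buffers() if buf.dtype == torch.bool]
model._ddp_params_and_buffers_to_ignore = ignored

print(ignored)  # only the boolean mask is excluded
```

Wrapping `model` in `torch.nn.parallel.DistributedDataParallel` after setting this attribute leaves `bias_mask` out of the broadcast/all-reduce path, which is safe here because the mask is identical on every rank.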
Reasoning
NCCL (the communication backend for multi-GPU training) does not support all-reduce on boolean tensors. Some transformer models store attention masks as named buffers, which DDP tries to synchronize by default. The fix explicitly excludes these from synchronization.
Code evidence from `scripts/dpo.py:105-109`:
```python
if script_args.ignore_bias_buffers:
    # torch distributed hack
    model._ddp_params_and_buffers_to_ignore = [
        name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
    ]
```
The identical pattern appears in `scripts/orpo.py:105-109`.
The comment `# torch distributed hack` in the source code explicitly flags this as tribal knowledge.