Heuristic: Hugging Face Alignment Handbook DDP Bias Buffer Ignore
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Debugging |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Workaround for PyTorch DDP: ignore boolean bias buffers to prevent synchronization errors in distributed preference training.
Description
The DPO and ORPO training scripts contain a torch distributed hack that tells PyTorch's DistributedDataParallel (DDP) to ignore boolean-typed buffers during gradient synchronization. Some model architectures store boolean attention masks as named buffers, which causes DDP to fail during all-reduce operations because boolean tensors are not supported by NCCL.
Usage
Apply this when running DPO or ORPO training in distributed mode (multi-GPU) and encountering NCCL errors related to boolean buffers. This is controlled by the `ignore_bias_buffers` flag in ScriptArguments.
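The flag lives in the script's argument dataclass. As a minimal sketch of how such a flag is typically declared for `HfArgumentParser`-style parsing (the real `ScriptArguments` definition may differ in fields and help text):

```python
from dataclasses import dataclass, field


@dataclass
class ScriptArguments:
    # Hypothetical sketch; only the flag name comes from the source.
    ignore_bias_buffers: bool = field(
        default=False,
        metadata={"help": "Exclude boolean-typed buffers from DDP synchronization."},
    )


args = ScriptArguments(ignore_bias_buffers=True)
```

Passing `--ignore_bias_buffers true` on the command line (or setting it in a recipe file) would then toggle the workaround.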
The Insight (Rule of Thumb)
- Action: Set `ignore_bias_buffers: true` in the script arguments, or ensure the code filters out boolean buffers from DDP synchronization.
- Value: Prevents NCCL RuntimeErrors during distributed training.
- Trade-off: The ignored buffers are not synchronized across processes, but boolean masks are typically identical across GPUs so this is safe.
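The filtering step can be sketched on a toy module. `TinyModel` below is hypothetical, standing in for architectures that register a boolean attention mask as a named buffer; the attribute name `_ddp_params_and_buffers_to_ignore` is the one PyTorch's DDP wrapper inspects:

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    # Hypothetical toy model mimicking architectures that store a
    # boolean mask as a named buffer alongside non-boolean buffers.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.register_buffer("bias_mask", torch.ones(4, 4, dtype=torch.bool))
        self.register_buffer("position_ids", torch.arange(4))


model = TinyModel()

# Collect the names of boolean buffers so DDP skips them during sync;
# non-boolean buffers like position_ids are still synchronized.
ignored = [name for name, buf in model.named_buffers() if buf.dtype == torch.bool]
model._ddp_params_and_buffers_to_ignore = ignored

print(ignored)  # only the boolean mask is excluded
```

Wrapping `model` in `torch.nn.parallel.DistributedDataParallel` after setting this attribute leaves `bias_mask` out of the broadcast/all-reduce path, which is safe here because the mask is identical on every rank.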
Reasoning
NCCL (the communication backend for multi-GPU training) does not support all-reduce on boolean tensors. Some transformer models store attention masks as named buffers, which DDP tries to synchronize by default. The fix explicitly excludes these from synchronization.
Code evidence from `scripts/dpo.py:105-109`:
```python
if script_args.ignore_bias_buffers:
    # torch distributed hack
    model._ddp_params_and_buffers_to_ignore = [
        name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
    ]
```
The identical pattern appears in `scripts/orpo.py:105-109`.
The comment `# torch distributed hack` in the source code explicitly flags this as tribal knowledge.