Principle:OpenGVLab InternVL Gradient Compression

Knowledge Sources	OpenGVLab_InternVL
Domains	Distributed Training, Gradient Compression, Communication Efficiency
Last Updated	2026-02-07 14:00 GMT

Overview

Gradient Compression reduces the communication bandwidth required for gradient synchronization in distributed training by casting gradient tensors to lower-precision formats before allreduce operations.

Description

In distributed data-parallel training, each worker computes gradients on its local data and then synchronizes gradients across all workers via allreduce. For large models (e.g., InternViT-6B with 6 billion parameters), this gradient synchronization can become a significant bottleneck, especially in multi-node settings where inter-node bandwidth is limited.
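To put the bottleneck in numbers, the following back-of-the-envelope sketch computes the size of one full gradient buffer at 4 bytes and at 2 bytes per element. The helper name `gradient_payload_gib` is hypothetical, and the parameter count is the rounded "6 billion" figure from the text rather than an exact count. The bytes actually sent over the network also depend on the allreduce algorithm, which is not modeled here.

```python
def gradient_payload_gib(num_params: int, bytes_per_element: int) -> float:
    """Size of one full gradient buffer in GiB (hypothetical helper)."""
    return num_params * bytes_per_element / 2**30


# Rounded figure taken from the text; the exact parameter count is an assumption.
NUM_PARAMS = 6_000_000_000

fp32_gib = gradient_payload_gib(NUM_PARAMS, 4)  # float32: 4 bytes per element
half_gib = gradient_payload_gib(NUM_PARAMS, 2)  # float16/bfloat16: 2 bytes per element

print(f"float32 gradients: {fp32_gib:.2f} GiB per synchronization")
print(f"16-bit gradients:  {half_gib:.2f} GiB per synchronization")
```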

Gradient Compression addresses this by:

1. Casting gradient tensors from full precision (float32) to half precision (float16 or bfloat16) before the allreduce operation, halving the communication volume.
2. Allreducing the compressed gradients across all workers.
3. Decompressing the result back to the original precision via in-place copy to minimize peak memory usage.
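The three steps can be sketched in a single process, with a plain sum over a list of tensors standing in for the real allreduce. This is an illustration under stated assumptions, not the InternVL implementation: the function name `simulate_compressed_allreduce` is hypothetical, and a real hook would operate on a DDP gradient bucket and call `torch.distributed` instead of looping over a list.

```python
import torch


def simulate_compressed_allreduce(worker_grads, dtype=torch.float16):
    """Single-process stand-in for compress -> allreduce -> decompress.

    worker_grads: list of float32 tensors, one per simulated worker.
    Each tensor is overwritten in place with the averaged gradient.
    """
    world_size = len(worker_grads)

    # Step 1: cast to 16 bits and pre-divide so the later sum yields a mean.
    compressed = [g.to(dtype).div_(world_size) for g in worker_grads]

    # Step 2: stand-in for allreduce(SUM) across workers (assumption: a real
    # run would issue a collective through torch.distributed here).
    reduced = compressed[0].clone()
    for c in compressed[1:]:
        reduced.add_(c)

    # Step 3: copy back into the existing float32 buffers instead of
    # allocating new float32 tensors.
    for g in worker_grads:
        g.copy_(reduced)
    return worker_grads


grads = [torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0])]
simulate_compressed_allreduce(grads)
print(grads[0], grads[0].dtype)
```

The in-place `copy_` in step 3 is what keeps peak memory down: the only extra allocation is the 16-bit buffer, which is half the size of the gradients it mirrors.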

This can be applied as a standalone hook (replacing the default allreduce) or as a wrapper around other communication hooks (e.g., PowerSGD), enabling compositional optimization strategies.
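As an illustration of the wrapper form, the sketch below composes half-precision compression with PowerSGD using the hooks that ship with PyTorch (`fp16_compress_wrapper`, `bf16_compress_wrapper`, `powerSGD_hook`, `PowerSGDState`). Whether InternVL's own hooks expose the same names is an assumption; the helper functions `build_compressed_powersgd_hook` and `register_compressed_powersgd` are hypothetical, and the PowerSGD settings are placeholder values.

```python
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks, powerSGD_hook


def build_compressed_powersgd_hook(precision="fp16"):
    """Wrap PyTorch's PowerSGD hook so its input is cast to 16 bits first."""
    wrappers = {
        "fp16": default_hooks.fp16_compress_wrapper,
        "bf16": default_hooks.bf16_compress_wrapper,
    }
    return wrappers[precision](powerSGD_hook.powerSGD_hook)


def register_compressed_powersgd(ddp_model, precision="fp16", rank=1):
    """Register the composed hook on an existing DistributedDataParallel model.

    Requires an initialized process group; the settings below are
    illustrative, not tuned recommendations.
    """
    state = powerSGD_hook.PowerSGDState(
        process_group=None,  # None selects the default process group
        matrix_approximation_rank=rank,
        start_powerSGD_iter=1000,
    )
    ddp_model.register_comm_hook(state, build_compressed_powersgd_hook(precision))
    return state
```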

The trade-off is a small loss in gradient precision, which is generally acceptable for training stability when using appropriate loss scaling and optimizer settings.

Usage

Apply gradient compression when training large models across multiple GPUs or nodes where gradient communication bandwidth is a bottleneck. Register the appropriate DDP communication hook (FP16 or BF16) on the DistributedDataParallel model before training begins.
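A minimal registration sketch, assuming PyTorch's built-in `fp16_compress_hook` and `bf16_compress_hook` are acceptable stand-ins for the hooks in the InternVL implementation. The helper names `select_compression_hook` and `register_gradient_compression` are hypothetical, and the default of `"bf16"` is only an example choice.

```python
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks


def select_compression_hook(precision: str):
    """Map a precision name to the matching built-in DDP communication hook."""
    hooks = {
        "fp16": default_hooks.fp16_compress_hook,
        "bf16": default_hooks.bf16_compress_hook,
    }
    if precision not in hooks:
        raise ValueError(f"unsupported precision: {precision!r}")
    return hooks[precision]


def register_gradient_compression(ddp_model, precision="bf16"):
    """Attach the hook to a DistributedDataParallel model before training.

    Passing state=None makes the built-in hooks use the default process group.
    """
    ddp_model.register_comm_hook(state=None, hook=select_compression_hook(precision))
```

In a training script this would be called once, right after the model is wrapped in `DistributedDataParallel` and before the first backward pass.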

Theoretical Basis

Gradient compression leverages the observation that deep learning training is robust to small perturbations in gradient values. Casting to half precision introduces rounding error whose relative size is set by the mantissa width (10 explicit bits for float16, 7 for bfloat16), while the representable range determines when values overflow or underflow. Empirically, this noise does not degrade model convergence for most architectures when combined with loss scaling.
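The size of this perturbation can be measured directly. The sketch below round-trips a range of float32 values through each 16-bit format and reports the largest relative error; the helper name `max_relative_rounding_error` and the sample range are illustrative choices, and the values are kept well inside the normal range of both formats so that only mantissa rounding is at play.

```python
import torch


def max_relative_rounding_error(values, dtype):
    """Largest relative error after casting float32 values to dtype and back."""
    roundtrip = values.to(dtype).to(torch.float32)
    return ((values - roundtrip).abs() / values.abs()).max().item()


# Sample values chosen for illustration; all are normal numbers in both formats.
x = torch.linspace(0.5, 4.0, steps=1001)

fp16_err = max_relative_rounding_error(x, torch.float16)
bf16_err = max_relative_rounding_error(x, torch.bfloat16)

print(f"float16  max relative error: {fp16_err:.2e}")
print(f"bfloat16 max relative error: {bf16_err:.2e}")
```

With round-to-nearest conversion, the float16 error stays within 2^-11 and the bfloat16 error within 2^-8 for values in the normal range.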

The key considerations are:

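FP16 has a narrower dynamic range than float32 (its largest finite value is 65504), so summing large gradients across workers can overflow; dividing each gradient by the world size before the allreduce mitigates this.
BF16 keeps float32's 8-bit exponent, and therefore its dynamic range, but has a shorter mantissa than FP16, making it more robust to overflow at the cost of coarser rounding.
Asynchronous execution via futures lets the allreduce, and the decompression chained onto its completion, overlap with the rest of the backward pass.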
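The overflow consideration can be reproduced on a single process. In this sketch a local sum over a list of tensors stands in for allreduce(SUM), and the gradient value 32768 with four workers is picked purely so that the arithmetic is exact in every format; the helper name `allreduce_sum` is hypothetical.

```python
import torch


def allreduce_sum(tensors):
    """Local stand-in for allreduce(SUM): accumulate in the tensors' own dtype."""
    total = tensors[0].clone()
    for t in tensors[1:]:
        total.add_(t)
    return total


world_size = 4
# Illustrative per-worker gradient value; not taken from any real training run.
local = [torch.full((1,), 32768.0) for _ in range(world_size)]

# FP16 without pre-division: the running sum exceeds 65504 and becomes inf.
fp16_naive = allreduce_sum([g.to(torch.float16) for g in local])

# FP16 with division by world size before the sum: the result is the mean.
fp16_prescaled = allreduce_sum([g.to(torch.float16).div_(world_size) for g in local])

# BF16 without pre-division: the wider exponent keeps the sum finite.
bf16_naive = allreduce_sum([g.to(torch.bfloat16) for g in local])

print(fp16_naive, fp16_prescaled, bf16_naive)
```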

Related Pages

Implementation:OpenGVLab_InternVL_DDP_Communication_Hooks
