Principle:OpenGVLab InternVL DDP Gradient Compression

From Leeroopedia
Revision as of 17:13, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/OpenGVLab_InternVL_DDP_Gradient_Compression.md)


Knowledge Sources
Domains: Distributed Training, Gradient Compression, Optimization
Last Updated: 2026-02-07 14:00 GMT

Overview

The DDP gradient compression principle reduces inter-GPU communication bandwidth during distributed training by casting gradient tensors to half-precision formats before allreduce operations.

Description

In distributed data parallel (DDP) training, gradient synchronization via allreduce is often the communication bottleneck, especially for large models like InternViT-6B. This principle addresses the bottleneck in three steps:

  1. Pre-reduction compression: Gradient tensors are cast from FP32 to half-precision (FP16 or BF16) before the allreduce operation, halving the communication volume.
  2. Division before reduction: The tensor is divided by the world size before allreduce (rather than after), so the summed values stay smaller and are less likely to overflow the half-precision range (FP16 tops out at 65504).
  3. Post-reduction decompression: After allreduce, the compressed result is copied back to the original FP32 buffer in-place to minimize peak memory usage.
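The three steps above can be sketched without any GPU by simulating the ranks in plain Python. This is a minimal illustration, not the InternVL implementation: `to_fp16`, `allreduce_sum_fp16`, `compressed_average`, and `sum_then_divide` are hypothetical helper names, FP16 rounding is emulated with the standard library's `struct` half-precision format, and the simulated allreduce is assumed to accumulate in FP16.

```python
import math
import struct
from typing import List


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value, saturating to +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def allreduce_sum_fp16(per_rank: List[List[float]]) -> List[float]:
    """Simulated allreduce(SUM); the running total is rounded to FP16 (assumption)."""
    total = [0.0] * len(per_rank[0])
    for bucket in per_rank:
        total = [to_fp16(t + g) for t, g in zip(total, bucket)]
    return total


def compressed_average(per_rank: List[List[float]], buffer: List[float]) -> List[float]:
    """Cast to FP16, divide by world size, reduce, then copy back in place."""
    world_size = len(per_rank)
    # Steps 1 and 2: compress, then divide BEFORE the reduction.
    compressed = [[to_fp16(to_fp16(g) / world_size) for g in bucket] for bucket in per_rank]
    reduced = allreduce_sum_fp16(compressed)
    # Step 3: overwrite the existing buffer instead of allocating a new one.
    buffer[:] = reduced
    return buffer


def sum_then_divide(per_rank: List[List[float]]) -> List[float]:
    """Counter-example: reduce first, divide afterwards."""
    world_size = len(per_rank)
    compressed = [[to_fp16(g) for g in bucket] for bucket in per_rank]
    return [to_fp16(t / world_size) for t in allreduce_sum_fp16(compressed)]


# Four simulated ranks with identical gradients; 4 * 30000 exceeds the FP16 maximum.
grads = [[30000.0, 0.5] for _ in range(4)]
buffer = list(grads[0])
averaged = compressed_average(grads, buffer)
naive = sum_then_divide(grads)

print(averaged)  # [30000.0, 0.5]
print(naive)     # [inf, 0.5]
```

Dividing first keeps every partial sum at or below the largest per-rank gradient, which is why the first element survives here while the sum-then-divide variant saturates to infinity.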

The principle supports both direct hooks (which replace the default allreduce) and wrappers (which compose compression with any existing communication hook, such as PowerSGD), providing flexibility in combining multiple gradient optimization strategies.
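The difference between a direct hook and a wrapper can be shown with a toy model in which a hook is simply a function from all ranks' buckets to one reduced bucket. Everything in this sketch is hypothetical: the `Hook` type, `mean_hook` (a stand-in for the default allreduce), and `top1_hook` (a stand-in for some other strategy, far simpler than PowerSGD). For brevity the direct hook is expressed as the wrapper applied to the plain mean.

```python
import struct
from typing import Callable, List

Bucket = List[float]
# Hypothetical hook type for this sketch: every rank's bucket in, reduced bucket out.
Hook = Callable[[List[Bucket]], Bucket]


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (inputs here stay within FP16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mean_hook(per_rank: List[Bucket]) -> Bucket:
    """Stand-in for the default allreduce: elementwise mean across ranks."""
    world_size = len(per_rank)
    return [sum(column) / world_size for column in zip(*per_rank)]


def top1_hook(per_rank: List[Bucket]) -> Bucket:
    """Stand-in for another strategy: keep only the largest-magnitude mean entry."""
    mean = mean_hook(per_rank)
    keep = max(range(len(mean)), key=lambda i: abs(mean[i]))
    return [value if i == keep else 0.0 for i, value in enumerate(mean)]


def fp16_compress_wrapper(inner: Hook) -> Hook:
    """Return a hook that casts every bucket to FP16 before calling `inner`."""
    def wrapped(per_rank: List[Bucket]) -> Bucket:
        compressed = [[to_fp16(g) for g in bucket] for bucket in per_rank]
        return inner(compressed)
    return wrapped


# Direct hook: compression around the plain reduction.
fp16_compress_hook = fp16_compress_wrapper(mean_hook)
# Wrapper use: the same compression composed with a different strategy.
fp16_top1_hook = fp16_compress_wrapper(top1_hook)

grads = [[0.1, -2.0, 0.3], [0.3, -4.0, 0.1]]
direct = fp16_compress_hook(grads)
composed = fp16_top1_hook(grads)
print(composed)  # [0.0, -3.0, 0.0]
```

PyTorch itself ships hooks of this shape in `torch.distributed.algorithms.ddp_comm_hooks.default_hooks` (`fp16_compress_hook`, `bf16_compress_hook`, `fp16_compress_wrapper`, `bf16_compress_wrapper`), registered through `DistributedDataParallel.register_comm_hook(state, hook)`. Whether the InternVL code calls those exact functions is not stated on this page.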

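BFloat16 (BF16) compression is preferred when the dynamic range of FP16 is insufficient: BF16 keeps the 8-bit exponent of FP32, so it covers roughly the same range, at the cost of fewer mantissa bits (7 versus 10 in FP16). BF16 compression requires an NCCL version later than 2.9.6.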
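The range-versus-precision trade-off can be checked numerically. The sketch below emulates both formats with the standard library only; `to_fp16` and `to_bf16` are hypothetical helpers, and the BF16 emulation (round-to-nearest-even on the upper 16 bits of the FP32 encoding) assumes finite, non-NaN inputs.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value, saturating to +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate BF16 by rounding the FP32 bit pattern to its upper 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that are discarded.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Range: a large gradient overflows FP16 but survives in BF16 (coarsely rounded).
print(to_fp16(100000.0))  # inf
print(to_bf16(100000.0))  # 99840.0

# Range at the small end: FP16 flushes a tiny gradient to zero, BF16 does not.
print(to_fp16(1e-8))      # 0.0
print(to_bf16(1e-8) > 0)  # True

# Precision: FP16 resolves a small offset from 1.0 that BF16 rounds away.
print(to_fp16(1.001))     # 1.0009765625
print(to_bf16(1.001))     # 1.0
```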

Usage

Apply this principle when training large vision models in multi-GPU settings where gradient communication is a bottleneck, particularly for the InternVL segmentation pipeline.

Theoretical Basis

Gradient compression is a well-studied technique in distributed deep learning. Half-precision allreduce halves the number of bytes communicated, with minimal impact on convergence for most training scenarios, because the rounding noise introduced by compression is typically smaller than the stochastic noise already present in mini-batch gradients. Copying the reduced result back into the existing FP32 buffer in place avoids allocating a second full-size tensor, which keeps peak memory down.
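As a rough worked example of the 2x figure, assume a model with about $P = 6 \times 10^{9}$ parameters (a round number suggested by the name InternViT-6B, not a measured count) and one gradient value per parameter. The gradient payload per synchronization is then:

```latex
V_{\mathrm{FP32}} = P \cdot 4\ \mathrm{bytes} = 6 \times 10^{9} \cdot 4\ \mathrm{bytes} = 24\ \mathrm{GB},
\qquad
V_{\mathrm{FP16/BF16}} = P \cdot 2\ \mathrm{bytes} = 6 \times 10^{9} \cdot 2\ \mathrm{bytes} = 12\ \mathrm{GB}
```

The bytes each GPU actually sends also depend on the allreduce algorithm in use, but that factor applies equally to both precisions, so the ratio stays at 2.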
