
Heuristic:Deepspeedai DeepSpeed Shared Memory Sizing

From Leeroopedia




Knowledge Sources
Domains Infrastructure, Distributed_Training
Last Updated 2026-02-09 00:00 GMT

Overview

NCCL-based distributed training requires /dev/shm to be at least 512MB; Docker's default of 64MB causes silent failures or cryptic collective-operation errors. Use `--shm-size='1gb'` or larger.

Description

NCCL (NVIDIA Collective Communications Library) uses shared memory (`/dev/shm`) for inter-process communication on the same node. Docker containers default to a 64MB shared memory limit, which is far too small for NCCL's needs during multi-GPU training. When `/dev/shm` is undersized (< 512MB), NCCL operations may fail silently or cause cryptic errors during all-reduce, all-gather, and other collective operations. DeepSpeed's `ds_report` utility checks `/dev/shm` size and warns if it appears insufficient.
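A quick way to see whether a container is affected is to inspect the mounted tmpfs directly (a minimal check, assuming a Linux environment where /dev/shm is mounted):

```shell
# Report the size of the shared-memory filesystem.
# In an unconfigured Docker container this typically shows 64M;
# on a properly configured training node it should be >= 512M.
df -h /dev/shm
```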

Usage

Use this heuristic whenever running DeepSpeed in a Docker container or any environment where `/dev/shm` size may be restricted. This is one of the most common causes of "mysterious" distributed training failures in containerized environments.

The Insight (Rule of Thumb)

  • Action: Set `--shm-size='1gb'` (minimum) when running Docker containers for distributed training. For large-scale training, use `--shm-size='8gb'` or larger.
  • Value: Minimum 512MB, recommended >= 1GB.
  • Trade-off: A larger `/dev/shm` cap lets shared memory consume more system RAM, but tmpfs only uses RAM for pages actually written, and the amount NCCL needs is negligible on GPU training nodes.
  • Alternative: Use `--ipc=host` in Docker to share the host's `/dev/shm` (this shares the host IPC namespace, so avoid it on multi-tenant hosts).
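The actions above can be sketched as `docker run` invocations; the image name and training script here are placeholders, not part of the original source:

```shell
# Minimum for multi-GPU NCCL training: raise /dev/shm to 1 GB.
docker run --gpus all --shm-size='1gb' my-training-image deepspeed train.py

# Large-scale jobs: allocate more shared memory up front.
docker run --gpus all --shm-size='8gb' my-training-image deepspeed train.py

# Alternative: share the host's IPC namespace (and its /dev/shm) instead.
docker run --gpus all --ipc=host my-training-image deepspeed train.py
```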

Reasoning

NCCL implements shared-memory-based communication optimizations for intra-node GPU-to-GPU transfers. When shared memory is insufficient, NCCL falls back to slower communication paths or fails entirely. The 512MB threshold in DeepSpeed's check is conservative; actual NCCL requirements depend on the number of GPUs, tensor sizes, and communication patterns. Production multi-GPU training typically benefits from 1-8GB of shared memory.
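On a bare-metal host (rather than a container), `/dev/shm` is a tmpfs mount and can be resized in place; this is a sketch assuming root access, with the 8G value chosen to match the upper end of the range above:

```shell
# Resize the existing tmpfs mount at /dev/shm without unmounting it.
# The size is a cap, not a reservation, so this does not immediately
# consume 8 GB of RAM.
sudo mount -o remount,size=8G /dev/shm
```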

Code Evidence

Shared memory check from `deepspeed/env_report.py:103-120`:

def get_shm_size():
    try:
        shm_stats = os.statvfs('/dev/shm')
    except (OSError, FileNotFoundError, ValueError, AttributeError):
        return "UNKNOWN", None

    shm_size = shm_stats.f_frsize * shm_stats.f_blocks
    shm_hbytes = human_readable_size(shm_size)
    warn = []
    if shm_size < 512 * 1024**2:
        warn.append(
            " [WARNING] /dev/shm size might be too small, if running in docker "
            "increase to at least --shm-size='1gb'"
        )
        if get_accelerator().communication_backend_name() == "nccl":
            warn.append(
                " [WARNING] see more details about NCCL requirements: "
                "https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#sharing-data"
            )
    return shm_hbytes, warn
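DeepSpeed exposes this check through its environment-report CLI, installed with the `deepspeed` package; running it on the target node surfaces the warning above when `/dev/shm` is undersized:

```shell
# Print DeepSpeed's environment report, which includes the
# /dev/shm size check implemented in get_shm_size() above.
ds_report
```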
