Heuristic: deepspeedai/DeepSpeed Shared Memory Sizing
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
NCCL-based distributed training requires `/dev/shm` to be at least 512MB; Docker's default of 64MB causes silent failures. Use `--shm-size='1gb'` or larger.
Description
NCCL (NVIDIA Collective Communications Library) uses shared memory (`/dev/shm`) for inter-process communication on the same node. Docker containers default to a 64MB shared memory limit, which is far too small for NCCL's needs during multi-GPU training. When `/dev/shm` is undersized (< 512MB), NCCL operations may fail silently or cause cryptic errors during all-reduce, all-gather, and other collective operations. DeepSpeed's `ds_report` utility checks `/dev/shm` size and warns if it appears insufficient.
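A minimal standalone version of this check, using only the standard library, can be sketched as follows. The helper name `shm_bytes` is ours, not DeepSpeed's API; the 512MB threshold mirrors the one described above:

```python
import os

def shm_bytes(path="/dev/shm"):
    """Return the total size of the shared-memory mount in bytes,
    or None if the mount is unavailable (e.g. non-Linux hosts)."""
    try:
        st = os.statvfs(path)
    except OSError:
        return None
    # Total size = fragment size * total number of blocks
    return st.f_frsize * st.f_blocks

size = shm_bytes()
if size is not None and size < 512 * 1024**2:
    print(f"/dev/shm is only {size / 1024**2:.0f} MB; NCCL may fail silently")
```

Running this inside a default Docker container typically reports roughly 64 MB, which is what triggers the failures described above.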
Usage
Use this heuristic whenever running DeepSpeed in a Docker container or any environment where `/dev/shm` size may be restricted. This is one of the most common causes of "mysterious" distributed training failures in containerized environments.
The Insight (Rule of Thumb)
- Action: Set `--shm-size='1gb'` (minimum) when running Docker containers for distributed training. For large-scale training, use `--shm-size='8gb'` or larger.
- Value: Minimum 512MB, recommended >= 1GB.
- Trade-off: Allocating shared memory reduces available system RAM, but the amount needed is negligible for GPU training nodes.
- Alternative: Use `--ipc=host` in Docker to share the host's `/dev/shm` (at the cost of IPC namespace isolation between containers).
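The two options above look like the following in practice. The image name, script, and config file are placeholders; the `--shm-size` and `--ipc=host` flags are the standard `docker run` options:

```shell
# Option 1: enlarge the container's private /dev/shm
docker run --gpus all --shm-size='8gb' my-training-image \
    deepspeed train.py --deepspeed ds_config.json

# Option 2: share the host's IPC namespace (and its /dev/shm)
docker run --gpus all --ipc=host my-training-image \
    deepspeed train.py --deepspeed ds_config.json
```

Option 1 keeps container isolation and is usually preferable; Option 2 is simpler when the host's `/dev/shm` is already generously sized.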
Reasoning
NCCL implements shared-memory-based communication optimizations for intra-node GPU-to-GPU transfers. When shared memory is insufficient, NCCL falls back to slower communication paths or fails entirely. The 512MB threshold in DeepSpeed's check is conservative; actual NCCL requirements depend on the number of GPUs, tensor sizes, and communication patterns. Production multi-GPU training typically benefits from 1-8GB of shared memory.
Code Evidence
Shared memory check from `deepspeed/env_report.py:103-120`:
```python
def get_shm_size():
    try:
        shm_stats = os.statvfs('/dev/shm')
    except (OSError, FileNotFoundError, ValueError, AttributeError):
        return "UNKNOWN", None

    shm_size = shm_stats.f_frsize * shm_stats.f_blocks
    shm_hbytes = human_readable_size(shm_size)

    warn = []
    if shm_size < 512 * 1024**2:
        warn.append(
            " [WARNING] /dev/shm size might be too small, if running in docker "
            "increase to at least --shm-size='1gb'")
        if get_accelerator().communication_backend_name() == "nccl":
            warn.append(
                " [WARNING] see more details about NCCL requirements: "
                "https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#sharing-data")
    return shm_hbytes, warn
```
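This check runs as part of DeepSpeed's environment report. Invoking it from the shell surfaces the warning text shown in the code above when `/dev/shm` is undersized:

```shell
# Print DeepSpeed's environment diagnostics, including the /dev/shm check
ds_report

# On an undersized container, the report includes the warning from get_shm_size():
#  [WARNING] /dev/shm size might be too small, if running in docker increase to at least --shm-size='1gb'
```

Running `ds_report` before launching a multi-GPU job is a cheap way to catch this misconfiguration before it manifests as a hung or failed collective.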