Environment: facebookresearch/habitat-lab SLURM Distributed Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
SLURM cluster or torch.distributed.launch environment with NCCL/GLOO backends, network interface configuration, and signal-based preemption for DD-PPO distributed training.
Description
This environment provides the distributed training infrastructure for Decentralized Distributed PPO (DD-PPO) in Habitat-Lab. It supports two launch methods: SLURM job scheduling (via `srun`) and PyTorch distributed launch (via `torchrun`/`torch.distributed.launch`). The system auto-detects the launch method from environment variables, configures network interfaces for NCCL and GLOO backends, handles SLURM job preemption via signal handlers, and manages checkpoint resume state across job restarts.
Usage
Use this environment when running multi-GPU or multi-node DD-PPO training. It is required by the `init_distrib_slurm` implementation and by any training configuration that runs the DD-PPO algorithm across multiple processes.
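What `init_distrib_slurm` ultimately does is build a rendezvous store and initialize the process group. A minimal single-process sketch of that rendezvous, using GLOO so it runs without a GPU (this is an illustration, not the Habitat-Lab implementation; the port value is the documented default):

```python
import torch.distributed as distrib

def init_single_process(addr: str = "127.0.0.1", port: int = 8738):
    # Build a TCPStore at MAIN_ADDR:MAIN_PORT; with world_size=1 and
    # is_master=True this process hosts the store itself.
    store = distrib.TCPStore(addr, port, 1, True)
    # Hand the store to init_process_group, as init_distrib_slurm does.
    distrib.init_process_group("gloo", store=store, rank=0, world_size=1)
    return store

store = init_single_process()
print(distrib.get_rank(), distrib.get_world_size())
```

In real multi-process runs, the rank and world size come from the scheduler's environment variables (see Credentials below) rather than being hard-coded.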
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | SLURM typically runs on Linux clusters |
| Hardware | Multiple NVIDIA GPUs | One GPU per training process |
| Network | InfiniBand or high-speed Ethernet | Required for NCCL inter-node communication |
| Software | SLURM workload manager OR torchrun | For process launching and scheduling |
| Software | NCCL library | For GPU-to-GPU communication backend |
Dependencies
System Packages
- SLURM (`srun`, `scontrol`, `sbatch`) OR PyTorch distributed launcher (`torchrun`)
- NCCL (NVIDIA Collective Communications Library)
- Network interface tools
Python Packages
- `torch` >= 1.3.1 (with `torch.distributed` support)
- `ifcfg` (for automatic network interface detection)
Credentials
The following environment variables must be set by the scheduler or manually:
SLURM mode (auto-set by SLURM):
- `SLURM_JOB_ID`: Unique job identifier
- `SLURM_JOB_NAME`: Job name (used to detect batch vs interactive)
- `SLURM_LOCALID`: Local GPU rank within the node
- `SLURM_PROCID`: Global process rank
- `SLURM_NTASKS`: Total number of processes (world size)
torch.distributed.launch mode:
- `LOCAL_RANK`: Local GPU rank
- `RANK`: Global process rank
- `WORLD_SIZE`: Total number of processes
Optional overrides:
- `MAIN_PORT`: TCP port for rendezvous (default: 8738)
- `MAIN_PORT_RANGE`: Port range for SLURM job offset (default: 127)
- `MAIN_ADDR`: Address of rank-0 process (default: "127.0.0.1")
- `GLOO_SOCKET_IFNAME`: Network interface for GLOO backend (auto-detected if unset)
- `NCCL_SOCKET_IFNAME`: Network interface for NCCL backend (auto-detected if unset)
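When auto-detection picks the wrong interface or the default port collides with another job, these overrides can be pinned explicitly in the job script. A hedged sbatch-script fragment (the interface name `eth0` and the `scontrol show hostnames` lookup are assumptions; verify the interface with `ip link` on your cluster):

```shell
# Hypothetical sbatch script fragment: pin rendezvous settings explicitly.
# Resolve the first node in the allocation as the rank-0 address.
export MAIN_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MAIN_PORT=8738            # documented default; change on conflict
export NCCL_SOCKET_IFNAME=eth0   # assumption -- check `ip link` for your NIC
export GLOO_SOCKET_IFNAME=eth0
srun python -u habitat_baselines/run.py --config-name=pointnav/ddppo_pointnav.yaml
```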
Quick Install
# Install distributed training dependencies
pip install "torch>=1.3.1" ifcfg
# SLURM launch example (single node, 4 GPUs)
srun --gres=gpu:4 --ntasks=4 --ntasks-per-node=4 \
    python -u habitat_baselines/run.py \
    --config-name=pointnav/ddppo_pointnav.yaml
# torchrun launch example (single node, 4 GPUs)
torchrun --nproc_per_node=4 \
    habitat_baselines/run.py \
    --config-name=pointnav/ddppo_pointnav.yaml
Code Evidence
Distributed size detection from `habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py:247-264`:
def get_distrib_size() -> Tuple[int, int, int]:
    # Check to see if we should parse from torch.distributed.launch
    if os.environ.get("LOCAL_RANK", None) is not None:
        local_rank = int(os.environ["LOCAL_RANK"])
        world_rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
    # Else parse from SLURM if using SLURM
    elif os.environ.get("SLURM_JOBID", None) is not None:
        local_rank = int(os.environ["SLURM_LOCALID"])
        world_rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
    # Otherwise setup for just 1 process, this is nice for testing
    else:
        local_rank = 0
        world_rank = 0
        world_size = 1

    return local_rank, world_rank, world_size
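The branch order above means the torchrun variables win even when SLURM variables are also present. This can be exercised with a standalone copy of the logic; `distrib_size_from` is a hypothetical mirror of `get_distrib_size` written for illustration, not part of Habitat-Lab:

```python
from typing import Mapping, Tuple

def distrib_size_from(environ: Mapping[str, str]) -> Tuple[int, int, int]:
    # Hypothetical standalone mirror of get_distrib_size.
    if environ.get("LOCAL_RANK") is not None:   # torchrun branch wins
        return (int(environ["LOCAL_RANK"]),
                int(environ["RANK"]),
                int(environ["WORLD_SIZE"]))
    if environ.get("SLURM_JOBID") is not None:  # SLURM branch
        return (int(environ["SLURM_LOCALID"]),
                int(environ["SLURM_PROCID"]),
                int(environ["SLURM_NTASKS"]))
    return (0, 0, 1)                            # single-process fallback

# torchrun variables take precedence even with SLURM variables present:
both = {"LOCAL_RANK": "1", "RANK": "5", "WORLD_SIZE": "8",
        "SLURM_JOBID": "42", "SLURM_LOCALID": "0",
        "SLURM_PROCID": "0", "SLURM_NTASKS": "4"}
print(distrib_size_from(both))  # (1, 5, 8)
print(distrib_size_from({}))    # (0, 0, 1)
```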
Network interface and process group initialization from `habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py:271-309`:
def init_distrib_slurm(backend: str = "nccl") -> Tuple[int, torch.distributed.TCPStore]:
    assert torch.distributed.is_available(), "torch.distributed must be available"

    if "GLOO_SOCKET_IFNAME" not in os.environ:
        os.environ["GLOO_SOCKET_IFNAME"] = get_ifname()
    if "NCCL_SOCKET_IFNAME" not in os.environ:
        os.environ["NCCL_SOCKET_IFNAME"] = get_ifname()

    local_rank, world_rank, world_size = get_distrib_size()

    main_port = int(os.environ.get("MAIN_PORT", DEFAULT_PORT))
    if SLURM_JOBID is not None:
        main_port += int(SLURM_JOBID) % int(
            os.environ.get("MAIN_PORT_RANGE", DEFAULT_PORT_RANGE)
        )
SLURM batch job detection from `habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py:58-71`:
def is_slurm_batch_job() -> bool:
    r"""Heuristic to determine if a slurm job is a batch job or not."""
    return is_slurm_job() and os.environ.get("SLURM_JOB_NAME", None) not in (
        None, "bash", "zsh", "fish", "tcsh", "sh", "interactive",
    )
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `AssertionError: torch.distributed must be available` | PyTorch built without distributed support | Reinstall PyTorch with distributed support |
| `RuntimeError: NCCL error` | Network interface misconfigured | Set `NCCL_SOCKET_IFNAME` to the correct interface (e.g., `eth0`) |
| `Address already in use` | Port conflict from concurrent jobs | Set `MAIN_PORT` to an unused port or use `MAIN_PORT_RANGE` |
| `scontrol: error: Job requeue failed` | Not a SLURM batch job | Requeue only works with `sbatch`, not `srun` interactive jobs |
Compatibility Notes
- GLOO backend: Default in config (`distrib_backend: "GLOO"`). Used for CPU-only distributed training and for VER preemption decider.
- NCCL backend: Specified in DD-PPO YAML configs (`distrib_backend: NCCL`). Required for efficient GPU-to-GPU communication.
- Signal handling: SLURM preemption uses SIGTERM (save and exit), SIGUSR1 (requeue), SIGUSR2 (clean exit without timer). When using NCCL, use `scancel --signal SIGUSR2` for clean shutdown.
- Port allocation: When running multiple SLURM jobs, the port is automatically offset by `SLURM_JOB_ID % MAIN_PORT_RANGE` to avoid conflicts.