
Environment:Facebookresearch Habitat lab SLURM Distributed Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Distributed_Training
Last Updated 2026-02-15 00:00 GMT

Overview

A SLURM cluster or `torch.distributed.launch` environment with NCCL/GLOO backends, network-interface configuration, and signal-based preemption handling for DD-PPO distributed training.

Description

This environment provides the distributed training infrastructure for Decentralized Distributed PPO (DD-PPO) in Habitat-Lab. It supports two launch methods: SLURM job scheduling (via `srun`) and PyTorch distributed launch (via `torchrun`/`torch.distributed.launch`). The system auto-detects the launch method from environment variables, configures network interfaces for NCCL and GLOO backends, handles SLURM job preemption via signal handlers, and manages checkpoint resume state across job restarts.
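The launch-method auto-detection can be sketched with a small stand-alone function. This is a toy restatement of the dispatch described above, not Habitat-Lab code; the real logic is `get_distrib_size`, quoted under Code Evidence, and `detect_launch_method` is an illustrative name:

```python
import os

def detect_launch_method(environ=None):
    # Toy version of the launch-method auto-detection: torchrun sets
    # LOCAL_RANK, srun under SLURM sets SLURM_JOBID, and with neither
    # present the system falls back to a single process.
    environ = os.environ if environ is None else environ
    if environ.get("LOCAL_RANK") is not None:
        return "torchrun"        # torchrun / torch.distributed.launch
    if environ.get("SLURM_JOBID") is not None:
        return "slurm"           # launched via srun under SLURM
    return "single-process"      # fallback, convenient for local testing
```

Passing an explicit dict instead of `os.environ` makes the dispatch easy to exercise without an actual scheduler.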

Usage

Use this environment when running multi-GPU or multi-node DD-PPO training. It is required for the `init_distrib_slurm` implementation and for any training configuration that runs the DD-PPO algorithm across multiple processes.

System Requirements

  • OS: Linux (SLURM typically runs on Linux clusters)
  • Hardware: multiple NVIDIA GPUs (one GPU per training process)
  • Network: InfiniBand or high-speed Ethernet (required for NCCL inter-node communication)
  • Software: SLURM workload manager OR `torchrun` (process launching and scheduling)
  • Software: NCCL library (GPU-to-GPU communication backend)

Dependencies

System Packages

  • SLURM (`srun`, `scontrol`, `sbatch`) OR PyTorch distributed launcher (`torchrun`)
  • NCCL (NVIDIA Collective Communications Library)
  • Network interface tools

Python Packages

  • `torch` >= 1.3.1 (with `torch.distributed` support)
  • `ifcfg` (for automatic network interface detection)

Credentials

The following environment variables must be set by the scheduler or manually:

SLURM mode (auto-set by SLURM):

  • `SLURM_JOB_ID`: Unique job identifier
  • `SLURM_JOB_NAME`: Job name (used to detect batch vs interactive)
  • `SLURM_LOCALID`: Local GPU rank within the node
  • `SLURM_PROCID`: Global process rank
  • `SLURM_NTASKS`: Total number of processes (world size)

torch.distributed.launch mode:

  • `LOCAL_RANK`: Local GPU rank
  • `RANK`: Global process rank
  • `WORLD_SIZE`: Total number of processes

Optional overrides:

  • `MAIN_PORT`: TCP port for rendezvous (default: 8738)
  • `MAIN_PORT_RANGE`: Port range for SLURM job offset (default: 127)
  • `MAIN_ADDR`: Address of rank-0 process (default: "127.0.0.1")
  • `GLOO_SOCKET_IFNAME`: Network interface for GLOO backend (auto-detected if unset)
  • `NCCL_SOCKET_IFNAME`: Network interface for NCCL backend (auto-detected if unset)
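How `MAIN_PORT`, `MAIN_PORT_RANGE`, and the SLURM job ID combine can be sketched as follows. This is a minimal stand-in mirroring the port-offset arithmetic quoted under Code Evidence; `resolve_main_port` is an illustrative name, not part of Habitat-Lab:

```python
import os

DEFAULT_PORT = 8738        # default MAIN_PORT
DEFAULT_PORT_RANGE = 127   # default MAIN_PORT_RANGE

def resolve_main_port(environ=None):
    # Each SLURM job shifts the rendezvous port by JOBID % range so that
    # concurrent jobs do not collide on a single port.
    environ = os.environ if environ is None else environ
    port = int(environ.get("MAIN_PORT", DEFAULT_PORT))
    jobid = environ.get("SLURM_JOB_ID")
    if jobid is not None:
        port += int(jobid) % int(environ.get("MAIN_PORT_RANGE", DEFAULT_PORT_RANGE))
    return port
```

For example, job 12345 with the defaults rendezvous on port 8738 + (12345 % 127) = 8764.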

Quick Install

# Install distributed training dependencies
pip install "torch>=1.3.1" ifcfg

# SLURM launch example (single node, 4 GPUs)
srun --gres=gpu:4 --ntasks=4 --ntasks-per-node=4 \
    python -u habitat_baselines/run.py \
    --config-name=pointnav/ddppo_pointnav.yaml

# torchrun launch example (single node, 4 GPUs)
torchrun --nproc_per_node=4 \
    habitat_baselines/run.py \
    --config-name=pointnav/ddppo_pointnav.yaml

Code Evidence

Distributed size detection from `habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py:247-264`:

def get_distrib_size() -> Tuple[int, int, int]:
    # Check to see if we should parse from torch.distributed.launch
    if os.environ.get("LOCAL_RANK", None) is not None:
        local_rank = int(os.environ["LOCAL_RANK"])
        world_rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
    # Else parse from SLURM if we are running under SLURM
    elif os.environ.get("SLURM_JOBID", None) is not None:
        local_rank = int(os.environ["SLURM_LOCALID"])
        world_rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
    # Otherwise setup for just 1 process, this is nice for testing
    else:
        local_rank = 0
        world_rank = 0
        world_size = 1
    return local_rank, world_rank, world_size

Network interface and process group initialization from `habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py:271-309`:

def init_distrib_slurm(backend: str = "nccl") -> Tuple[int, torch.distributed.TCPStore]:
    assert torch.distributed.is_available(), "torch.distributed must be available"

    if "GLOO_SOCKET_IFNAME" not in os.environ:
        os.environ["GLOO_SOCKET_IFNAME"] = get_ifname()
    if "NCCL_SOCKET_IFNAME" not in os.environ:
        os.environ["NCCL_SOCKET_IFNAME"] = get_ifname()

    local_rank, world_rank, world_size = get_distrib_size()
    main_port = int(os.environ.get("MAIN_PORT", DEFAULT_PORT))
    if SLURM_JOBID is not None:
        main_port += int(SLURM_JOBID) % int(
            os.environ.get("MAIN_PORT_RANGE", DEFAULT_PORT_RANGE)
        )

SLURM batch job detection from `habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py:58-71`:

def is_slurm_batch_job() -> bool:
    r"""Heuristic to determine if a slurm job is a batch job or not."""
    return is_slurm_job() and os.environ.get("SLURM_JOB_NAME", None) not in (
        None, "bash", "zsh", "fish", "tcsh", "sh", "interactive",
    )

Common Errors

  • `RuntimeError: torch.distributed must be available`: PyTorch was built without distributed support. Reinstall PyTorch with distributed support.
  • `RuntimeError: NCCL error`: misconfigured network interface. Set `NCCL_SOCKET_IFNAME` to the correct interface (e.g., `eth0`).
  • `Address already in use`: port conflict between concurrent jobs. Set `MAIN_PORT` to an unused port or rely on `MAIN_PORT_RANGE`.
  • `scontrol: error: Job requeue failed`: the job is not a SLURM batch job. Requeue only works for `sbatch` jobs, not `srun` interactive jobs.

Compatibility Notes

  • GLOO backend: Default in config (`distrib_backend: "GLOO"`). Used for CPU-only distributed training and for VER preemption decider.
  • NCCL backend: Specified in DD-PPO YAML configs (`distrib_backend: NCCL`). Required for efficient GPU-to-GPU communication.
  • Signal handling: SLURM preemption uses SIGTERM (save and exit), SIGUSR1 (requeue), SIGUSR2 (clean exit without timer). When using NCCL, use `scancel --signal SIGUSR2` for clean shutdown.
  • Port allocation: When running multiple SLURM jobs, the port is automatically offset by `SLURM_JOB_ID % MAIN_PORT_RANGE` to avoid conflicts.
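The preemption signals above can be wired up roughly as follows. This is a minimal sketch under the signal mapping stated in this section; the flag and function names are illustrative, not Habitat-Lab's actual handlers:

```python
import signal

# Illustrative module-level flags a training loop could poll each update.
STATE = {"requeue": False, "exit": False}

def _requeue_handler(signum, frame):
    # SIGUSR1: SLURM is preempting the job -- save a checkpoint and requeue.
    STATE["requeue"] = True

def _clean_exit_handler(signum, frame):
    # SIGTERM / SIGUSR2: save a checkpoint and exit without requeueing.
    STATE["exit"] = True

def install_preemption_handlers():
    signal.signal(signal.SIGUSR1, _requeue_handler)
    signal.signal(signal.SIGTERM, _clean_exit_handler)
    signal.signal(signal.SIGUSR2, _clean_exit_handler)
```

A training loop would check these flags between updates and either requeue itself (e.g., via `scontrol requeue`) or shut down cleanly.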
