
Environment:Facebookresearch Habitat lab SLURM Distributed Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Distributed_Training
Last Updated 2026-02-15 00:00 GMT

Overview

A SLURM cluster or `torch.distributed.launch` environment with NCCL/GLOO backends, network-interface configuration, and signal-based preemption handling for DD-PPO distributed training.

Description

This environment provides the distributed training infrastructure for Decentralized Distributed PPO (DD-PPO) in Habitat-Lab. It supports two launch methods: SLURM job scheduling (via `srun`) and PyTorch distributed launch (via `torchrun`/`torch.distributed.launch`). The system auto-detects the launch method from environment variables, configures network interfaces for NCCL and GLOO backends, handles SLURM job preemption via signal handlers, and manages checkpoint resume state across job restarts.
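The launch-method auto-detection can be sketched with a small stand-alone function. This is a toy restatement of the dispatch described above, not Habitat-Lab code; the real logic is `get_distrib_size`, quoted under Code Evidence, and `detect_launch_method` is an illustrative name:

```python
import os

def detect_launch_method(environ=None):
    # Toy version of the launch-method auto-detection: torchrun sets
    # LOCAL_RANK, srun under SLURM sets SLURM_JOBID, and with neither
    # present the system falls back to a single process.
    environ = os.environ if environ is None else environ
    if environ.get("LOCAL_RANK") is not None:
        return "torchrun"        # torchrun / torch.distributed.launch
    if environ.get("SLURM_JOBID") is not None:
        return "slurm"           # launched via srun under SLURM
    return "single-process"      # fallback, convenient for local testing
```

Passing an explicit dict instead of `os.environ` makes the dispatch easy to exercise without an actual scheduler.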

Usage

Use this environment when running multi-GPU or multi-node DD-PPO training. It is required for the `init_distrib_slurm` implementation and for any training configuration that runs the DD-PPO algorithm across multiple processes.

System Requirements

  • OS: Linux (SLURM typically runs on Linux clusters)
  • Hardware: multiple NVIDIA GPUs (one GPU per training process)
  • Network: InfiniBand or high-speed Ethernet (required for NCCL inter-node communication)
  • Software: SLURM workload manager OR `torchrun` (process launching and scheduling)
  • Software: NCCL library (GPU-to-GPU communication backend)

Dependencies

System Packages

  • SLURM (`srun`, `scontrol`, `sbatch`) OR PyTorch distributed launcher (`torchrun`)
  • NCCL (NVIDIA Collective Communications Library)
  • Network interface tools

Python Packages

  • `torch` >= 1.3.1 (with `torch.distributed` support)
  • `ifcfg` (for automatic network interface detection)

Credentials

The following environment variables must be set by the scheduler or manually:

SLURM mode (auto-set by SLURM):

  • `SLURM_JOB_ID`: Unique job identifier
  • `SLURM_JOB_NAME`: Job name (used to detect batch vs interactive)
  • `SLURM_LOCALID`: Local GPU rank within the node
  • `SLURM_PROCID`: Global process rank
  • `SLURM_NTASKS`: Total number of processes (world size)

torch.distributed.launch mode:

  • `LOCAL_RANK`: Local GPU rank
  • `RANK`: Global process rank
  • `WORLD_SIZE`: Total number of processes

Optional overrides:

  • `MAIN_PORT`: TCP port for rendezvous (default: 8738)
  • `MAIN_PORT_RANGE`: Port range for SLURM job offset (default: 127)
  • `MAIN_ADDR`: Address of rank-0 process (default: "127.0.0.1")
  • `GLOO_SOCKET_IFNAME`: Network interface for GLOO backend (auto-detected if unset)
  • `NCCL_SOCKET_IFNAME`: Network interface for NCCL backend (auto-detected if unset)
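How `MAIN_PORT`, `MAIN_PORT_RANGE`, and the SLURM job ID combine can be sketched as follows. This is a minimal stand-in mirroring the port-offset arithmetic quoted under Code Evidence; `resolve_main_port` is an illustrative name, not part of Habitat-Lab:

```python
import os

DEFAULT_PORT = 8738        # default MAIN_PORT
DEFAULT_PORT_RANGE = 127   # default MAIN_PORT_RANGE

def resolve_main_port(environ=None):
    # Each SLURM job shifts the rendezvous port by JOBID % range so that
    # concurrent jobs do not collide on a single port.
    environ = os.environ if environ is None else environ
    port = int(environ.get("MAIN_PORT", DEFAULT_PORT))
    jobid = environ.get("SLURM_JOB_ID")
    if jobid is not None:
        port += int(jobid) % int(environ.get("MAIN_PORT_RANGE", DEFAULT_PORT_RANGE))
    return port
```

For example, job 12345 with the defaults rendezvous on port 8738 + (12345 % 127) = 8764.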

Quick Install

# Install distributed training dependencies
pip install "torch>=1.3.1" ifcfg

# SLURM launch example (single node, 4 GPUs)
srun --gres=gpu:4 --ntasks=4 --ntasks-per-node=4 \
    python -u habitat_baselines/run.py \
    --config-name=pointnav/ddppo_pointnav.yaml

# torchrun launch example (single node, 4 GPUs)
torchrun --nproc_per_node=4 \
    habitat_baselines/run.py \
    --config-name=pointnav/ddppo_pointnav.yaml

Code Evidence

Distributed size detection from `habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py:247-264`:

def get_distrib_size() -> Tuple[int, int, int]:
    # Check to see if we should parse from torch.distributed.launch
    if os.environ.get("LOCAL_RANK", None) is not None:
        local_rank = int(os.environ["LOCAL_RANK"])
        world_rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
    # Else parse from SLURM if we are running under SLURM
    elif os.environ.get("SLURM_JOBID", None) is not None:
        local_rank = int(os.environ["SLURM_LOCALID"])
        world_rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
    # Otherwise setup for just 1 process, this is nice for testing
    else:
        local_rank = 0
        world_rank = 0
        world_size = 1
    return local_rank, world_rank, world_size

Network interface and process group initialization from `habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py:271-309`:

def init_distrib_slurm(backend: str = "nccl") -> Tuple[int, torch.distributed.TCPStore]:
    assert torch.distributed.is_available(), "torch.distributed must be available"

    if "GLOO_SOCKET_IFNAME" not in os.environ:
        os.environ["GLOO_SOCKET_IFNAME"] = get_ifname()
    if "NCCL_SOCKET_IFNAME" not in os.environ:
        os.environ["NCCL_SOCKET_IFNAME"] = get_ifname()

    local_rank, world_rank, world_size = get_distrib_size()
    main_port = int(os.environ.get("MAIN_PORT", DEFAULT_PORT))
    if SLURM_JOBID is not None:
        main_port += int(SLURM_JOBID) % int(
            os.environ.get("MAIN_PORT_RANGE", DEFAULT_PORT_RANGE)
        )

SLURM batch job detection from `habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py:58-71`:

def is_slurm_batch_job() -> bool:
    r"""Heuristic to determine if a slurm job is a batch job or not."""
    return is_slurm_job() and os.environ.get("SLURM_JOB_NAME", None) not in (
        None, "bash", "zsh", "fish", "tcsh", "sh", "interactive",
    )

Common Errors

  • `RuntimeError: torch.distributed must be available`: PyTorch was built without distributed support. Reinstall PyTorch with distributed support.
  • `RuntimeError: NCCL error`: misconfigured network interface. Set `NCCL_SOCKET_IFNAME` to the correct interface (e.g., `eth0`).
  • `Address already in use`: port conflict between concurrent jobs. Set `MAIN_PORT` to an unused port or rely on `MAIN_PORT_RANGE`.
  • `scontrol: error: Job requeue failed`: the job is not a SLURM batch job. Requeue only works for `sbatch` jobs, not `srun` interactive jobs.

Compatibility Notes

  • GLOO backend: Default in config (`distrib_backend: "GLOO"`). Used for CPU-only distributed training and for VER preemption decider.
  • NCCL backend: Specified in DD-PPO YAML configs (`distrib_backend: NCCL`). Required for efficient GPU-to-GPU communication.
  • Signal handling: SLURM preemption uses SIGTERM (save and exit), SIGUSR1 (requeue), SIGUSR2 (clean exit without timer). When using NCCL, use `scancel --signal SIGUSR2` for clean shutdown.
  • Port allocation: When running multiple SLURM jobs, the port is automatically offset by `SLURM_JOB_ID % MAIN_PORT_RANGE` to avoid conflicts.
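The preemption signals above can be wired up roughly as follows. This is a minimal sketch under the signal mapping stated in this section; the flag and function names are illustrative, not Habitat-Lab's actual handlers:

```python
import signal

# Illustrative module-level flags a training loop could poll each update.
STATE = {"requeue": False, "exit": False}

def _requeue_handler(signum, frame):
    # SIGUSR1: SLURM is preempting the job -- save a checkpoint and requeue.
    STATE["requeue"] = True

def _clean_exit_handler(signum, frame):
    # SIGTERM / SIGUSR2: save a checkpoint and exit without requeueing.
    STATE["exit"] = True

def install_preemption_handlers():
    signal.signal(signal.SIGUSR1, _requeue_handler)
    signal.signal(signal.SIGTERM, _clean_exit_handler)
    signal.signal(signal.SIGUSR2, _clean_exit_handler)
```

A training loop would check these flags between updates and either requeue itself (e.g., via `scontrol requeue`) or shut down cleanly.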
