Implementation:Facebookresearch Habitat lab Init distrib slurm

From Leeroopedia
Domains: Distributed_Computing, Reinforcement_Learning
Last Updated: 2026-02-15 02:00 GMT

Overview

A concrete function, provided by habitat-baselines, that initializes PyTorch distributed processes for DD-PPO training in SLURM clusters or local multi-GPU environments.

Description

The init_distrib_slurm function detects the distributed environment (a SLURM cluster or a local multi-GPU machine), determines the process rank and world size, initializes the PyTorch distributed process group with the NCCL backend, and creates a TCPStore for shared state. The companion DDPPO.init_distributed method then wraps the policy in DistributedDataParallel.
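
A minimal sketch of that flow follows; it is not the library's implementation. The helper name _detect_distrib_size, the fallback address, and the port 8738 are illustrative, and the environment variables are the standard torchrun (LOCAL_RANK, RANK, WORLD_SIZE) and SLURM (SLURM_LOCALID, SLURM_PROCID, SLURM_NTASKS) conventions. The authoritative logic lives in ddp_utils.py (see Code Reference below).

import os
import torch.distributed as distrib

def _detect_distrib_size():
    # torchrun / torch.distributed.launch export these directly.
    if "LOCAL_RANK" in os.environ:
        return (
            int(os.environ["LOCAL_RANK"]),
            int(os.environ["RANK"]),
            int(os.environ["WORLD_SIZE"]),
        )
    # Under SLURM, srun exports per-task rank information.
    if "SLURM_JOB_ID" in os.environ:
        return (
            int(os.environ["SLURM_LOCALID"]),
            int(os.environ["SLURM_PROCID"]),
            int(os.environ["SLURM_NTASKS"]),
        )
    # Single-process fallback.
    return 0, 0, 1

local_rank, world_rank, world_size = _detect_distrib_size()
main_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
main_port = int(os.environ.get("MASTER_PORT", "8738"))

# Rank 0 hosts the TCPStore; all other ranks connect to it.
tcp_store = distrib.TCPStore(main_addr, main_port, world_size, world_rank == 0)
distrib.init_process_group(
    "nccl", store=tcp_store, rank=world_rank, world_size=world_size
)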

Usage

init_distrib_slurm is called at the beginning of `PPOTrainer._init_train()` when `self._is_distributed` is True. It is only needed for multi-GPU DD-PPO training, as the sketch below illustrates.
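
A hedged sketch of that gating pattern: init_train here is a stand-in for `PPOTrainer._init_train()`, and the is_distributed argument mirrors `self._is_distributed`.

import torch
from habitat_baselines.rl.ddppo.ddp_utils import init_distrib_slurm

def init_train(is_distributed: bool):
    # Single-GPU runs skip process-group setup entirely.
    if not is_distributed:
        return None, None
    # Multi-GPU DD-PPO: join the process group before building the policy.
    local_rank, tcp_store = init_distrib_slurm(backend="nccl")
    torch.cuda.set_device(local_rank)
    return local_rank, tcp_store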

Code Reference

Source Location

  • Repository: habitat-lab
  • File: habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py
  • Lines: L271-309 (init_distrib_slurm), L247-264 (get_distrib_size)

Signature

from typing import Tuple

import torch
import torch.distributed

def init_distrib_slurm(
    backend: str = "nccl",
) -> Tuple[int, torch.distributed.TCPStore]:
    """
    Initialize torch.distributed from SLURM environment variables.

    Args:
        backend: Distributed backend ("nccl" for GPU, "gloo" for CPU)
    Returns:
        Tuple of (local_rank, tcp_store)
    """

Import

from habitat_baselines.rl.ddppo.ddp_utils import init_distrib_slurm, get_distrib_size

I/O Contract

Inputs

  Name            Type         Required  Description
  backend         str          No        Distributed backend; defaults to "nccl"
  SLURM env vars  Environment  No        SLURM_JOB_ID, SLURM_STEP_NODELIST, etc. (auto-detected)
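
For quick local testing without SLURM, the torchrun-style variables can be exported by hand. A minimal sketch, assuming the rank detection falls back to these variables when no SLURM job is present; all values below are illustrative.

import os

# Simulate a one-process "world" so init_distrib_slurm can run locally.
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "8738")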

Outputs

  Name        Type                        Description
  local_rank  int                         Local GPU rank for this process
  tcp_store   torch.distributed.TCPStore  Shared key-value store for cross-worker coordination
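
Continuing the sketch from the Description section (tcp_store as returned above), the store supports atomic counters; DD-PPO uses a similar counting pattern for coordination across workers. The key name "num_done" is illustrative, not from the library.

# TCPStore.add is atomic across workers: the first call creates the
# counter, subsequent calls increment it.
tcp_store.add("num_done", 1)
num_done = int(tcp_store.get("num_done"))  # get() returns bytes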

Usage Examples

Initialize Distributed Training

import torch

from habitat_baselines.rl.ddppo.ddp_utils import init_distrib_slurm

# Initialize distributed process group
local_rank, tcp_store = init_distrib_slurm(backend="nccl")

# Set device for this worker
device = torch.device(f"cuda:{local_rank}")
torch.cuda.set_device(device)
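
Wrap the Policy in DistributedDataParallel

Continuing the example above, the policy can then be wrapped for gradient synchronization. This mirrors what the Description attributes to DDPPO.init_distributed; the nn.Linear module is a stand-in for a real actor-critic network.

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Stand-in policy; DD-PPO would build its actor-critic network here.
policy = nn.Linear(64, 4).to(device)

# Gradients are all-reduced across workers on every backward pass.
policy = DDP(policy, device_ids=[local_rank], output_device=local_rank)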

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
