Implementation: facebookresearch/habitat-lab `init_distrib_slurm`
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Reinforcement_Learning |
| Last Updated | 2026-02-15 02:00 GMT |
Overview
Concrete function for initializing PyTorch distributed processes in SLURM or local multi-GPU environments for DD-PPO training, provided by habitat-baselines.
Description
The init_distrib_slurm function detects the distributed environment (SLURM cluster or local multi-GPU), determines rank/world_size, initializes the PyTorch distributed process group with the NCCL backend, and creates a TCPStore for shared state. The companion DDPPO.init_distributed method wraps the policy in DistributedDataParallel.
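The environment detection step can be sketched as follows. This is an illustrative reconstruction, not the library's exact code: the function name `get_distrib_size_sketch` is hypothetical, and the variable names mirror standard SLURM (`SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS`) and torchrun (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) conventions.

```python
import os


def get_distrib_size_sketch(env=None):
    """Hypothetical sketch of SLURM-vs-local rank detection.

    Returns (local_rank, world_rank, world_size), defaulting to a
    single-process layout when no launcher variables are present.
    """
    env = os.environ if env is None else env
    if "SLURM_JOB_ID" in env:
        # Launched under SLURM via srun: SLURM exports per-task ranks.
        local_rank = int(env["SLURM_LOCALID"])
        world_rank = int(env["SLURM_PROCID"])
        world_size = int(env["SLURM_NTASKS"])
    else:
        # Launched locally, e.g. via torchrun, which sets RANK et al.
        local_rank = int(env.get("LOCAL_RANK", 0))
        world_rank = int(env.get("RANK", 0))
        world_size = int(env.get("WORLD_SIZE", 1))
    return local_rank, world_rank, world_size
```

With SLURM variables present, the SLURM ranks win; otherwise the torchrun-style variables (or single-process defaults) are used.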
Usage
Called at the beginning of `PPOTrainer._init_train()` when `self._is_distributed` is True. Only needed for multi-GPU DD-PPO training.
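The `_is_distributed` decision itself is typically just a check for more than one participating process. A minimal sketch, assuming the standard SLURM/torchrun variables (the helper name `should_init_distributed` is illustrative, not habitat's API):

```python
import os


def should_init_distributed(env=None):
    """Hypothetical sketch: enable distributed setup only when a
    launcher reports more than one process."""
    env = os.environ if env is None else env
    world_size = int(env.get("SLURM_NTASKS", env.get("WORLD_SIZE", 1)))
    return world_size > 1
```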
Code Reference
Source Location
- Repository: habitat-lab
- File: habitat-baselines/habitat_baselines/rl/ddppo/ddp_utils.py
- Lines: L271-309 (init_distrib_slurm), L247-264 (get_distrib_size)
Signature
```python
def init_distrib_slurm(
    backend: str = "nccl",
) -> Tuple[int, torch.distributed.TCPStore]:
    """Initialize torch.distributed from SLURM environment variables.

    Args:
        backend: Distributed backend ("nccl" for GPU, "gloo" for CPU).

    Returns:
        Tuple of (local_rank, tcp_store).
    """
```
Import
```python
from habitat_baselines.rl.ddppo.ddp_utils import init_distrib_slurm, get_distrib_size
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| backend | str | No | Distributed backend, defaults to "nccl" |
| SLURM env vars | Environment | No | SLURM_JOB_ID, SLURM_STEP_NODELIST, etc. (auto-detected) |
Outputs
| Name | Type | Description |
|---|---|---|
| local_rank | int | Local GPU rank for this process |
| tcp_store | torch.distributed.TCPStore | Shared key-value store for coordination |
Usage Examples
Initialize Distributed Training
```python
import torch

from habitat_baselines.rl.ddppo.ddp_utils import init_distrib_slurm

# Initialize the distributed process group (NCCL backend for GPUs)
local_rank, tcp_store = init_distrib_slurm(backend="nccl")

# Pin this worker to its local GPU
device = torch.device(f"cuda:{local_rank}")
torch.cuda.set_device(device)
```
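After initialization, `DDPPO.init_distributed` wraps the policy in `DistributedDataParallel` so gradients are all-reduced across workers on each backward pass. The pattern can be sketched in a single-process form; the `nn.Sequential` stand-in for the policy, the `gloo` backend, and the `FileStore` are illustrative conveniences for running on one CPU process, not what DD-PPO uses in production (there, `init_distrib_slurm` has already created the NCCL group):

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn

# Hypothetical stand-in for an actor-critic policy network.
policy = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))

# Single-process gloo group purely for illustration.
if not dist.is_initialized():
    store = dist.FileStore(os.path.join(tempfile.mkdtemp(), "store"), 1)
    dist.init_process_group("gloo", store=store, rank=0, world_size=1)

# Conceptually what DDPPO.init_distributed does: wrap the policy so
# gradients synchronize across all workers during backward().
ddp_policy = nn.parallel.DistributedDataParallel(policy)

out = ddp_policy(torch.randn(2, 8))
```

With more than one process, each worker would construct the same wrapper after `init_distrib_slurm`, and DDP handles the gradient all-reduce transparently.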