Principle:FMInference FlexLLMGen Distributed Process Launching
| Field | Value |
|---|---|
| Sources | Paper: FlexGen, DeepSpeed Launcher Documentation, PyTorch Distributed Documentation |
| Domains | Distributed_Training, Process_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A two-tier process launching strategy for distributed training where a runner orchestrates multi-node deployment and a per-node launcher spawns GPU-specific worker processes, with support for elastic training, signal-based lifecycle management, and per-rank logging.
Description
Distributed process launching is the mechanism by which a distributed training job goes from a single command invocation to N worker processes running across M nodes, each bound to a specific GPU. The DeepSpeed launcher uses a two-tier architecture, built from the following tiers and mechanisms:
- Tier 1: Runner -- Orchestrates the multi-node deployment. It handles SSH, resource selection, and environment export, and starts the per-node launcher on each worker node.
- Tier 2: Per-node launcher -- Runs on each worker node and spawns one subprocess per local GPU. It receives a world_info dictionary (encoded as base64 JSON) that maps hostnames to GPU ID lists, from which it computes global ranks and sets environment variables.
- Environment variable protocol -- Distributed training frameworks use environment variables (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) to communicate topology information to each worker process. The launcher sets these variables per-process before spawning, establishing the distributed context.
- CUDA device binding -- CUDA_VISIBLE_DEVICES is set at the node level to restrict GPU access to the devices assigned to that node. Each process then uses LOCAL_RANK to select its own GPU from that visible set.
- Signal-based lifecycle -- The launcher installs SIGINT and SIGTERM handlers that propagate termination to all child processes using process tree killing (via psutil). This ensures that:
  - Ctrl+C in the terminal cleanly shuts down all workers.
  - Programmatic termination via PID files cleanly shuts down the job.
  - No zombie processes remain after shutdown.
- Elastic training mode -- When enabled, the launcher uses PyTorch's elastic agent (c10d rendezvous) instead of static process spawning. This allows nodes to join or leave the training job dynamically, with automatic re-rendezvous when membership changes. Restoring training state after a restart typically relies on checkpoints loaded by the training script.
- Failure propagation -- If any worker process exits with a non-zero return code, the launcher terminates all remaining workers and exits with the failing process's return code.
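The rank computation and environment variable protocol described above can be sketched as follows. This is a minimal illustration, not the launcher's actual code: the function names `decode_world_info` and `build_worker_envs` are hypothetical, and the sketch assumes URL-safe base64 and that node order follows the key order of the world_info dictionary.

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a base64-encoded JSON mapping of hostname -> list of GPU IDs."""
    return json.loads(base64.urlsafe_b64decode(encoded))


def build_worker_envs(world_info: dict, node_rank: int,
                      master_addr: str, master_port: int) -> list:
    """Return one environment dict per local worker on the given node."""
    hosts = list(world_info.keys())  # assumption: key order defines node order
    world_size = sum(len(gpus) for gpus in world_info.values())
    # Global ranks on this node start after all GPUs of the preceding nodes.
    rank_offset = sum(len(world_info[h]) for h in hosts[:node_rank])
    local_gpus = world_info[hosts[node_rank]]

    envs = []
    for local_rank in range(len(local_gpus)):
        envs.append({
            # Set once per node: every local worker sees the same device list.
            "CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in local_gpus),
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            "RANK": str(rank_offset + local_rank),
            "LOCAL_RANK": str(local_rank),
        })
    return envs


# Example: two nodes with 2 and 3 GPUs; compute the environments for node 1.
encoded = base64.urlsafe_b64encode(
    json.dumps({"node-a": [0, 1], "node-b": [0, 1, 2]}).encode()
).decode()
envs = build_worker_envs(decode_world_info(encoded), node_rank=1,
                         master_addr="10.0.0.1", master_port=29500)
```

In this example the second node hosts global ranks 2, 3, and 4 out of a world size of 5, while its local ranks run from 0 to 2. A real launcher would merge each dictionary into a copy of the parent environment before spawning the corresponding worker.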
Usage
Use distributed process launching for any multi-GPU or multi-node DeepSpeed training or inference job. The launcher handles rank assignment, GPU binding, and process lifecycle, so the training script only needs to read its rank and topology from the environment.
Key configuration choices:
- Static vs. elastic -- Use static launching for fixed-size jobs. Use elastic launching when nodes may fail or when autoscaling is desired.
- Per-rank logging -- Enable when debugging multi-GPU issues to get separate log files per rank.
- Module mode -- Use --module when the training script is a Python module rather than a script file.
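The difference between script mode and module mode, and one possible per-rank log layout, can be sketched as below. The helper names `build_worker_command` and `rank_log_path`, and the `rank_<N>.log` naming scheme, are hypothetical choices for illustration and are not taken from the launcher itself.

```python
import os
import sys


def build_worker_command(target, script_args, module=False):
    """Build argv for one worker; module=True mirrors `python -m target`."""
    cmd = [sys.executable, "-u"]  # -u: unbuffered output, helpful for log files
    if module:
        cmd.append("-m")
    cmd.append(target)
    return cmd + list(script_args)


def rank_log_path(log_dir, rank):
    """Hypothetical naming scheme: one log file per global rank."""
    return os.path.join(log_dir, f"rank_{rank}.log")


script_cmd = build_worker_command("train.py", ["--epochs", "1"])
module_cmd = build_worker_command("pkg.train", ["--epochs", "1"], module=True)
```

With per-rank logging enabled, a launcher along these lines would redirect each worker's output to the file returned for its rank, so interleaved output from different GPUs can be read separately.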
Theoretical Basis
The two-tier architecture follows the master-worker pattern common in distributed systems. The design separates concerns:
- Cross-node orchestration (runner tier) handles SSH, resource selection, and environment export.
- Intra-node spawning (launcher tier) handles process creation, GPU binding, and local lifecycle.
This separation allows the launcher to be reused across different multi-node backends (PDSH, OpenMPI, MVAPICH, Slurm) while maintaining consistent per-node behavior.
The signal handling follows the supervisor pattern: the launcher acts as a supervisor process that monitors its children and enforces all-or-nothing execution semantics (if one fails, all are terminated).
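The all-or-nothing semantics of the supervisor pattern can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the function name `supervise` is hypothetical, signal handler installation is omitted, and `Popen.terminate` stands in for the psutil-based process tree killing mentioned above, so grandchild processes are not covered.

```python
import subprocess
import sys
import time


def supervise(commands, poll_interval=0.1):
    """Run all commands; if one fails, stop the rest and return its exit code."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    alive = list(procs)
    while alive:
        for proc in list(alive):
            code = proc.poll()
            if code is None:
                continue  # still running
            alive.remove(proc)
            if code != 0:
                # All-or-nothing: terminate the survivors, then reap them
                # so that no child is left behind.
                for other in alive:
                    other.terminate()
                for other in alive:
                    other.wait()
                return code
        time.sleep(poll_interval)
    return 0


# Stand-in workers: one long-running process and one that fails quickly.
slow = [sys.executable, "-c", "import time; time.sleep(30)"]
bad = [sys.executable, "-c", "import sys; sys.exit(3)"]

exit_code = supervise([slow, bad])
```

Here the failing worker's return code becomes the supervisor's result, and the long-running worker is terminated rather than left to finish. A fuller version would route SIGINT and SIGTERM into the same termination path.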