Principle:FMInference FlexLLMGen Distributed Process Launching

Field	Value
Sources	Paper: FlexGen, DeepSpeed Launcher Documentation, PyTorch Distributed Documentation
Domains	Distributed_Training, Process_Management
Last Updated	2026-02-09 00:00 GMT

Overview

A two-tier process launching strategy for distributed training where a runner orchestrates multi-node deployment and a per-node launcher spawns GPU-specific worker processes, with support for elastic training, signal-based lifecycle management, and per-rank logging.

Description

Distributed process launching is the mechanism by which a distributed training job goes from a single command invocation to N worker processes running across M nodes, each bound to a specific GPU. The DeepSpeed launcher uses a two-tier architecture, and the per-node tier relies on the mechanisms listed after it:

Tier 1: Runner -- Orchestrates the multi-node deployment: SSH, resource selection, and environment export.
Tier 2: Per-node launcher -- Runs on each worker node and spawns one subprocess per local GPU. It receives a world_info dictionary (JSON, base64-encoded) that maps hostnames to GPU ID lists, from which it computes global ranks and sets environment variables.
Environment variable protocol -- Distributed training frameworks use environment variables (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) to communicate topology information to each worker process. The launcher sets these variables per-process before spawning, establishing the distributed context.
CUDA device binding -- CUDA_VISIBLE_DEVICES is set at the node level to restrict GPU access to the devices assigned to that node. Each process then uses LOCAL_RANK as an index into the visible devices to select its own GPU, so no process touches a GPU outside the assigned set.
Signal-based lifecycle -- The launcher installs SIGINT and SIGTERM handlers that propagate termination to all child processes using process tree killing (via psutil). This ensures that:
- Ctrl+C in the terminal cleanly shuts down all workers.
- Programmatic termination via PID files cleanly shuts down the job.
- No zombie processes remain after shutdown.
Elastic training mode -- When enabled, the launcher uses PyTorch's elastic agent (c10d rendezvous) instead of static process spawning. This allows nodes to join or leave the training job dynamically: the agent re-runs the rendezvous automatically, and training state is restored after the workers restart (typically from a checkpoint).
Failure propagation -- If any worker process exits with a non-zero return code, the launcher terminates all remaining workers and exits with the failing process's return code.
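
The rank computation and environment variable protocol described above can be sketched as follows. This is a minimal illustration, not the launcher's actual code: the function names decode_world_info and build_worker_envs, the node_rank parameter, and the host names in the usage note are assumptions made for the example. It also assumes that nodes are ranked in the order their hostnames appear in world_info.

```python
import base64
import json


def decode_world_info(encoded):
    """Decode a base64-encoded JSON mapping of hostname -> list of GPU IDs."""
    return json.loads(base64.urlsafe_b64decode(encoded))


def build_worker_envs(world_info, node_rank, master_addr, master_port):
    """Return one environment-variable dict per local worker on this node.

    Hypothetical helper: assumes node order follows the key order of world_info.
    """
    hosts = list(world_info.keys())
    world_size = sum(len(gpus) for gpus in world_info.values())
    # Global rank of this node's first worker = number of GPUs on preceding nodes.
    rank_offset = sum(len(world_info[h]) for h in hosts[:node_rank])
    local_gpus = world_info[hosts[node_rank]]
    # Node-level binding: every local worker sees the same set of devices.
    visible = ",".join(str(g) for g in local_gpus)

    envs = []
    for local_rank in range(len(local_gpus)):
        envs.append({
            "CUDA_VISIBLE_DEVICES": visible,
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            "RANK": str(rank_offset + local_rank),
            "LOCAL_RANK": str(local_rank),
        })
    return envs
```

For example, with a world_info of {"node-a": [0, 1], "node-b": [0, 1, 2]}, calling build_worker_envs with node_rank=1 yields three environments with RANK 2, 3, and 4, LOCAL_RANK 0, 1, and 2, and WORLD_SIZE 5. Each dict would be merged into a copy of os.environ before the corresponding worker is spawned.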

Usage

Use distributed process launching for any multi-GPU or multi-node DeepSpeed training or inference job. The launcher takes care of rank assignment, GPU binding, and process lifecycle management, so the training script does not have to.

Key configuration choices:

Static vs. elastic -- Use static launching for fixed-size jobs. Use elastic launching when nodes may fail or when autoscaling is desired.
Per-rank logging -- Enable when debugging multi-GPU issues to get separate log files per rank.
Module mode -- Use --module when the training script is a Python module rather than a script file.
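
The module-mode and per-rank logging choices mainly affect how each worker's command line and output file are assembled. The sketch below illustrates the idea under stated assumptions: build_worker_command and rank_log_path are hypothetical helpers, the rank_<N>.log naming scheme is invented for the example, and passing the local rank to the script as a --local_rank argument is assumed rather than taken from the launcher's documentation.

```python
import os
import sys


def build_worker_command(training_script, script_args, local_rank, module=False):
    """Assemble the argv list for one worker process (hypothetical helper)."""
    cmd = [sys.executable, "-u"]  # -u: unbuffered stdout/stderr
    if module:
        # Module mode: run "python -m package.module" instead of a script path.
        cmd.append("-m")
    cmd.append(training_script)
    cmd.append(f"--local_rank={local_rank}")  # assumed argument convention
    cmd.extend(script_args)
    return cmd


def rank_log_path(log_dir, rank):
    """Return a per-rank log file path (naming scheme is an assumption)."""
    return os.path.join(log_dir, f"rank_{rank}.log")
```

With per-rank logging enabled, a launcher following this sketch would open rank_log_path(log_dir, rank) for each worker and pass the file object as stdout and stderr when spawning build_worker_command(...), so that output from different ranks does not interleave.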

Theoretical Basis

The two-tier architecture follows the master-worker pattern common in distributed systems. The design separates concerns:

Cross-node orchestration (runner tier) handles SSH, resource selection, and environment export.
Intra-node spawning (launcher tier) handles process creation, GPU binding, and local lifecycle.

This separation allows the launcher to be reused across different multi-node backends (PDSH, OpenMPI, MVAPICH, Slurm) while maintaining consistent per-node behavior.

The signal handling follows the supervisor pattern: the launcher acts as a supervisor process that monitors its children and enforces all-or-nothing execution semantics (if one fails, all are terminated).
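
A minimal sketch of this supervisor pattern, using only the Python standard library, is shown here. It is illustrative rather than the launcher's implementation: the names supervise and terminate_all, the install_handlers and grace_seconds parameters, and the 128 + signal-number exit code are choices made for the example. It also swaps in Popen.terminate() for the psutil-based process tree killing described earlier, so it only signals direct children, not their descendants.

```python
import signal
import subprocess
import sys
import time


def terminate_all(procs, grace_seconds=5.0):
    """Ask every live child to stop, then force-kill any that ignore the request."""
    for p in procs:
        if p.poll() is None:
            p.terminate()
    deadline = time.monotonic() + grace_seconds
    for p in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            p.kill()
            p.wait()


def supervise(commands, poll_interval=0.1, install_handlers=True):
    """Spawn all commands; if one fails, stop the rest and return its exit code."""
    procs = [subprocess.Popen(cmd) for cmd in commands]

    def handle_signal(signum, frame):
        # Propagate Ctrl+C / SIGTERM to the children before exiting.
        terminate_all(procs)
        sys.exit(128 + signum)

    if install_handlers:
        signal.signal(signal.SIGINT, handle_signal)
        signal.signal(signal.SIGTERM, handle_signal)

    alive = list(procs)
    while alive:
        for p in list(alive):
            code = p.poll()
            if code is None:
                continue  # still running
            alive.remove(p)
            if code != 0:
                # All-or-nothing semantics: one failure ends the whole job.
                terminate_all(procs)
                return code
        time.sleep(poll_interval)
    return 0
```

A caller would typically finish with sys.exit(supervise(commands)), so that the failing worker's return code becomes the exit code of the supervising process.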

Related Pages

Implementation:FMInference_FlexLLMGen_DeepSpeed_Launch
