Implementation:FMInference FlexLLMGen DeepSpeed Launch

Field	Value
Sources	Repo: FlexLLMGen
Domains	Distributed_Training, Process_Management
Last Updated	2026-02-09 00:00 GMT

Overview

Vendored DeepSpeed per-node launcher that spawns the worker sub-processes for one node of a distributed training job. It manages GPU assignment, environment variables, signal handling, elastic training, and per-rank logging.

Description

launch.py is the per-node component of DeepSpeed's distributed launch system. It runs on each worker node (invoked by the runner) and spawns one worker sub-process per local GPU. Each sub-process receives the correct RANK, LOCAL_RANK, WORLD_SIZE, and CUDA_VISIBLE_DEVICES environment variables.
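The per-worker environment described above can be sketched as follows. This is a minimal illustration, not the vendored code: build_worker_envs and its parameters are hypothetical names, and it assumes every worker on a node sees the same CUDA_VISIBLE_DEVICES list and selects its device through LOCAL_RANK.

```python
def build_worker_envs(base_env, local_gpu_ids, global_ranks,
                      world_size, master_addr, master_port):
    """Return one environment dict per worker on this node (hypothetical helper)."""
    # All workers on the node share the same visible-device list.
    visible = ",".join(str(gpu) for gpu in local_gpu_ids)
    envs = []
    for local_rank, global_rank in enumerate(global_ranks):
        env = dict(base_env)  # copy so workers do not share state
        env.update({
            "CUDA_VISIBLE_DEVICES": visible,
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            "RANK": str(global_rank),
            "LOCAL_RANK": str(local_rank),
        })
        envs.append(env)
    return envs
```

Each returned dict could then be passed as the env argument of subprocess.Popen when starting the corresponding worker.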

Key features:

Argument parsing -- Accepts node_rank, master_addr, master_port, world_info (base64-encoded JSON mapping hostnames to GPU IDs), and the user's training script with its arguments.
Global rank computation -- Decodes the world_info dictionary to compute global ranks from the node list and local GPU IDs. Builds a global_rank_mapping that assigns consecutive global ranks across all nodes.
GPU assignment -- Sets CUDA_VISIBLE_DEVICES on each node to the local GPU IDs specified in world_info.
Process spawning -- For non-elastic training, spawns one subprocess per local GPU via subprocess.Popen. RANK and LOCAL_RANK are set in each subprocess's environment, and --local_rank is appended to the training script's command line unless --no_local_rank is given.
Elastic training -- When --enable_elastic_training is set, uses DSElasticAgent (DeepSpeed's agent built on PyTorch elastic) with c10d rendezvous instead of direct subprocess spawning. Supports minimum and maximum elastic node counts.
Per-rank logging -- When --enable_each_rank_log is set, redirects each rank's stdout/stderr to a separate timestamped log file.
Signal handling -- Installs SIGINT and SIGTERM handlers that terminate all child process trees using psutil, ensuring clean shutdown on interruption.
Process tree termination -- terminate_process_tree uses psutil to find every descendant of a worker process and terminate it together with the worker, so that no orphaned processes are left behind.
PID file management -- Optionally writes the launcher PID to /tmp for programmatic process tracking.
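The global rank computation in the list above can be sketched as follows. This is a simplified illustration under two assumptions: that world_info is URL-safe base64 over UTF-8 JSON, and that ranks are assigned in the order hosts appear in the JSON object. The function names are hypothetical.

```python
import base64
import json


def encode_world_info(world_info):
    """Encode {hostname: [gpu_ids]} as a base64 string (assumed URL-safe variant)."""
    raw = json.dumps(world_info).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("utf-8")


def build_global_rank_mapping(encoded_world_info):
    """Decode world_info and assign consecutive global ranks across nodes."""
    world_info = json.loads(base64.urlsafe_b64decode(encoded_world_info))
    mapping = {}
    next_rank = 0
    for host, gpu_ids in world_info.items():
        # One global rank per local GPU, numbered consecutively across hosts.
        mapping[host] = list(range(next_rank, next_rank + len(gpu_ids)))
        next_rank += len(gpu_ids)
    return mapping, next_rank  # next_rank equals the world size
```

With two GPUs on the first host and three on the second, the first host gets ranks 0 and 1 and the second gets ranks 2 through 4, for a world size of 5.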

This is AUTO_KEEP vendored code from DeepSpeed.

Code Reference

Field	Value
Repository	FlexLLMGen
File	benchmark/third_party/DeepSpeed/deepspeed/launcher/launch.py
Lines	1-358

Key Functions:

def parse_args(): ...

def terminate_process_tree(pid): ...

def main():
    # Decodes world_info, computes ranks, spawns processes
    # Handles elastic training via DSElasticAgent
    # Installs signal handlers for clean shutdown
    ...
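The shutdown path mentioned in main() can be illustrated with a standard-library-only sketch. It is a simplification: it only terminates the direct children it spawned, whereas the launcher uses psutil to reach each child's descendants as well. The names spawn_workers, shutdown, and install_handlers are hypothetical, and the sleeping child stands in for a real training script.

```python
import signal
import subprocess
import sys


def spawn_workers(count):
    """Start placeholder workers; a real launcher would run the training script."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(count)]


def shutdown(processes, timeout=10.0):
    """Ask every live worker to terminate, then force-kill any that linger."""
    for proc in processes:
        if proc.poll() is None:
            proc.terminate()
    for proc in processes:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def install_handlers(processes):
    """Route SIGINT and SIGTERM to a handler that stops all workers and exits."""
    def handler(signum, frame):
        shutdown(processes)
        sys.exit(1)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
    return handler
```

Terminating politely first and killing only after a timeout gives workers a chance to flush logs before they are stopped.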

I/O Contract

Command-Line Arguments

Argument	Type	Default	Description
--node_rank	int	0	Rank of this node in multi-node setup
--master_addr	str	127.0.0.1	IP address of rank 0 node
--master_port	int	29500	Communication port for torch.distributed
--world_info	str	None	Base64-encoded JSON: {hostname: [gpu_ids]}
--module	flag	False	Run training script as Python module (-m)
--no_python	flag	False	Execute training script directly without python
--enable_elastic_training	flag	False	Use PyTorch elastic agent
--no_local_rank	flag	False	Do not pass --local_rank to training script
--save_pid	int	0	PID of main process for tracking
--enable_each_rank_log	str	None	Directory for per-rank log files
training_script	str	required	Path to user's training script
training_script_args	list	[]	Arguments for the training script
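As an illustration of how these arguments fit together, the sketch below assembles a launcher command line as a list. The module path deepspeed.launcher.launch is inferred from the file location given above and should be treated as an assumption, as should the helper name build_launch_command; the flags themselves come from the table.

```python
import sys


def build_launch_command(world_info_b64, node_rank, master_addr, master_port,
                         training_script, training_script_args=(),
                         no_local_rank=False):
    """Assemble an argv list for the per-node launcher (hypothetical helper)."""
    cmd = [
        sys.executable, "-u", "-m", "deepspeed.launcher.launch",  # assumed module path
        f"--world_info={world_info_b64}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
    ]
    if no_local_rank:
        cmd.append("--no_local_rank")
    # Launcher options must come before the script; everything after the
    # script path is forwarded to the training script.
    cmd.append(training_script)
    cmd.extend(training_script_args)
    return cmd
```

The resulting list can be handed to subprocess.Popen on the target node; launcher options are placed before the training script so that they are not mistaken for script arguments.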

Environment Variables Set

Variable	Description
CUDA_VISIBLE_DEVICES	Comma-separated GPU IDs for this node
MASTER_ADDR	IP address of rank 0
MASTER_PORT	Communication port
WORLD_SIZE	Total number of processes across all nodes
RANK	Global rank of each spawned process
LOCAL_RANK	Local GPU rank on this node
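On the worker side, a training script can read these variables back. The sketch below is one way to do so; DistEnv and read_dist_env are hypothetical names, and the function accepts an optional mapping so it can be exercised without modifying the real environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_dist_env(environ=None):
    """Parse the launcher-provided variables into typed fields."""
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )
```

A missing variable raises KeyError, which makes it obvious when a script is started outside the launcher.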

Related Pages

Principle:FMInference_FlexLLMGen_Distributed_Process_Launching
