Implementation:FMInference FlexLLMGen DeepSpeed Launch
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen |
| Domains | Distributed_Training, Process_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Vendored DeepSpeed per-node launcher that spawns multiple worker sub-processes for distributed training on a single node, managing GPU assignments, environment variables, signal handling, elastic training, and per-rank logging.
Description
launch.py is the per-node component of DeepSpeed's distributed launch system. It runs on each worker node (invoked by the runner) and spawns one worker sub-process per local GPU. Each sub-process receives the correct RANK, LOCAL_RANK, WORLD_SIZE, and CUDA_VISIBLE_DEVICES environment variables.
Key features:
- Argument parsing -- Accepts node_rank, master_addr, master_port, world_info (base64-encoded JSON mapping hostnames to GPU IDs), and the user's training script with its arguments.
- Global rank computation -- Decodes the world_info dictionary to compute global ranks from the node list and local GPU IDs. Builds a global_rank_mapping that assigns consecutive global ranks across all nodes.
- GPU assignment -- Sets CUDA_VISIBLE_DEVICES on each node to the local GPU IDs specified in world_info.
- Process spawning -- For non-elastic training, spawns one subprocess per local GPU via subprocess.Popen, setting RANK and LOCAL_RANK in each child's environment and, unless --no_local_rank is given, appending --local_rank to the training script's command-line arguments.
- Elastic training -- When --enable_elastic_training is set, uses DSElasticAgent (DeepSpeed's agent built on PyTorch's elastic agent) with c10d rendezvous instead of direct subprocess spawning. Supports min/max elastic node counts.
- Per-rank logging -- When --enable_each_rank_log is set, redirects each rank's stdout/stderr to a separate timestamped log file.
- Signal handling -- Installs SIGINT and SIGTERM handlers that terminate all child process trees using psutil, ensuring clean shutdown on interruption.
- Process tree termination -- Uses psutil to recursively kill all children of each worker process, so that no orphaned descendant processes are left running.
- PID file management -- When --save_pid is given, writes the launcher PID to a file under /tmp for programmatic process tracking.
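The world_info decoding and global rank computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the vendored code itself: the helper names `decode_world_info` and `build_global_rank_mapping` and the two-node cluster are hypothetical, and URL-safe base64 is assumed for the encoding.

```python
import base64
import json
from collections import defaultdict


def decode_world_info(encoded):
    """Decode a base64-encoded JSON string into {hostname: [gpu_ids]}."""
    return json.loads(base64.urlsafe_b64decode(encoded))


def build_global_rank_mapping(world_info):
    """Assign consecutive global ranks node by node, in dictionary order."""
    mapping = defaultdict(list)
    next_rank = 0
    for hostname, gpu_ids in world_info.items():
        for _ in gpu_ids:
            mapping[hostname].append(next_rank)
            next_rank += 1
    # After the loop, next_rank equals the total number of processes.
    return dict(mapping), next_rank


# Hypothetical two-node cluster, used only for illustration.
example = {"worker-0": [0, 1], "worker-1": [0, 1, 2]}
encoded = base64.urlsafe_b64encode(json.dumps(example).encode()).decode()

world_info = decode_world_info(encoded)
global_rank_mapping, world_size = build_global_rank_mapping(world_info)

node_rank = 1  # this launcher instance runs on the second node
local_node = list(world_info.keys())[node_rank]
local_gpu_ids = world_info[local_node]
print(local_node, local_gpu_ids, global_rank_mapping[local_node], world_size)
```

In this sketch the second node owns global ranks 2 through 4 because the first node's two GPUs consume ranks 0 and 1.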
This is AUTO_KEEP vendored code from DeepSpeed.
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/launcher/launch.py |
| Lines | 1-358 |
Key Functions:
def parse_args(): ...
def terminate_process_tree(pid): ...
def main():
    # Decodes world_info, computes ranks, spawns processes
    # Handles elastic training via DSElasticAgent
    # Installs signal handlers for clean shutdown
    ...
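The shutdown path can be illustrated with a simplified, standard-library-only sketch. It deliberately swaps in a different technique from the one described above: the launcher walks each worker's process tree with psutil, whereas this version only terminates the direct children it spawned. The names `terminate_all` and `install_handlers`, the timeout value, and the exit code convention are assumptions made for the example.

```python
import signal
import subprocess
import sys


def terminate_all(processes, timeout=5.0):
    """Ask each direct child to exit, escalating to kill after a timeout.

    Note: unlike a psutil-based tree walk, this does not reach grandchildren.
    """
    for proc in processes:
        if proc.poll() is None:
            proc.terminate()
    for proc in processes:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def install_handlers(processes):
    """Install SIGINT and SIGTERM handlers that shut down all workers."""

    def handler(signum, frame):
        terminate_all(processes)
        sys.exit(128 + signum)  # assumed convention: 128 + signal number

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)


if __name__ == "__main__":
    # Two placeholder workers standing in for per-GPU training processes.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    workers = [subprocess.Popen(sleeper) for _ in range(2)]
    install_handlers(workers)
    terminate_all(workers)
    print([w.returncode is not None for w in workers])
```

Handlers must be installed from the main thread, which is why `install_handlers` is called from the script's entry point rather than at import time.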
I/O Contract
Command-Line Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --node_rank | int | 0 | Rank of this node in multi-node setup |
| --master_addr | str | 127.0.0.1 | IP address of rank 0 node |
| --master_port | int | 29500 | Communication port for torch.distributed |
| --world_info | str | None | Base64-encoded JSON: {hostname: [gpu_ids]} |
| --module | flag | False | Run training script as Python module (-m) |
| --no_python | flag | False | Execute training script directly without python |
| --enable_elastic_training | flag | False | Use PyTorch elastic agent |
| --no_local_rank | flag | False | Do not pass --local_rank to training script |
| --save_pid | int | 0 | PID of main process for tracking |
| --enable_each_rank_log | str | None | Directory for per-rank log files |
| training_script | str | required | Path to user's training script |
| training_script_args | list | [] | Arguments for the training script |
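The table above maps naturally onto an argparse parser. The following is a sketch reconstructed from the table alone, so the types and defaults mirror the table rather than the vendored source; `build_parser` is a hypothetical name and the sample command line is invented for illustration.

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the argument table."""
    parser = argparse.ArgumentParser(description="per-node launcher (sketch)")
    parser.add_argument("--node_rank", type=int, default=0)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=int, default=29500)
    parser.add_argument("--world_info", type=str, default=None)
    parser.add_argument("--module", action="store_true")
    parser.add_argument("--no_python", action="store_true")
    parser.add_argument("--enable_elastic_training", action="store_true")
    parser.add_argument("--no_local_rank", action="store_true")
    parser.add_argument("--save_pid", type=int, default=0)
    parser.add_argument("--enable_each_rank_log", type=str, default=None)
    # Everything after the script path is passed through untouched.
    parser.add_argument("training_script", type=str)
    parser.add_argument("training_script_args", nargs=argparse.REMAINDER)
    return parser


# Invented command line: launcher options first, then the script and its args.
args = build_parser().parse_args(
    ["--node_rank", "1", "--master_port", "29501", "train.py", "--lr", "0.1"]
)
print(args.node_rank, args.training_script, args.training_script_args)
```

Using `argparse.REMAINDER` for the trailing positional keeps options such as `--lr` from being interpreted as launcher options, since they appear after the script path.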
Environment Variables Set
| Variable | Description |
|---|---|
| CUDA_VISIBLE_DEVICES | Comma-separated GPU IDs for this node |
| MASTER_ADDR | IP address of rank 0 |
| MASTER_PORT | Communication port |
| WORLD_SIZE | Total number of processes across all nodes |
| RANK | Global rank of each spawned process |
| LOCAL_RANK | Local GPU rank on this node |
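To make the per-process environment concrete, here is a minimal sketch of how a launcher might assemble the variables in this table and the worker command line for the non-elastic path. The helper names `build_worker_env` and `build_worker_cmd`, the example ranks, and the script name are hypothetical; the handling of --module, --no_python, and --no_local_rank follows the argument descriptions in this document and is an assumption about ordering, not a copy of the vendored code.

```python
import os
import sys


def build_worker_env(base_env, master_addr, master_port, world_size,
                     local_gpu_ids, global_rank, local_rank):
    """Return a copy of base_env with the distributed variables filled in."""
    env = dict(base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in local_gpu_ids)
    env["MASTER_ADDR"] = master_addr
    env["MASTER_PORT"] = str(master_port)
    env["WORLD_SIZE"] = str(world_size)
    env["RANK"] = str(global_rank)
    env["LOCAL_RANK"] = str(local_rank)
    return env


def build_worker_cmd(script, script_args, local_rank,
                     no_python=False, module=False, no_local_rank=False):
    """Assemble the command line for one worker process."""
    cmd = []
    if not no_python:
        cmd = [sys.executable, "-u"]
        if module:
            cmd.append("-m")
    elif module:
        raise ValueError("--module needs the Python interpreter")
    cmd.append(script)
    if not no_local_rank:
        cmd.append(f"--local_rank={local_rank}")
    return cmd + list(script_args)


# Hypothetical second node with three GPUs holding global ranks 2..4 of 5.
local_gpu_ids = [0, 1, 2]
global_ranks = [2, 3, 4]
envs = [
    build_worker_env(os.environ, "10.0.0.1", 29500, 5,
                     local_gpu_ids, global_rank, local_rank)
    for local_rank, global_rank in enumerate(global_ranks)
]
cmd = build_worker_cmd("train.py", ["--lr", "0.1"], local_rank=0)
# A launcher would then start each worker, e.g. subprocess.Popen(cmd, env=env).
print(envs[0]["RANK"], envs[0]["LOCAL_RANK"], cmd[1:])
```

Every value is converted to a string because process environments only carry strings; the training script is expected to parse them back into integers.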