

Implementation:FMInference FlexLLMGen DeepSpeed Launch

From Leeroopedia


| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen |
| Domains | Distributed_Training, Process_Management |
| Last Updated | 2026-02-09 00:00 GMT |

Overview

Vendored DeepSpeed per-node launcher that spawns multiple worker sub-processes for distributed training on a single node, managing GPU assignments, environment variables, signal handling, elastic training, and per-rank logging.

Description

launch.py is the per-node component of DeepSpeed's distributed launch system. It runs on each worker node (invoked by the runner) and spawns one worker sub-process per local GPU. Each sub-process receives the correct RANK, LOCAL_RANK, WORLD_SIZE, and CUDA_VISIBLE_DEVICES environment variables.

Key features:

  • Argument parsing -- Accepts node_rank, master_addr, master_port, world_info (base64-encoded JSON mapping hostnames to GPU IDs), and the user's training script with its arguments.
  • Global rank computation -- Decodes the world_info dictionary to compute global ranks from the node list and local GPU IDs. Builds a global_rank_mapping that assigns consecutive global ranks across all nodes.
  • GPU assignment -- Sets CUDA_VISIBLE_DEVICES on each node to the local GPU IDs specified in world_info.
  • Process spawning -- For non-elastic training, spawns one subprocess per local GPU via subprocess.Popen, passing RANK, LOCAL_RANK, and optionally --local_rank as a command-line argument.
  • Elastic training -- When --enable_elastic_training is set, uses PyTorch's elastic agent (DSElasticAgent) with c10d rendezvous instead of direct subprocess spawning. Supports min/max elastic node counts.
  • Per-rank logging -- When --enable_each_rank_log is set, redirects each rank's stdout/stderr to a separate timestamped log file.
  • Signal handling -- Installs SIGINT and SIGTERM handlers that terminate all child process trees using psutil, ensuring clean shutdown on interruption.
  • Process tree termination -- Uses psutil to recursively terminate all descendants of each worker process, so no orphaned processes are left running after shutdown.
  • PID file management -- Optionally writes the launcher PID to /tmp for programmatic process tracking.
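
The world_info decoding and global rank computation described above can be sketched as follows. This is an illustrative reconstruction, not the vendored code: the helper names are hypothetical, and the use of the URL-safe base64 variant is an assumption (the page only says "base64-encoded JSON").

```python
import base64
import json


def encode_world_info(world_info):
    """Encode {hostname: [gpu_ids]} as base64 JSON (URL-safe variant assumed)."""
    raw = json.dumps(world_info).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("utf-8")


def decode_world_info(encoded):
    """Inverse of encode_world_info: returns the {hostname: [gpu_ids]} dict."""
    return json.loads(base64.urlsafe_b64decode(encoded))


def build_global_rank_mapping(world_info):
    """Assign consecutive global ranks node by node, in dictionary order.

    Returns (mapping, world_size), where mapping is {hostname: [global_ranks]}.
    """
    mapping = {}
    next_rank = 0
    for hostname, gpu_ids in world_info.items():
        mapping[hostname] = list(range(next_rank, next_rank + len(gpu_ids)))
        next_rank += len(gpu_ids)
    return mapping, next_rank


if __name__ == "__main__":
    # Hypothetical two-node cluster used only for illustration.
    info = {"worker-0": [0, 1], "worker-1": [0, 1, 2]}
    decoded = decode_world_info(encode_world_info(info))
    print(build_global_rank_mapping(decoded))
```

The node at index node_rank in the decoded dictionary then picks its own entry from the mapping to learn which global ranks it owns.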

This is AUTO_KEEP vendored code from DeepSpeed.

Code Reference

| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/launcher/launch.py |
| Lines | 1-358 |

Key Functions:

def parse_args(): ...

def terminate_process_tree(pid): ...

def main():
    # Decodes world_info, computes ranks, spawns processes
    # Handles elastic training via DSElasticAgent
    # Installs signal handlers for clean shutdown
    ...
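
The clean-shutdown behavior mentioned in main() can be approximated with the standard library alone. The sketch below is a simplification under stated assumptions: the real launcher is described as using psutil to walk each worker's full process tree, whereas this version only handles the direct children it spawned. The function names, the timeout, and the exit code are hypothetical.

```python
import signal
import subprocess
import sys


def terminate_children(processes, timeout=5.0):
    """Ask each child to stop, then force-kill any that outlive the timeout."""
    for proc in processes:
        if proc.poll() is None:  # still running
            proc.terminate()
    for proc in processes:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def install_handlers(processes):
    """Route SIGINT and SIGTERM to a handler that shuts the children down.

    Must be called from the main thread, as required by signal.signal.
    """
    def handler(signum, frame):
        terminate_children(processes)
        sys.exit(1)  # exit code chosen for this sketch

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
    return handler
```

Terminating first and killing only after a grace period gives workers a chance to flush logs and release GPU memory before they are forced to exit.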

I/O Contract

Command-Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --node_rank | int | 0 | Rank of this node in multi-node setup |
| --master_addr | str | 127.0.0.1 | IP address of rank 0 node |
| --master_port | int | 29500 | Communication port for torch.distributed |
| --world_info | str | None | Base64-encoded JSON: {hostname: [gpu_ids]} |
| --module | flag | False | Run training script as Python module (-m) |
| --no_python | flag | False | Execute training script directly without python |
| --enable_elastic_training | flag | False | Use PyTorch elastic agent |
| --no_local_rank | flag | False | Do not pass --local_rank to training script |
| --save_pid | int | 0 | PID of main process for tracking |
| --enable_each_rank_log | str | None | Directory for per-rank log files |
| training_script | str | required | Path to user's training script |
| training_script_args | list | [] | Arguments for the training script |
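
To make the interaction between --module, --no_python, and --no_local_rank concrete, here is a minimal argparse sketch covering a subset of the arguments above, plus a hypothetical build_worker_command helper. The "-u" interpreter flag, the "--local_rank=N" spelling, and the error raised when --module and --no_python are combined are assumptions, not confirmed details of launch.py.

```python
import argparse
import sys


def build_parser():
    """Parser for a subset of the launcher arguments listed in the table."""
    parser = argparse.ArgumentParser(description="per-node launcher sketch")
    parser.add_argument("--node_rank", type=int, default=0)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=int, default=29500)
    parser.add_argument("--world_info", type=str, default=None)
    parser.add_argument("--module", action="store_true")
    parser.add_argument("--no_python", action="store_true")
    parser.add_argument("--no_local_rank", action="store_true")
    parser.add_argument("training_script", type=str)
    # Everything after the script path is forwarded to the script untouched.
    parser.add_argument("training_script_args", nargs=argparse.REMAINDER)
    return parser


def build_worker_command(args, local_rank):
    """Assemble the command line for one worker (hypothetical helper)."""
    cmd = []
    if not args.no_python:
        cmd = [sys.executable, "-u"]  # unbuffered output: an assumption
        if args.module:
            cmd.append("-m")
    elif args.module:
        raise ValueError("--no_python and --module cannot be combined")
    cmd.append(args.training_script)
    if not args.no_local_rank:
        cmd.append(f"--local_rank={local_rank}")
    return cmd + args.training_script_args


if __name__ == "__main__":
    parsed = build_parser().parse_args(["--node_rank", "1", "train.py", "--lr", "0.1"])
    print(build_worker_command(parsed, local_rank=0))
```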

Environment Variables Set

| Variable | Description |
|---|---|
| CUDA_VISIBLE_DEVICES | Comma-separated GPU IDs for this node |
| MASTER_ADDR | IP address of rank 0 |
| MASTER_PORT | Communication port |
| WORLD_SIZE | Total number of processes across all nodes |
| RANK | Global rank of each spawned process |
| LOCAL_RANK | Local GPU rank on this node |
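
As a rough illustration of how these variables might be assembled for each worker, the sketch below builds one environment dictionary per local GPU. The function name and signature are hypothetical; only the variable names and their meanings come from the table above.

```python
import os


def build_worker_envs(local_gpu_ids, global_ranks, world_size,
                      master_addr, master_port, base_env=None):
    """Return one environment dict per worker on this node.

    local_gpu_ids and global_ranks are parallel lists: the worker with
    LOCAL_RANK i receives global_ranks[i] as its RANK.
    """
    shared = dict(os.environ if base_env is None else base_env)
    # Values shared by every worker on the node.
    shared["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in local_gpu_ids)
    shared["MASTER_ADDR"] = master_addr
    shared["MASTER_PORT"] = str(master_port)
    shared["WORLD_SIZE"] = str(world_size)

    envs = []
    for local_rank, global_rank in enumerate(global_ranks):
        env = dict(shared)  # copy so per-rank values do not leak across workers
        env["RANK"] = str(global_rank)
        env["LOCAL_RANK"] = str(local_rank)
        envs.append(env)
    return envs
```

Each dictionary can be passed as the env argument of subprocess.Popen when the corresponding worker is spawned. All values are converted to strings because process environments only carry text.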
