Implementation:FMInference FlexLLMGen DeepSpeed Runner

Field	Value
Sources	Repo: FlexLLMGen
Domains	Distributed_Training, Job_Orchestration
Last Updated	2026-02-09 00:00 GMT

Overview

Vendored copy of the DeepSpeed runner, the main front end for launching distributed training jobs on one or more nodes. It handles hostfile parsing, resource filtering, multi-node backend selection (PDSH, OpenMPI, MVAPICH, Slurm), autotuning integration, and environment configuration.

Description

runner.py is the top-level entry point for the deepspeed command-line tool. It orchestrates the entire job launch process from a single command invocation, including resource discovery, validation, and delegation to the appropriate multi-node runner backend.

Key features:

Argument parsing -- Accepts hostfile, include/exclude resource filters, num_nodes, num_gpus, master_addr/port, launcher backend selection, module/no_python flags, autotuning mode, elastic training options, and the user script with arguments.
Hostfile parsing -- Reads MPI-style hostfiles (hostname slots=N) into an OrderedDict. Falls back to localhost with torch.cuda.device_count() GPUs if no hostfile is found.
Resource filtering -- Supports --include and --exclude flags with the syntax NODE[:SLOT,SLOT]@NODE[:SLOT,SLOT] to select or reject specific nodes and GPU slots. Also supports --num_nodes and --num_gpus for simpler subsetting. Respects CUDA_VISIBLE_DEVICES when running on a single node.
SSH validation -- For multi-node jobs, validates passwordless SSH to the first host before proceeding.
World info encoding -- Encodes the active resource map as base64 JSON (world_info) for passing to the per-node launcher.
Single-node path -- For single-node jobs, directly invokes deepspeed.launcher.launch as a subprocess with the encoded world info.
Multi-node backends -- For multi-node jobs, delegates to one of four runner backends:
- PDSHRunner -- Uses pdsh for parallel SSH-based launching.
- OpenMPIRunner -- Uses mpirun for OpenMPI-based launching.
- MVAPICHRunner -- Uses MVAPICH's mpirun_rsh.
- SlurmRunner -- Uses srun for Slurm-based launching.
Environment export -- Collects environment variables matching EXPORT_ENVS prefixes (MLFLOW, NCCL, PYTHON, MV2, UCX) and exports them to remote nodes. Also reads .deepspeed_env files for additional exports.
Autotuning integration -- When --autotuning is specified, delegates to run_autotuning which runs the Autotuner to discover optimal configurations before (or instead of) running the job.
Signal handling -- For the PDSH launcher, installs SIGINT and SIGTERM handlers that signal the main launched process and then run a kill command on the remote nodes.
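
The hostfile format mentioned above (hostname slots=N, one host per line) can be illustrated with a minimal sketch. The helper name parse_hostfile_lines, the comment handling, and the error behavior are assumptions made for illustration; this is not the vendored fetch_hostfile implementation.

```python
import re
from collections import OrderedDict


def parse_hostfile_lines(lines):
    """Parse 'hostname slots=N' lines into an ordered host -> slot-count map.

    Hypothetical helper for illustration only.
    """
    resource_pool = OrderedDict()
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        match = re.fullmatch(r"(\S+)\s+slots=(\d+)", line)
        if match is None:
            raise ValueError(f"malformed hostfile line: {line!r}")
        host, slots = match.group(1), int(match.group(2))
        if host in resource_pool:
            raise ValueError(f"host listed more than once: {host}")
        resource_pool[host] = slots
    return resource_pool


pool = parse_hostfile_lines(["worker-0 slots=4", "", "worker-1 slots=2"])
print(list(pool.items()))
```

Using an OrderedDict keeps hosts in hostfile order, which matters when a later step selects the first N nodes.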

This is AUTO_KEEP vendored code from DeepSpeed.
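
As a rough illustration of the --include syntax NODE[:SLOT,SLOT]@NODE[:SLOT,SLOT], the sketch below selects slots from a host-to-slot-count map. The function names parse_filter and apply_include are hypothetical, and the validation rules are assumptions; the vendored parse_inclusion_exclusion may behave differently in edge cases.

```python
from collections import OrderedDict


def parse_filter(spec):
    """Split 'NODE[:SLOT,SLOT]@NODE[:SLOT,SLOT]' into host -> slot list.

    An empty slot list means "the whole node". Hypothetical helper.
    """
    parsed = OrderedDict()
    for node_spec in spec.split("@"):
        if ":" in node_spec:
            host, slot_str = node_spec.split(":", 1)
            slots = [int(s) for s in slot_str.split(",")]
        else:
            host, slots = node_spec, []
        parsed[host] = slots
    return parsed


def apply_include(resource_pool, include_spec):
    """Return host -> sorted slot indices selected by include_spec.

    resource_pool maps host -> number of slots, as read from a hostfile.
    """
    selected = OrderedDict()
    for host, slots in parse_filter(include_spec).items():
        if host not in resource_pool:
            raise ValueError(f"unknown host in filter: {host}")
        available = list(range(resource_pool[host]))
        chosen = slots or available  # no slots listed: take every slot
        for slot in chosen:
            if slot not in available:
                raise ValueError(f"no slot {slot} on host {host}")
        selected[host] = sorted(chosen)
    return selected


pool = OrderedDict([("worker-0", 4), ("worker-1", 4)])
active = apply_include(pool, "worker-0@worker-1:0,2")
print(dict(active))
```

Here worker-0 is included in full, while only slots 0 and 2 of worker-1 are selected. An exclusion filter would follow the same parsing step and remove the listed slots instead.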

Code Reference

Field	Value
Repository	FlexLLMGen
File	benchmark/third_party/DeepSpeed/deepspeed/launcher/runner.py
Lines	1-533

Key Functions:

def parse_args(args=None): ...
def fetch_hostfile(hostfile_path): ...
def parse_resource_filter(host_info, include_str="", exclude_str=""): ...
def parse_inclusion_exclusion(resource_pool, inclusion, exclusion): ...
def encode_world_info(world_info): ...
def run_autotuning(args, active_resources): ...
def main(args=None): ...
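
The world info encoding step (base64 over JSON) can be sketched as a round trip. The function names below carry a _sketch suffix to mark them as illustrative, and the choice of the URL-safe base64 alphabet is an assumption rather than something this page confirms.

```python
import base64
import json


def encode_world_info_sketch(world_info):
    """Serialize a host -> slot-list map to JSON, then base64-encode it."""
    payload = json.dumps(world_info).encode("utf-8")
    # Assumption: URL-safe alphabet, so the result is safe on a command line.
    return base64.urlsafe_b64encode(payload).decode("utf-8")


def decode_world_info_sketch(encoded):
    """Inverse of encode_world_info_sketch, as a per-node launcher might do."""
    return json.loads(base64.urlsafe_b64decode(encoded))


world_info = {"worker-0": [0, 1], "worker-1": [0, 2]}
encoded = encode_world_info_sketch(world_info)
print(decode_world_info_sketch(encoded) == world_info)
```

Encoding the map as a single opaque string lets it travel as one command-line argument without shell quoting problems.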

I/O Contract

Command-Line Arguments

Argument	Type	Default	Description
-H/--hostfile	str	/job/hostfile	MPI-style hostfile path
-i/--include	str	""	Resource inclusion filter (NODE[:SLOT,SLOT]@NODE[:SLOT,SLOT])
-e/--exclude	str	""	Resource exclusion filter (mutually exclusive with include)
--num_nodes	int	-1	Number of nodes to use (top N from hostfile)
--num_gpus	int	-1	Max GPUs per node
--launcher	str	pdsh	Backend: pdsh, openmpi, mvapich, slurm
--autotuning	str	""	"tune" or "run" to enable autotuning
--elastic_training	flag	False	Enable elastic training
user_script	str	required	Training script path
user_args	list	[]	Training script arguments
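
A minimal argparse sketch that mirrors the table above may help show how the trailing user script and its arguments are captured. It covers only the listed arguments with the listed defaults; build_parser_sketch is a hypothetical name, and the real parse_args defines more options than appear here.

```python
import argparse


def build_parser_sketch():
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the runner CLI")
    parser.add_argument("-H", "--hostfile", type=str, default="/job/hostfile")
    parser.add_argument("-i", "--include", type=str, default="")
    parser.add_argument("-e", "--exclude", type=str, default="")
    parser.add_argument("--num_nodes", type=int, default=-1)
    parser.add_argument("--num_gpus", type=int, default=-1)
    parser.add_argument("--launcher", type=str, default="pdsh")
    parser.add_argument("--autotuning", type=str, default="")
    parser.add_argument("--elastic_training", action="store_true")
    # The first positional is the training script; everything after it is
    # passed through untouched as the script's own arguments.
    parser.add_argument("user_script", type=str)
    parser.add_argument("user_args", nargs=argparse.REMAINDER)
    return parser


args = build_parser_sketch().parse_args(
    ["--num_gpus", "2", "train.py", "--epochs", "3"]
)
print(args.user_script, args.user_args, args.num_gpus)
```

Because user_args uses argparse.REMAINDER, options such as --epochs that follow the script name are not interpreted by the runner itself.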

Outputs

Output	Type	Description
exit code	int	0 on success, non-zero on failure (propagated from child process)
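
Exit-code propagation from a child process can be sketched as follows. This is a generic pattern under the assumption that the runner waits on a single child; run_and_propagate is a hypothetical helper, not a function from runner.py.

```python
import subprocess
import sys


def run_and_propagate(cmd):
    """Run cmd as a child process, wait for it, and return its exit code."""
    process = subprocess.Popen(cmd)
    process.wait()
    return process.returncode


# A child that exits with status 3 stands in for a failing training job.
code = run_and_propagate([sys.executable, "-c", "import sys; sys.exit(3)"])
print(code)
```

A command-line front end would typically finish with sys.exit(code) so that schedulers and shell scripts see the child's status.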

Related Pages

Principle:FMInference_FlexLLMGen_Distributed_Job_Runner
