Implementation:FMInference FlexLLMGen DeepSpeed Runner

From Leeroopedia


Field         Value
Sources       Repo: FlexLLMGen
Domains       Distributed_Training, Job_Orchestration
Last Updated  2026-02-09 00:00 GMT

Overview

Vendored DeepSpeed runner that serves as the main front end for launching multi-node distributed training jobs. It handles hostfile parsing, resource filtering, multi-node backend selection (PDSH, OpenMPI, MVAPICH, Slurm), autotuning integration, and environment configuration.

Description

runner.py is the top-level entry point for the deepspeed command-line tool. It orchestrates the entire job launch process from a single command invocation, including resource discovery, validation, and delegation to the appropriate multi-node runner backend.

Key features:

  • Argument parsing -- Accepts hostfile, include/exclude resource filters, num_nodes, num_gpus, master_addr/port, launcher backend selection, module/no_python flags, autotuning mode, elastic training options, and the user script with arguments.
  • Hostfile parsing -- Reads MPI-style hostfiles (hostname slots=N) into an OrderedDict. Falls back to localhost with torch.cuda.device_count() GPUs if no hostfile is found.
  • Resource filtering -- Supports --include and --exclude flags with the syntax NODE[:SLOT,SLOT]@NODE[:SLOT,SLOT] to select or reject specific nodes and GPU slots. Also supports --num_nodes and --num_gpus for simpler subsetting. Respects CUDA_VISIBLE_DEVICES when running on a single node.
  • SSH validation -- For multi-node jobs, validates passwordless SSH to the first host before proceeding.
  • World info encoding -- Encodes the active resource map as base64 JSON (world_info) for passing to the per-node launcher.
  • Single-node path -- For single-node jobs, directly invokes deepspeed.launcher.launch as a subprocess with the encoded world info.
  • Multi-node backends -- For multi-node jobs, delegates to one of four runner backends:
    • PDSHRunner -- Uses pdsh for parallel SSH-based launching.
    • OpenMPIRunner -- Uses mpirun for OpenMPI-based launching.
    • MVAPICHRunner -- Uses MVAPICH's mpirun_rsh.
    • SlurmRunner -- Uses srun for Slurm-based launching.
  • Environment export -- Collects environment variables matching EXPORT_ENVS prefixes (MLFLOW, NCCL, PYTHON, MV2, UCX) and exports them to remote nodes. Also reads .deepspeed_env files for additional exports.
  • Autotuning integration -- When --autotuning is specified, delegates to run_autotuning which runs the Autotuner to discover optimal configurations before (or instead of) running the job.
  • Signal handling -- For PDSH launcher, installs signal handlers that send SIGINT/SIGTERM to the main process and invoke a kill command on remote nodes.
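
The hostfile parsing step described above can be illustrated with a minimal sketch. This is not the vendored fetch_hostfile implementation: the helper name parse_hostfile_lines, the regular expression, and the handling of blank lines, comments, and duplicate hosts are assumptions made for illustration.

```python
import re
from collections import OrderedDict

# Assumed line shape, taken from the description above: "hostname slots=N".
_HOST_LINE = re.compile(r"^(\S+)\s+slots=(\d+)$")


def parse_hostfile_lines(lines):
    """Hypothetical helper: map each host to its slot count, in file order."""
    resource_pool = OrderedDict()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        match = _HOST_LINE.match(line)
        if match is None:
            raise ValueError(f"malformed hostfile line: {line!r}")
        host, slots = match.group(1), int(match.group(2))
        if host in resource_pool:
            raise ValueError(f"duplicate host in hostfile: {host}")
        resource_pool[host] = slots
    return resource_pool


pool = parse_hostfile_lines(["worker-0 slots=4", "", "worker-1 slots=2"])
print(dict(pool))  # {'worker-0': 4, 'worker-1': 2}
```

An ordered mapping keeps the hosts in hostfile order, which matters when --num_nodes selects the top N entries.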

This is AUTO_KEEP vendored code from DeepSpeed.

Code Reference

Field       Value
Repository  FlexLLMGen
File        benchmark/third_party/DeepSpeed/deepspeed/launcher/runner.py
Lines       1-533

Key Functions:

def parse_args(args=None): ...
def fetch_hostfile(hostfile_path): ...
def parse_resource_filter(host_info, include_str="", exclude_str=""): ...
def parse_inclusion_exclusion(resource_pool, inclusion, exclusion): ...
def encode_world_info(world_info): ...
def run_autotuning(args, active_resources): ...
def main(args=None): ...
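
As an illustration of the world info encoding that encode_world_info performs, the sketch below serializes a resource map to JSON and base64-encodes it, then decodes it again. The function names with the _sketch suffix are hypothetical, and the use of the URL-safe base64 alphabet is an assumption; the vendored function may differ in detail.

```python
import base64
import json


def encode_world_info_sketch(world_info):
    """Hypothetical stand-in: JSON-serialize, then base64-encode to a str."""
    payload = json.dumps(world_info).encode("utf-8")
    # Assumption: URL-safe alphabet, so the result is safe on a command line.
    return base64.urlsafe_b64encode(payload).decode("utf-8")


def decode_world_info_sketch(encoded):
    """Inverse of the sketch above, as a per-node launcher might apply it."""
    payload = base64.urlsafe_b64decode(encoded.encode("utf-8"))
    return json.loads(payload.decode("utf-8"))


# Example resource map: host name -> list of active GPU slot ids.
world_info = {"worker-0": [0, 1], "worker-1": [0, 1, 2, 3]}
encoded = encode_world_info_sketch(world_info)
assert decode_world_info_sketch(encoded) == world_info
```

Encoding the map as a single opaque string lets it travel as one command-line argument without any shell quoting of braces, quotes, or spaces.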

I/O Contract

Command-Line Arguments

Argument            Type  Default        Description
-H/--hostfile       str   /job/hostfile  MPI-style hostfile path
-i/--include        str   ""             Resource inclusion filter (NODE[:SLOT,SLOT]@NODE[:SLOT,SLOT])
-e/--exclude        str   ""             Resource exclusion filter (mutually exclusive with --include)
--num_nodes         int   -1             Number of nodes to use (top N from hostfile)
--num_gpus          int   -1             Max GPUs per node
--launcher          str   pdsh           Backend: pdsh, openmpi, mvapich, slurm
--autotuning        str   ""             "tune" or "run" to enable autotuning
--elastic_training  flag  False          Enable elastic training
user_script         str   required       Training script path
user_args           list  []             Training script arguments
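
To make the inclusion filter syntax concrete, here is a minimal sketch that applies an --include style string to a resource pool. It is a simplification and not the vendored parse_resource_filter or parse_inclusion_exclusion code: the function name apply_include_filter is hypothetical, exclusion is not handled, and the result follows the order of the filter string rather than the hostfile.

```python
from collections import OrderedDict


def apply_include_filter(resource_pool, include_str):
    """Hypothetical helper: keep only the hosts and slots named in include_str.

    resource_pool maps host name -> list of available slot ids.
    include_str uses NODE[:SLOT,SLOT]@NODE[:SLOT,SLOT].
    """
    if not include_str:
        # No filter given: every host and slot stays active.
        return OrderedDict((h, list(s)) for h, s in resource_pool.items())

    selected = OrderedDict()
    for node_spec in include_str.split("@"):
        host, _, slot_part = node_spec.partition(":")
        if host not in resource_pool:
            raise ValueError(f"host {host!r} not found in resource pool")
        if slot_part:
            slots = [int(s) for s in slot_part.split(",")]
            for slot in slots:
                if slot not in resource_pool[host]:
                    raise ValueError(f"slot {slot} not available on {host!r}")
        else:
            slots = list(resource_pool[host])  # bare host: all of its slots
        selected[host] = slots
    return selected


pool = OrderedDict([("worker-0", [0, 1, 2, 3]), ("worker-1", [0, 1, 2, 3])])
active = apply_include_filter(pool, "worker-0@worker-1:0,2")
print(dict(active))  # {'worker-0': [0, 1, 2, 3], 'worker-1': [0, 2]}
```

In this example, worker-0 contributes all four slots because no slot list follows its name, while worker-1 contributes only slots 0 and 2.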

Outputs

Output     Type  Description
exit code  int   0 on success, non-zero on failure (propagated from child process)
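
The exit code contract can be sketched as follows, assuming the runner waits for the launched child process and reports its return code. The helper run_and_propagate is hypothetical; the child commands here are trivial Python one-liners standing in for a real launch command.

```python
import subprocess
import sys


def run_and_propagate(cmd):
    """Hypothetical helper: run a child command and return its exit code."""
    result = subprocess.run(cmd)
    return result.returncode


# Two stand-in child processes: one succeeds, one fails with code 3.
ok = run_and_propagate([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_and_propagate([sys.executable, "-c", "raise SystemExit(3)"])
print(ok, failed)  # 0 3

# A front-end following this contract would end with something like:
#     sys.exit(run_and_propagate(launch_cmd))
```

Passing the child's code through unchanged lets schedulers and shell scripts tell a failed training job apart from a successful one.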
