Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:FMInference FlexLLMGen Distributed Job Runner

From Leeroopedia


Field Value
Sources Paper: FlexGen, DeepSpeed Launcher Documentation
Domains Distributed_Training, Job_Orchestration
Last Updated 2026-02-09 00:00 GMT

Overview

A job orchestration pattern that provides a single-command entry point for launching distributed training across multi-node GPU clusters, abstracting away the complexity of resource discovery, node selection, environment setup, and multi-node launcher backend selection.

Description

Distributed job runners solve the problem of going from deepspeed train.py --deepspeed_config ds_config.json to N processes running across M nodes, each correctly configured for distributed communication. The runner handles the entire orchestration pipeline:

  • Resource discovery -- The runner reads an MPI-style hostfile to discover available nodes and their GPU counts. If no hostfile exists, it falls back to local resources detected via torch.cuda.device_count(). This dual-path design supports both cluster and single-machine deployments with the same command.
  • Resource filtering -- Include/exclude filters provide fine-grained control over which nodes and GPU slots participate. The syntax (worker-0@worker-1:0,2) allows targeting specific machines and GPUs. This is essential for shared clusters where not all resources are available or desired.
  • Backend abstraction -- The runner delegates multi-node execution to interchangeable backend runners (PDSH, OpenMPI, MVAPICH, Slurm). Each backend implements a common interface (get_cmd, add_export, backend_exists) while using the cluster's native job launching mechanism. This makes DeepSpeed portable across cluster types.
  • Environment propagation -- The runner collects relevant environment variables (NCCL settings, Python paths, ML framework variables) and ensures they are exported to all remote nodes. It also reads .deepspeed_env files for site-specific configuration.
  • World info encoding -- The active resource map is serialized as base64 JSON and passed to the per-node launcher via command-line arguments. This avoids the need for shared filesystem access or distributed configuration services.
  • Autotuning integration -- The runner optionally invokes the DeepSpeed autotuner before (--autotuning tune) or as part of (--autotuning run) the job launch, enabling automatic configuration optimization.
  • Elastic training -- When enabled, the runner configures min/max node counts and delegates to PyTorch's elastic agent for dynamic scaling.

Usage

Use the distributed job runner as the standard entry point for all DeepSpeed training and inference jobs. It replaces the need for custom launch scripts, mpirun commands, or manual environment setup.

Key decision points:

  • Launcher backend -- Use PDSH for simple SSH-based clusters. Use OpenMPI/MVAPICH for MPI-native clusters. Use Slurm for HPC environments with a job scheduler.
  • Resource selection -- Use hostfile for persistent cluster configurations. Use --include/--exclude for ad-hoc resource selection. Use --num_nodes/--num_gpus for simple subsetting.
  • Single vs. multi-node -- The runner automatically detects single-node execution and takes a simplified code path, avoiding unnecessary SSH checks.

Theoretical Basis

The runner implements the Facade pattern, providing a simple interface to the complex subsystem of distributed job launching. It follows the Strategy pattern for backend selection, allowing the launcher backend to be chosen at runtime.

The two-tier architecture (runner + per-node launcher) maps to the hierarchical process management model common in HPC systems:

User command
  -> Runner (rank 0 node): resource discovery, backend selection
    -> Backend (pdsh/mpi/slurm): cross-node process creation
      -> Launcher (each node): per-GPU process spawning
        -> Worker (each GPU): training script execution

This hierarchy ensures clean separation of concerns: cross-node orchestration, intra-node process management, and training script execution are each handled by a dedicated component.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment