Principle:FMInference FlexLLMGen Autotuning Experiment Scheduling

Field	Value
Sources	Paper: FlexGen, DeepSpeed Autotuning Documentation
Domains	Autotuning, Distributed_Training, Resource_Management
Last Updated	2026-02-09 00:00 GMT

Overview

A resource-aware scheduling strategy that systematically explores the DeepSpeed configuration space by running multiple training experiments in parallel across GPU nodes and then selecting the configuration with the highest measured throughput.

Description

Autotuning experiment scheduling addresses the challenge of finding optimal DeepSpeed configurations for distributed training workloads. The configuration space (batch sizes, ZeRO stages, offloading settings, etc.) is too large for manual exploration, so an automated scheduler runs candidate configurations as independent experiments and compares their measured throughput.

Key characteristics of this approach:

Queue-based dispatching -- Experiments are placed in a FIFO queue. A main loop repeatedly dequeues experiments and attempts to allocate GPU resources. If resources are unavailable, the experiment is returned to the front of the queue, and the scheduler waits for running experiments to finish.
Slot-based resource management -- Each node tracks a list of idle GPU slots. Experiments request a specific number of GPUs and nodes. Slots are reserved atomically and restored when experiments complete. This prevents over-subscription of GPU resources.
Thread-per-experiment execution -- Each experiment runs in its own thread, allowing multiple experiments to execute concurrently when sufficient resources are available. This maximizes hardware utilization during the search.
Idempotent re-execution -- Experiments that have already completed successfully (with a result file present) are skipped. Interrupted experiments (detected by KeyboardInterrupt in stderr) are re-executed. This makes the scheduler fault-tolerant and restartable.
Throughput-based selection -- After all experiments finish, metric files are parsed to find the configuration that achieved the highest throughput, which becomes the recommended DeepSpeed configuration.
Distributed cleanup -- After each experiment, pdsh is used to kill residual processes across all participating nodes, preventing resource leaks.
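
The dispatch loop, slot reservation, thread-per-experiment execution, and skip check described above can be sketched as follows. This is a minimal illustration, not DeepSpeed's actual scheduler code: the names ResourceManager, already_done, and run_all, the experiment dictionary keys, and the file names metrics.json and stderr.log are all assumptions made for the example. The pdsh cleanup step is omitted.

```python
import threading
import time
from collections import deque
from pathlib import Path


class ResourceManager:
    """Tracks idle GPU slots per node; a reservation is all-or-nothing."""

    def __init__(self, hosts):
        # hosts maps node name -> GPU count, e.g. {"node0": 4} (hypothetical layout).
        self.total = dict(hosts)
        self.idle = {name: list(range(n)) for name, n in hosts.items()}
        self.lock = threading.Lock()

    def reserve(self, num_nodes, num_gpus):
        """Return {node: [slot ids]}, or None if the request cannot be met now."""
        with self.lock:
            fits = [n for n, slots in self.idle.items() if len(slots) >= num_gpus]
            if len(fits) < num_nodes:
                return None  # nothing was taken, so nothing needs rolling back
            grant = {}
            for node in fits[:num_nodes]:
                grant[node] = self.idle[node][:num_gpus]
                self.idle[node] = self.idle[node][num_gpus:]
            return grant

    def release(self, grant):
        with self.lock:
            for node, slots in grant.items():
                self.idle[node] = sorted(self.idle[node] + slots)


def already_done(result_dir, metric_name="metrics.json", stderr_name="stderr.log"):
    """True if a result file exists and stderr shows no KeyboardInterrupt.

    Both file names are assumed for this sketch.
    """
    result_dir = Path(result_dir)
    if not (result_dir / metric_name).exists():
        return False
    err = result_dir / stderr_name
    return not (err.exists() and "KeyboardInterrupt" in err.read_text())


def run_all(experiments, resources, run_fn, poll_s=0.01):
    """Dispatch experiments in FIFO order, one thread per running experiment."""
    experiments = list(experiments)
    for exp in experiments:
        # Reject requests that exceed total capacity; they would block forever.
        capable = sum(n >= exp["num_gpus"] for n in resources.total.values())
        if capable < exp["num_nodes"]:
            raise ValueError(f"{exp['name']} can never be scheduled")
    queue, threads, results = deque(experiments), [], {}

    def worker(exp, grant):
        try:
            results[exp["name"]] = run_fn(exp, grant)
        finally:
            resources.release(grant)  # always restore slots, even on failure

    while queue:
        exp = queue.popleft()
        grant = resources.reserve(exp["num_nodes"], exp["num_gpus"])
        if grant is None:
            queue.appendleft(exp)  # back to the front, so FIFO order is preserved
            time.sleep(poll_s)     # wait for a running experiment to finish
            continue
        thread = threading.Thread(target=worker, args=(exp, grant))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return results
```

A caller would typically filter the experiment list with already_done before passing it to run_all, so that a restarted search only re-runs missing or interrupted experiments. Here run_fn stands in for whatever launches the training job and returns its result.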

Usage

Use autotuning experiment scheduling when deploying DeepSpeed training workloads on new hardware or with new model architectures where the optimal configuration is unknown. The scheduler is invoked via the DeepSpeed CLI with the --autotuning flag.

The scheduling approach is most valuable when:

The configuration space includes multiple ZeRO stages and offloading options.
The hardware has heterogeneous memory tiers (GPU, CPU, NVMe).
Batch size tuning interacts with memory optimizations in non-obvious ways.

Theoretical Basis

The scheduling strategy is a form of grid search over a discretized configuration space. Each candidate configuration is evaluated by running the actual training workload and measuring wall-clock throughput. The measurement reflects hardware-specific effects (memory bandwidth, kernel efficiency, communication overhead) that analytical models tend to predict poorly.
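
As a rough illustration of grid search with throughput-based selection, the sketch below enumerates a small discretized space and keeps the best-scoring candidate. The keys zero_stage and micro_batch_size and the toy_measure function are hypothetical stand-ins: in a real search the measurement would come from running the training job, and failed runs (for example out-of-memory) would produce no metric.

```python
import itertools


def grid(space):
    """Yield every combination of the listed values as a config dict."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def best_config(candidates, measure):
    """Return (throughput, config) for the fastest candidate, or None."""
    scored = []
    for cfg in candidates:
        throughput = measure(cfg)
        if throughput is not None:  # None marks a failed run
            scored.append((throughput, cfg))
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])


def toy_measure(cfg):
    """Made-up throughput model, used only to exercise the search."""
    if cfg["zero_stage"] == 0 and cfg["micro_batch_size"] == 8:
        return None  # pretend this combination runs out of memory
    return cfg["micro_batch_size"] * 10 - cfg["zero_stage"] * 3


space = {"zero_stage": [0, 1, 2, 3], "micro_batch_size": [1, 2, 4, 8]}
best = best_config(grid(space), toy_measure)
```

With this toy model the largest batch size wins only at a nonzero ZeRO stage, which mirrors the kind of interaction between batch size and memory optimizations that motivates measuring rather than guessing.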

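The scheduler uses a greedy resource allocation policy: experiments are dispatched in FIFO order as soon as resources become available. This does not minimize total search time in general, because an experiment at the head of the queue that is waiting for a large allocation also holds back smaller experiments behind it that could already run. The policy is nevertheless simple and predictable, and it gives good utilization when experiments have similar resource requirements and durations.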
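
Why strict FIFO dispatch is not optimal for total search time can be seen in a toy simulation. The sketch below models a single node, treats each experiment as a (gpus, duration) pair, and computes the finish time of the last experiment when jobs are started strictly in list order. The function name and the job values are invented for the example.

```python
import heapq


def fifo_makespan(jobs, total_gpus):
    """Finish time of the last job when jobs start strictly in list order.

    jobs is a list of (gpus, duration) pairs on one node with total_gpus slots.
    """
    free = total_gpus
    now = 0
    running = []  # min-heap of (finish_time, gpus_held)
    for gpus, duration in jobs:
        if gpus > total_gpus:
            raise ValueError("job requests more GPUs than the node has")
        while free < gpus:
            # The head of the queue blocks until enough running jobs finish.
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        heapq.heappush(running, (now + duration, gpus))
        free -= gpus
    return max([now] + [finish for finish, _ in running])


# Same three jobs on a 4-GPU node, in two different queue orders.
blocked = fifo_makespan([(2, 10), (4, 1), (2, 10)], total_gpus=4)
packed = fifo_makespan([(2, 10), (2, 10), (4, 1)], total_gpus=4)
```

In the first order the 4-GPU job waits behind a 2-GPU job and leaves two GPUs idle, so the third job cannot start until time 11. In the second order the two 2-GPU jobs run side by side. When all experiments request the same resources and run for similar durations, the order makes little difference, which is the case where the greedy policy works well.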

Related Pages

Implementation:FMInference_FlexLLMGen_DeepSpeed_Autotuning_Scheduler
