Implementation:FMInference FlexLLMGen DeepSpeed Autotuning Scheduler

Field	Value
Sources	Repo: FlexLLMGen
Domains	Autotuning, Distributed_Training, Resource_Management
Last Updated	2026-02-09 00:00 GMT

Overview

Vendored DeepSpeed experiment scheduler that manages the lifecycle of autotuning experiments, including resource allocation, multi-threaded job execution, and result parsing.

Description

scheduler.py provides the ResourceManager class, which orchestrates the execution of DeepSpeed autotuning experiments across a pool of GPU nodes. It operates as a multi-threaded scheduler where a main loop dispatches experiments from a queue onto available GPU resources, and each experiment runs in its own thread.

The module also defines the supporting classes Node and Reservation for GPU slot management, and standalone functions for experiment execution (run_experiment) and cleanup (clean_up).
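The slot bookkeeping performed by Node and Reservation could look roughly like the following. This is a minimal sketch, not the vendored implementation: only the two constructor signatures come from the Code Reference, while idle_slots, reserve_slots, restore_slots, and request_resources are hypothetical names chosen for illustration.

```python
class Node:
    """One host with a fixed number of GPU slots."""

    def __init__(self, host, max_slots):
        self.host = host
        self.max_slots = max_slots
        # Assumption: idle slots are tracked as a list of slot indices.
        self.idle_slots = list(range(max_slots))

    def reserve_slots(self, slot_request):
        """Return reserved slot indices, or None if too few are idle."""
        if len(self.idle_slots) < slot_request:
            return None
        reserved = self.idle_slots[:slot_request]
        self.idle_slots = self.idle_slots[slot_request:]
        return reserved

    def restore_slots(self, slots):
        """Give previously reserved slots back to the node."""
        self.idle_slots.extend(slots)


class Reservation:
    """A set of slots held on one node for one experiment."""

    def __init__(self, node, slots):
        self.node = node
        self.slots = slots

    def restore_slots(self):
        self.node.restore_slots(self.slots)


def request_resources(nodes, num_nodes, gpus_per_node):
    """All-or-nothing reservation (hypothetical helper).

    Reserves gpus_per_node slots on num_nodes distinct nodes, or rolls
    back any partial reservation and returns None.
    """
    reservations = []
    for node in nodes:
        if len(reservations) == num_nodes:
            break
        slots = node.reserve_slots(gpus_per_node)
        if slots is not None:
            reservations.append(Reservation(node, slots))
    if len(reservations) < num_nodes:
        for reservation in reservations:
            reservation.restore_slots()
        return None
    return reservations
```

Rolling back partial reservations keeps a request that cannot be fully satisfied from holding slots that another experiment could use.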

Key behaviors:

Experiment queueing -- Reads experiment configuration files (hjson format), assigns experiment IDs, and enqueues them. Experiments whose results already exist are skipped, except those that were interrupted, which are queued again.
Resource allocation -- The resource_request method attempts to reserve GPU slots across nodes. If resources are insufficient, the experiment is placed back at the front of the queue.
Threaded execution -- Each experiment runs in a separate thread via run_job, which builds a DeepSpeed launch command and invokes it as a subprocess.
Result parsing -- After all experiments complete, parse_results reads the metric file of each finished experiment and identifies the configuration with the highest value for the requested metric, typically throughput.
Cleanup -- Uses pdsh to kill experiment processes across distributed nodes after completion or failure.
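The queueing, allocation, and threaded-execution behaviors above can be sketched as a small dispatch loop. This is an illustration under stated assumptions: the pool is reduced to a single slot counter, the experiment keys "name" and "num_gpus" are hypothetical, and run_fn stands in for the subprocess launch that the real scheduler performs.

```python
import threading


def run_scheduler(experiments, total_slots, run_fn):
    """Dispatch experiments onto a fixed pool of slots, one thread each.

    Returns a dict mapping experiment name to (experiment, error).
    """
    queue = list(experiments)
    for exp in queue:
        # Guard: an experiment larger than the pool could never be scheduled.
        if exp["num_gpus"] > total_slots:
            raise ValueError(f"{exp['name']} needs more slots than the pool has")

    idle = total_slots
    running = {}   # name -> (thread, experiment)
    errors = {}    # name -> error string or None, written by workers
    finished = {}  # name -> (experiment, error)

    def worker(exp):
        try:
            run_fn(exp)
            errors[exp["name"]] = None
        except Exception as exc:
            errors[exp["name"]] = str(exc)

    def check():
        """Join finished threads, record results, and free their slots."""
        nonlocal idle
        for name in list(running):
            thread, exp = running[name]
            thread.join(timeout=0.01)
            if not thread.is_alive():
                finished[name] = (exp, errors.get(name))
                idle += exp["num_gpus"]
                del running[name]

    while queue:
        exp = queue.pop(0)
        if exp["num_gpus"] > idle:
            # Not enough free slots: put it back at the front and wait.
            queue.insert(0, exp)
            check()
            continue
        idle -= exp["num_gpus"]
        thread = threading.Thread(target=worker, args=(exp,))
        thread.start()
        running[exp["name"]] = (thread, exp)

    # Everything is dispatched; wait for the remaining threads.
    while running:
        check()
    return finished
```

Re-inserting at the front of the queue preserves the original experiment order, at the cost of letting one large experiment block smaller ones behind it.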

This file is vendored from DeepSpeed and is included in the FlexLLMGen benchmark infrastructure.

Code Reference

Field	Value
Repository	FlexLLMGen
File	benchmark/third_party/DeepSpeed/deepspeed/autotuning/scheduler.py
Lines	1-444

Key Classes and Functions:

class ResourceManager:
    def __init__(self, args, hosts, num_gpus_per_node, results_dir, exps_dir, arg_mappings):
        ...

    def schedule_experiments(self, exp_paths):
        ...

    def run_job(self, exp: dict, reservations):
        ...

    def run(self):
        ...

    def parse_results(self, metric):
        ...

class Node:
    def __init__(self, host, max_slots):
        ...

class Reservation:
    def __init__(self, node, slots):
        ...

def run_experiment(exp: dict, reservations, user_script, user_args):
    ...

def clean_up(exp: dict, reservations):
    ...

I/O Contract

Inputs

Parameter	Type	Required	Description
args	Namespace	Yes	Command-line arguments including master_port, user_script, user_args
hosts	list	Yes	List of hostnames in the resource pool
num_gpus_per_node	int	Yes	Number of GPU slots per node
results_dir	str	Yes	Directory to store experiment results
exps_dir	str	Yes	Directory containing experiment configuration files
arg_mappings	dict	No	Mapping of experiment config keys to user argument flags; the argument is positional and may be None when no mappings are needed

Outputs

Output	Type	Description
finished_experiments	dict	Mapping of experiment IDs to (experiment_config, error) tuples
parse_results return	tuple	(best_experiment_config, max_throughput) from all finished experiments
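The shape of the two outputs suggests how the best configuration is selected. A minimal sketch follows, assuming each experiment dict carries a hypothetical "metric_path" key and that metric files are plain JSON; the vendored code reads hjson, and json is swapped in here only to keep the example dependency-free.

```python
import json
import os


def pick_best(finished_experiments, metric):
    """Return (best_experiment, best_value) for the given metric.

    finished_experiments maps experiment IDs to (experiment, error)
    tuples. Failed experiments and missing metric files are skipped.
    Returns (None, None) if no experiment produced a usable result.
    """
    best_exp, best_value = None, None
    for exp_id, (exp, err) in finished_experiments.items():
        if err:
            continue  # the experiment failed, so it has no valid metric
        path = exp["metric_path"]  # hypothetical key, for illustration
        if not os.path.exists(path):
            continue
        with open(path) as f:
            results = json.load(f)
        value = results[metric]
        if best_value is None or value > best_value:
            best_exp, best_value = exp, value
    return best_exp, best_value
```

A caller would pass the finished_experiments mapping and a metric name such as "throughput", then read the winning configuration from the first element of the returned tuple.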

Internal Workflow

schedule_experiments loads hjson experiment configs from file paths, skipping duplicates and completed runs.
run iterates the experiment queue, calling resource_request to allocate GPU slots across nodes.
When resources are available, run_job spawns a thread calling run_experiment.
run_experiment serializes the DeepSpeed config (base64-encoded), constructs a deepspeed CLI command, and runs it via subprocess.
experiment_check periodically joins threads, collects results, and restores GPU slots.
parse_results reads metric files from finished experiments to select the optimal configuration.
clean_up uses pdsh to kill processes on remote nodes after experiment completion.
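The serialization step in run_experiment can be illustrated with a short round trip. This is a sketch under assumptions: the URL-safe base64 variant, the helper names, and the exact argument layout of the command are illustrative, and how the encoded string is handed to the user script is not shown.

```python
import base64
import json


def encode_config(ds_config):
    """Serialize a config dict to a base64 string safe for a command line."""
    raw = json.dumps(ds_config).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("utf-8")


def decode_config(encoded):
    """Inverse of encode_config: recover the original config dict."""
    raw = base64.urlsafe_b64decode(encoded.encode("utf-8"))
    return json.loads(raw.decode("utf-8"))


def build_command(master_port, user_script, user_args):
    """Assemble an argument list suitable for subprocess.Popen.

    Assumption: the real scheduler also passes host and slot selection
    derived from the reservations; that part is omitted here.
    """
    return ["deepspeed", "--master_port", str(master_port), user_script] + list(user_args)
```

Encoding the config avoids quoting problems when a nested JSON document has to travel through a shell or a remote launcher as a single argument.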

Related Pages

Principle:FMInference_FlexLLMGen_Autotuning_Experiment_Scheduling
