Implementation:FMInference FlexLLMGen DeepSpeed Autotuner

From Leeroopedia


Knowledge Sources
Domains Deep Learning, Hyperparameter Tuning, Distributed Training, Automation
Last Updated 2026-02-09 12:00 GMT

Overview

The DeepSpeed Autotuner class automatically discovers optimal training configurations by systematically exploring ZeRO optimization stages, micro-batch sizes, and other DeepSpeed parameters through resource-managed experiment execution.

Description

The Autotuner class implements a multi-phase automated tuning system that finds the best DeepSpeed configuration for a given model and hardware setup. The tuning process proceeds as follows:

Phase 1 - Model profiling: A profiling run (model_info_profile_run()) executes the model with a minimal-memory configuration to collect metadata: the total number of parameters, the number of trainable parameters, and the activation memory per GPU per micro-batch. This information drives the subsequent memory-aware decisions.

Phase 2 - ZeRO stage exploration: The tuner iterates through ZeRO stages 0, 1, 2, and 3, checking memory feasibility at each stage:

  • Memory estimation: get_instantiation_memory_required_per_gpu() calculates the memory needed for parameters, gradients, and optimizer states at each ZeRO stage, accounting for how each stage partitions these across GPUs.
  • Feasibility check: A stage is explored only if the estimated instantiation memory plus the profiled activation memory fits within the available GPU memory.
  • Progressive exploration: Higher ZeRO stages are explored only if they can improve upon the best result found at lower stages.
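The memory estimation and feasibility check above can be sketched as follows. This is a minimal illustration, not the body of get_instantiation_memory_required_per_gpu(): the function names and the per-parameter byte counts (2 bytes for parameters, 2 for gradients, 12 for optimizer states, a common accounting for mixed-precision Adam) are assumptions made for the example.

```python
def estimate_instantiation_memory_per_gpu(num_params, zero_stage, total_gpus,
                                          bytes_param=2, bytes_grad=2, bytes_optim=12):
    """Estimate bytes per GPU for parameters, gradients, and optimizer states.

    Hypothetical helper: the byte counts are illustrative defaults,
    not values taken from the Autotuner source.
    """
    params_mem = num_params * bytes_param
    grads_mem = num_params * bytes_grad
    optim_mem = num_params * bytes_optim

    # Each ZeRO stage partitions one more component across all GPUs.
    if zero_stage >= 1:   # stage 1: optimizer states are partitioned
        optim_mem /= total_gpus
    if zero_stage >= 2:   # stage 2: gradients are partitioned as well
        grads_mem /= total_gpus
    if zero_stage >= 3:   # stage 3: parameters are partitioned as well
        params_mem /= total_gpus

    return params_mem + grads_mem + optim_mem


def stage_is_feasible(num_params, zero_stage, total_gpus, activation_mem, gpu_mem):
    """A stage is worth exploring only if model state plus activations fit."""
    required = estimate_instantiation_memory_per_gpu(num_params, zero_stage, total_gpus)
    return required + activation_mem <= gpu_mem
```

With these assumed byte counts, a 1-billion-parameter model on 8 GPUs needs about 16 GB per GPU at stage 0 but only about 2 GB at stage 3, which is why higher stages can remain feasible when lower ones are not.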

Phase 3 - Micro-batch size tuning (per ZeRO stage): tune_space() uses a combination of:

  • Binary search to find the maximum runnable micro-batch size.
  • Grid search over a list of candidate micro-batch sizes.
  • Plateau detection (get_plauteu_mbs()) to stop tuning when throughput stops improving.
  • Fine-grained search around the maximum to find the sweet spot where memory pressure does not degrade performance.
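Two of the ingredients above, the binary search for the maximum runnable micro-batch size and plateau detection, can be sketched in isolation. The helper names, the runs_ok callback, and the 5% gain threshold are hypothetical; the sketch also assumes that if a micro-batch size runs, every smaller one runs too.

```python
def find_max_runnable_mbs(runs_ok, low, high):
    """Binary-search the largest micro-batch size in [low, high] that runs.

    runs_ok(mbs) is a hypothetical callback returning True when an experiment
    with that micro-batch size completes (for example, without running out of
    memory). Returns 0 if no size in the range runs.
    """
    best = 0
    while low <= high:
        mid = (low + high) // 2
        if runs_ok(mid):
            best = mid          # mid works, so try larger sizes
            low = mid + 1
        else:
            high = mid - 1      # mid fails, so try smaller sizes
    return best


def plateau_mbs(results, min_gain=0.05):
    """Return the micro-batch size after which throughput stops improving.

    results is a list of (mbs, throughput) pairs sorted by mbs. min_gain is
    an assumed relative-improvement threshold, not a DeepSpeed default.
    """
    best_mbs, best_tput = results[0]
    for mbs, tput in results[1:]:
        if tput < best_tput * (1 + min_gain):
            break               # gain too small: treat as a plateau
        best_mbs, best_tput = mbs, tput
    return best_mbs
```

Binary search keeps the number of trial runs logarithmic in the search range, which matters because each probe is a real training experiment.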

Phase 4 - Configuration parameter tuning: For each viable ZeRO stage, _generate_experiments() creates experiment configurations by combining user-defined parameter ranges (e.g., reduce_bucket_size, allgather_bucket_size) with template configurations per ZeRO stage. Experiments are generated via get_all_configs() and pruned via prune_configs().
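As a rough illustration of how a tuning space expands into experiments, the sketch below takes the Cartesian product of a flat parameter dictionary. The function expand_flat_space is hypothetical and deliberately simpler than get_all_configs(): it handles only a flat dictionary and does no pruning.

```python
import itertools


def expand_flat_space(space):
    """Expand {name: value-or-list} into one dict per parameter combination.

    Hypothetical helper for illustration; scalar values are treated as
    single-element lists so they appear unchanged in every configuration.
    """
    keys = list(space)
    value_lists = [space[k] if isinstance(space[k], list) else [space[k]] for k in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


# Example tuning space for one ZeRO stage (values are illustrative).
space = {
    "stage": 2,
    "reduce_bucket_size": [5e7, 5e8],
    "allgather_bucket_size": [5e7, 5e8, 1e9],
}
configs = expand_flat_space(space)  # 2 * 3 = 6 candidate configurations
```

The number of candidates grows multiplicatively with each tuned parameter, which is the reason a pruning step is needed before experiments are scheduled.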

Phase 5 - Experiment execution: The ResourceManager schedules and runs experiments on available GPU resources. Three tuner strategies are supported:

  • GridSearchTuner: Exhaustive search over all configurations.
  • RandomTuner: Random sampling of the configuration space.
  • ModelBasedTuner: Uses a model to predict performance and guide exploration.
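The difference between the first two strategies can be shown with a toy sketch. The functions below are not the GridSearchTuner or RandomTuner classes; evaluate is a hypothetical callback that returns the metric for a configuration (higher is better), and budget is an assumed cap on the number of experiments.

```python
import random


def grid_search(configs, evaluate):
    """Evaluate every configuration and return the best one."""
    return max(configs, key=evaluate)


def random_search(configs, evaluate, budget, seed=0):
    """Evaluate a random subset of at most `budget` configurations."""
    rng = random.Random(seed)
    sampled = rng.sample(configs, min(budget, len(configs)))
    return max(sampled, key=evaluate)
```

Grid search always finds the best configuration in the space at the cost of running every experiment; random search trades that guarantee for a fixed experiment budget.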

Result management: Records are maintained per tuning space with the best experiment, metric value, and experiment count. The write_optimal_config() method saves the best configuration as a JSON file and command script for reproduction.
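A minimal sketch of selecting the overall best experiment from such records is shown below. It assumes the records layout described on this page (tuning space name mapped to a list of (experiment, metric_val, num_exps) tuples), a higher-is-better metric, and an experiment dictionary with a "ds_config" key; pick_best and the sample data are hypothetical.

```python
import json


def pick_best(records):
    """Return (space_name, experiment, metric_val) with the highest metric.

    Returns None when there are no records.
    """
    best = None
    for space_name, entries in records.items():
        for experiment, metric_val, _num_exps in entries:
            if best is None or metric_val > best[2]:
                best = (space_name, experiment, metric_val)
    return best


# Illustrative records; the space names and metric values are made up.
records = {
    "z1": [({"ds_config": {"zero_optimization": {"stage": 1}}}, 120.0, 5)],
    "z2": [({"ds_config": {"zero_optimization": {"stage": 2}}}, 135.5, 7)],
}
name, exp, metric = pick_best(records)

# Serialize the winning configuration the way a JSON config file would store it.
optimal_json = json.dumps(exp["ds_config"], indent=4)
```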

Usage

The Autotuner is invoked by the DeepSpeed launcher when the autotuning section is present in the DeepSpeed configuration. It requires no code changes from the user beyond providing the configuration.

Code Reference

Source Location

Signature

class Autotuner:
    def __init__(self, args, active_resources):
        """Initialize the Autotuner with user args and available GPU resources."""

    def tune(self):
        """Main tuning entry point: tunes ZeRO stages and micro batch sizes."""

    def tune_space(self, tuning_space, prev_max_mbs=0,
                   prev_best_mbs=0, prev_best_metric_val=0):
        """Tune a specific configuration space (ZeRO stage + parameters)."""

    def model_info_profile_run(self):
        """Profile the model to collect parameter count and activation memory."""

    def write_optimal_config(self):
        """Save the best configuration found to disk."""

    def run_after_tuning(self):
        """Launch training with the discovered optimal configuration."""

    def print_tuning_results(self):
        """Print a tabulated summary of tuning results."""

Import

from deepspeed.autotuning.autotuner import Autotuner

I/O Contract

Inputs

Name | Type | Required | Description
args | object | Yes | Launcher arguments containing user_args, num_nodes, num_gpus, and other CLI parameters.
active_resources | dict | Yes | Dictionary mapping hostnames to GPU slot lists, e.g., {"worker-0": [0, 1, 2, 3]}.
user_config | dict | Derived | DeepSpeed JSON configuration extracted from args.user_args (--deepspeed_config).

Outputs

Name | Type | Description
ds_config_optimal.json | File | The optimal DeepSpeed configuration found by autotuning, saved to results_dir.
cmd_optimal.txt | File | The launch command for the optimal configuration.
summary.txt | File | Tabulated summary of all tuning spaces and their best results.
records | dict | In-memory dictionary mapping tuning space names to lists of (experiment, metric_val, num_exps) tuples.

Usage Examples

# Typically invoked by the DeepSpeed launcher, not directly
from deepspeed.autotuning.autotuner import Autotuner

active_resources = {
    "worker-0": [0, 1, 2, 3, 4, 5, 6, 7],
    "worker-1": [0, 1, 2, 3, 4, 5, 6, 7],
}

# `args` is the parsed launcher argument object (see Inputs); it is not built here
autotuner = Autotuner(args, active_resources)
autotuner.tune()
autotuner.print_tuning_results()
autotuner.write_optimal_config()
autotuner.run_after_tuning()

DeepSpeed config enabling autotuning:
{
    "autotuning": {
        "enabled": true,
        "exps_dir": "./autotuning_exps",
        "results_dir": "./autotuning_results",
        "max_train_batch_size": 2048,
        "max_train_micro_batch_size_per_gpu": 64,
        "num_tuning_micro_batch_sizes": 6
    },
    "zero_optimization": {
        "stage": [0, 1, 2, 3]
    }
}
