Implementation:FMInference FlexLLMGen DeepSpeed Autotuner
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Hyperparameter Tuning, Distributed Training, Automation |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
The DeepSpeed Autotuner class automatically discovers optimal training configurations by systematically exploring ZeRO optimization stages, micro-batch sizes, and other DeepSpeed parameters through resource-managed experiment execution.
Description
The Autotuner class implements a multi-phase automated tuning system that finds the best DeepSpeed configuration for a given model and hardware setup. The tuning process proceeds as follows:
Phase 1 - Model profiling: A profiling run (model_info_profile_run()) executes the model with a minimal-memory configuration to collect metadata: the total number of parameters, the number of trainable parameters, and the activation memory per GPU per micro-batch. This information drives the memory-aware decisions in the later phases.
Phase 2 - ZeRO stage exploration: The tuner iterates through ZeRO stages 0, 1, 2, and 3, checking memory feasibility at each stage:
- Memory estimation: get_instantiation_memory_required_per_gpu() calculates the memory needed for parameters, gradients, and optimizer states at each ZeRO stage, accounting for how each stage partitions these across GPUs.
- Feasibility check: A stage is explored only if its estimated instantiation memory plus the profiled activation memory fits within the memory of a single GPU.
- Progressive exploration: Higher ZeRO stages are explored only if they can improve upon the best result found at lower stages.
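The memory-feasibility reasoning above can be sketched as follows. This is a minimal illustration, not the code in autotuner.py: the function names, the per-parameter byte counts (2 bytes for fp16 parameters, 2 for fp16 gradients, 12 for mixed-precision Adam optimizer states), and the example sizes are all assumptions.

```python
def estimate_instantiation_memory(num_params, zero_stage, total_gpus):
    """Rough per-GPU bytes for parameters, gradients, and optimizer states.

    Assumed byte counts (mixed-precision Adam): 2 bytes per fp16 parameter,
    2 per fp16 gradient, and 12 per parameter for optimizer states.
    """
    params_mem = 2 * num_params
    grads_mem = 2 * num_params
    optim_mem = 12 * num_params

    # Each ZeRO stage partitions one more component across the GPUs.
    if zero_stage >= 1:
        optim_mem /= total_gpus
    if zero_stage >= 2:
        grads_mem /= total_gpus
    if zero_stage >= 3:
        params_mem /= total_gpus
    return params_mem + grads_mem + optim_mem


def stage_is_feasible(num_params, zero_stage, total_gpus,
                      activation_mem, gpu_mem):
    """A stage is explored only if the estimate plus activations fits."""
    required = estimate_instantiation_memory(num_params, zero_stage, total_gpus)
    return required + activation_mem <= gpu_mem


# Hypothetical setup: 1B parameters, 8 GPUs with 16 GiB each, 2 GiB activations.
GIB = 1024 ** 3
feasible = [stage for stage in range(4)
            if stage_is_feasible(1_000_000_000, stage, 8, 2 * GIB, 16 * GIB)]
```

Under these assumed numbers, stage 0 does not fit while stages 1 to 3 do, which is the kind of outcome that lets the tuner skip infeasible stages without launching any experiment.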
Phase 3 - Micro-batch size tuning (per ZeRO stage): tune_space() uses a combination of:
- Binary search to find the maximum runnable micro-batch size.
- Grid search over a list of candidate micro-batch sizes.
- Plateau detection (get_plauteu_mbs()) to stop tuning when throughput stops improving.
- Fine-grained search around the maximum to find the sweet spot where memory pressure does not degrade performance.
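Two of the search steps listed above, the binary search for the largest runnable micro-batch size and the plateau check, might look roughly like this. It is a sketch under stated assumptions: `can_run` is a hypothetical predicate standing in for launching an experiment and checking that it does not run out of memory, and the `min_gain` threshold is an illustrative choice, not a value taken from the source.

```python
def find_max_runnable_mbs(low, high, can_run):
    """Binary search for the largest micro-batch size in [low, high] that runs.

    Assumes runnability is monotonic: if a size fails, every larger size fails.
    Returns 0 when no size in the range is runnable.
    """
    best = 0
    while low <= high:
        mid = (low + high) // 2
        if can_run(mid):
            best = mid        # mid works; try larger sizes
            low = mid + 1
        else:
            high = mid - 1    # mid fails; try smaller sizes
    return best


def plateau_mbs(results, min_gain=0.05):
    """Return the micro-batch size after which throughput stops improving.

    `results` is a list of (micro_batch_size, throughput) pairs sorted by
    micro-batch size. Scanning stops at the first candidate whose throughput
    is less than `min_gain` (relative) better than the best seen so far.
    """
    best_mbs, best_val = results[0]
    for mbs, val in results[1:]:
        if val < best_val * (1 + min_gain):
            break
        best_mbs, best_val = mbs, val
    return best_mbs


# Hypothetical device limit: anything above 37 samples per GPU fails.
max_mbs = find_max_runnable_mbs(1, 64, lambda mbs: mbs <= 37)

# Hypothetical throughput measurements (samples/s) per micro-batch size.
chosen = plateau_mbs([(1, 100), (2, 190), (4, 350), (8, 360), (16, 500)])
```

Stopping at the first plateau trades a possibly better later candidate for fewer experiments, which matters when each experiment is a full training launch.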
Phase 4 - Configuration parameter tuning: For each viable ZeRO stage, _generate_experiments() creates experiment configurations by combining user-defined parameter ranges (e.g., reduce_bucket_size, allgather_bucket_size) with template configurations per ZeRO stage. Experiments are generated via get_all_configs() and pruned via prune_configs().
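Combining parameter ranges with a template amounts to a Cartesian product over the list-valued entries. The sketch below illustrates that idea on a flat dictionary; `expand_configs` is a hypothetical helper, not the get_all_configs() from the source, and the pruning step is omitted.

```python
from itertools import product


def expand_configs(template):
    """Return one flat config per combination of the list-valued entries."""
    keys = list(template)
    # Treat scalars as single-choice lists so product() handles them uniformly.
    choices = [value if isinstance(value, list) else [value]
               for value in template.values()]
    return [dict(zip(keys, combo)) for combo in product(*choices)]


# Hypothetical ZeRO stage 2 template with two tunable bucket sizes.
template = {
    "stage": 2,
    "reduce_bucket_size": [50_000_000, 500_000_000],
    "allgather_bucket_size": [50_000_000, 500_000_000],
}
experiments = expand_configs(template)  # 2 x 2 = 4 candidate configurations
```

The number of experiments grows multiplicatively with each tuned parameter, which is why a pruning step and non-exhaustive tuner strategies are useful.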
Phase 5 - Experiment execution: The ResourceManager schedules and runs experiments on available GPU resources. Three tuner strategies are supported:
- GridSearchTuner: Exhaustive search over all configurations.
- RandomTuner: Random sampling of the configuration space.
- ModelBasedTuner: Uses a model to predict performance and guide exploration.
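The difference between the exhaustive and random strategies can be shown with a toy example. The functions below are illustrative stand-ins, not the GridSearchTuner or RandomTuner classes; `toy_throughput` is a made-up metric, and the model-based strategy is left out because the source does not describe its model.

```python
import random


def grid_search(configs, evaluate):
    """Evaluate every configuration and return the best one."""
    return max(configs, key=evaluate)


def random_search(configs, evaluate, num_trials, seed=0):
    """Evaluate a random subset of configurations and return its best member."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sampled = rng.sample(configs, min(num_trials, len(configs)))
    return max(sampled, key=evaluate)


def toy_throughput(cfg):
    """Made-up metric that peaks at a micro-batch size of 16."""
    return -abs(cfg["mbs"] - 16)


candidates = [{"mbs": m} for m in (1, 2, 4, 8, 16, 32)]
best_grid = grid_search(candidates, toy_throughput)
best_random = random_search(candidates, toy_throughput, num_trials=3)
```

Grid search always finds the best candidate at the cost of running all of them; random search runs a fixed budget and may settle for a slightly worse configuration.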
Result management: Records are maintained per tuning space with the best experiment, metric value, and experiment count. The write_optimal_config() method saves the best configuration as a JSON file and command script for reproduction.
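A simplified view of selecting and saving the best result is sketched below. It assumes that a higher metric value is better and that each experiment is a dictionary carrying its configuration under a "ds_config" key; both are assumptions made for illustration, as are the helper names and the sample records.

```python
import json
import os
import tempfile


def best_record(records):
    """Return (space_name, experiment, metric_val) with the highest metric."""
    best = None
    for space_name, entries in records.items():
        for experiment, metric_val, _num_exps in entries:
            if best is None or metric_val > best[2]:
                best = (space_name, experiment, metric_val)
    return best


def write_best_config(records, results_dir):
    """Write the winning configuration as JSON and return the file path."""
    _space, experiment, _metric = best_record(records)
    path = os.path.join(results_dir, "ds_config_optimal.json")
    with open(path, "w") as f:
        json.dump(experiment["ds_config"], f, indent=2)
    return path


# Hypothetical records: tuning space name -> [(experiment, metric_val, num_exps)].
records = {
    "z0": [({"ds_config": {"zero_optimization": {"stage": 0}}}, 95.0, 4)],
    "z2": [({"ds_config": {"zero_optimization": {"stage": 2}}}, 120.5, 7)],
}
results_dir = tempfile.mkdtemp()
saved_path = write_best_config(records, results_dir)
```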
Usage
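The Autotuner is invoked by the DeepSpeed launcher rather than from user code: autotuning is requested through the launcher's --autotuning flag (tune or run) and configured through the autotuning section of the DeepSpeed configuration. It requires no changes to the training code beyond providing that configuration.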
Code Reference
Source Location
- Repository: FMInference_FlexLLMGen
- File: benchmark/third_party/DeepSpeed/deepspeed/autotuning/autotuner.py
- Lines: 1-1153
Signature
class Autotuner:
    def __init__(self, args, active_resources):
        """Initialize the Autotuner with user args and available GPU resources."""

    def tune(self):
        """Main tuning entry point: tunes ZeRO stages and micro batch sizes."""

    def tune_space(self, tuning_space, prev_max_mbs=0,
                   prev_best_mbs=0, prev_best_metric_val=0):
        """Tune a specific configuration space (ZeRO stage + parameters)."""

    def model_info_profile_run(self):
        """Profile the model to collect parameter count and activation memory."""

    def write_optimal_config(self):
        """Save the best configuration found to disk."""

    def run_after_tuning(self):
        """Launch training with the discovered optimal configuration."""

    def print_tuning_results(self):
        """Print a tabulated summary of tuning results."""
Import
from deepspeed.autotuning.autotuner import Autotuner
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | object | Yes | Launcher arguments containing user_args, num_nodes, num_gpus, and other CLI parameters. |
| active_resources | dict | Yes | Dictionary mapping hostnames to GPU slot lists, e.g., {"worker-0": [0, 1, 2, 3]}. |
| user_config | dict | Derived | DeepSpeed JSON configuration extracted from args.user_args (--deepspeed_config). |
Outputs
| Name | Type | Description |
|---|---|---|
| ds_config_optimal.json | File | The optimal DeepSpeed configuration found by autotuning, saved to results_dir. |
| cmd_optimal.txt | File | The launch command for the optimal configuration. |
| summary.txt | File | Tabulated summary of all tuning spaces and their best results. |
| records | dict | In-memory dictionary mapping tuning space names to lists of (experiment, metric_val, num_exps) tuples. |
Usage Examples
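# Typically invoked by the DeepSpeed launcher, not directly.
from deepspeed.autotuning.autotuner import Autotuner

# `args` is the launcher's parsed argument object (user_args, num_nodes,
# num_gpus, ...); it is supplied by the launcher and not constructed here.
active_resources = {
    "worker-0": [0, 1, 2, 3, 4, 5, 6, 7],
    "worker-1": [0, 1, 2, 3, 4, 5, 6, 7],
}

autotuner = Autotuner(args, active_resources)
autotuner.tune()                  # explore ZeRO stages and micro-batch sizes
autotuner.print_tuning_results()  # print the tabulated summary
autotuner.write_optimal_config()  # save the best configuration to disk
autotuner.run_after_tuning()      # launch training with that configuration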
// DeepSpeed config enabling autotuning
{
  "autotuning": {
    "enabled": true,
    "exps_dir": "./autotuning_exps",
    "results_dir": "./autotuning_results",
    "max_train_batch_size": 2048,
    "max_train_micro_batch_size_per_gpu": 64,
    "num_tuning_micro_batch_sizes": 6
  },
  "zero_optimization": {
    "stage": [0, 1, 2, 3]
  }
}