Principle:FMInference FlexLLMGen DeepSpeed Configuration Autotuning
| Knowledge Sources | |
|---|---|
| Domains | Hyperparameter Tuning, Distributed Training, Systems Optimization, Automation |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Automated configuration tuning for distributed training systematically searches through memory optimization stages, batch sizes, and system parameters using memory-aware feasibility analysis and progressive refinement to find the highest-throughput setup.
Description
Distributed deep learning training involves a large configuration space that interacts with both model characteristics and hardware constraints. Key tunable parameters include:
ZeRO optimization stages partition different components of the training state across GPUs:
- Stage 0: No partitioning (baseline).
- Stage 1: Optimizer states partitioned across data-parallel ranks.
- Stage 2: Gradients also partitioned.
- Stage 3: Model parameters also partitioned.
Each higher stage reduces per-GPU memory but increases communication volume. The optimal stage depends on model size, GPU memory, and interconnect bandwidth.
Micro-batch size determines the amount of work per GPU per step. Larger micro-batches improve GPU utilization (better GEMM efficiency) but consume more activation memory. The best micro-batch size is usually the largest one that fits in GPU memory without out-of-memory errors. The exception is when throughput has already plateaued at a smaller size (for example, once communication is hidden behind computation), in which case the smaller size delivers the same throughput with more memory headroom.
Communication parameters such as reduce_bucket_size and allgather_bucket_size control the granularity of collective operations. Larger buckets amortize launch overhead but may delay communication-computation overlap.
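As a rough illustration of where these knobs live, the sketch below writes them as a Python dictionary in the shape of a DeepSpeed JSON config. The key names (train_micro_batch_size_per_gpu, zero_optimization.stage, reduce_bucket_size, allgather_bucket_size) are DeepSpeed configuration fields; the numeric values and the with_overrides helper are assumptions made for this example, not recommended settings.

```python
import json

# Placeholder values only: a tuner would vary these, they are not recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 2,
        "reduce_bucket_size": 500_000_000,
        "allgather_bucket_size": 500_000_000,
    },
}


def with_overrides(base, stage, micro_batch):
    """Hypothetical helper: copy `base` and replace the ZeRO stage and micro-batch size."""
    cfg = json.loads(json.dumps(base))  # deep copy via a JSON round trip
    cfg["zero_optimization"]["stage"] = stage
    cfg["train_micro_batch_size_per_gpu"] = micro_batch
    return cfg


candidate = with_overrides(ds_config, stage=3, micro_batch=8)
print(json.dumps(candidate, indent=2))
```

Each candidate produced this way is one point in the search space that the tuner can filter, run, and score.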
The autotuning strategy combines several search techniques:
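- Memory-based feasibility filtering: Before running any experiment, the tuner estimates per-GPU memory analytically: mem = params / f_p + gradients / f_g + optimizer_state / f_o + activation_memory, where each partition factor f equals the number of data-parallel ranks if the ZeRO stage partitions that component and 1 otherwise. Configurations whose estimate exceeds available GPU memory are pruned.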
- Binary search for maximum micro-batch size: Rather than trying all possible sizes, a binary search between the minimum runnable size and the analytically estimated maximum efficiently finds the boundary.
- Plateau detection: If increasing the micro-batch size yields less than a threshold improvement (e.g., 1%) in the target metric, the search terminates early, as further increases trade memory for negligible throughput gains.
- Progressive stage exploration: Higher ZeRO stages are only explored if they can potentially improve upon the best result from lower stages. If stage 1 already provides the optimal micro-batch size and throughput, stage 2 exploration is skipped.
- Combinatorial parameter search: Within a ZeRO stage, multiple configuration parameters are varied in a grid. Pruning rules eliminate known-bad combinations before execution.
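The binary search and plateau detection steps in the list above can be sketched as follows. This is a minimal illustration, not the tuner's actual implementation: run_trial, fake_trial, and the throughput numbers are hypothetical, and the binary search assumes feasibility is monotone (if one micro-batch size runs out of memory, every larger size does too).

```python
def find_max_micro_batch(run_trial, lo, hi):
    """Binary search for the largest micro-batch size in [lo, hi] that runs.

    run_trial(mbs) returns a throughput, or None on out-of-memory.
    Returns None if even `lo` does not fit.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if run_trial(mid) is not None:
            best = mid      # mid fits, look for something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid is too large
    return best


def sweep_until_plateau(run_trial, sizes, min_gain=0.01):
    """Try increasing sizes; stop once the relative gain drops below min_gain."""
    best_size, best_tput = None, 0.0
    for mbs in sizes:
        tput = run_trial(mbs)
        if tput is None:
            break  # out of memory
        if best_size is not None and tput < best_tput * (1 + min_gain):
            break  # plateau: the extra memory buys too little throughput
        best_size, best_tput = mbs, tput
    return best_size, best_tput


def fake_trial(mbs, oom_above=24):
    """Toy stand-in for a real training run (made-up saturating curve)."""
    if mbs > oom_above:
        return None
    return 100.0 * mbs / (mbs + 4.0)


max_mbs = find_max_micro_batch(fake_trial, lo=1, hi=64)
chosen = sweep_until_plateau(fake_trial, [1, 2, 4, 8, 16, max_mbs], min_gain=0.10)
print(max_mbs, chosen)
```

In this toy setup the binary search needs six trials to locate the memory boundary instead of up to 64, and the plateau check then settles on a smaller size because the last step adds less than the chosen 10% threshold.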
Usage
Apply this principle when designing automated tuning systems for distributed training where the configuration space is large, experiments are expensive, and memory constraints create hard feasibility boundaries.
Theoretical Basis
Memory model for ZeRO stages: For a model with P parameters trained with Adam in mixed precision (FP16 weights and gradients, FP32 optimizer state), memory in bytes per GPU is:
- Stage 0: 2P (FP16 params) + 2P (FP16 gradients) + 12P (optimizer: FP32 master params + FP32 momentum + FP32 variance), for a total of 16P
- Stage 1: optimizer memory divided by N (number of GPUs)
- Stage 2: gradient memory also divided by N
- Stage 3: parameter memory also divided by N
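The per-stage accounting above can be written as a small function. This is a sketch under the commonly cited assumption of 2 + 2 + 12 bytes per parameter for mixed-precision Adam (FP16 weights, FP16 gradients, and three FP32 optimizer tensors); the function name is hypothetical, and activations, buffers, and fragmentation are ignored.

```python
def zero_state_bytes_per_gpu(num_params, stage, num_gpus):
    """Model-state bytes per GPU for mixed-precision Adam under a ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 master weights + momentum + variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 partitions optimizer state
    if stage >= 2:
        grads /= num_gpus     # stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also partitions parameters
    return params + grads + optim


P = 1_000_000_000  # example: 1B parameters on 8 GPUs
for stage in range(4):
    gb = zero_state_bytes_per_gpu(P, stage, num_gpus=8) / 1e9
    print(f"stage {stage}: {gb:.2f} GB")
```

A tuner can compare this estimate, plus activation memory, against the device capacity to discard infeasible stages before launching any run.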
Activation memory scales linearly with micro-batch size and depends on model architecture (hidden size, sequence length, number of layers, attention heads). A profiling run measures activation memory for micro-batch size 1, and the tuner scales linearly.
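Under the linear-scaling assumption, one profiled measurement is enough to bound the micro-batch size analytically. The sketch below is illustrative only: the function names and the 40 GiB / 16 GiB / 1.5 GiB figures are made up for the example, and real runs need headroom for temporary buffers and fragmentation.

```python
GIB = 2**30


def activation_bytes(act_bytes_at_mbs1, micro_batch):
    """Linear extrapolation from a profile taken at micro-batch size 1."""
    return act_bytes_at_mbs1 * micro_batch


def estimate_max_micro_batch(gpu_bytes, state_bytes, act_bytes_at_mbs1):
    """Largest micro-batch whose extrapolated activations fit in the free memory."""
    free = gpu_bytes - state_bytes
    if free <= 0:
        return 0  # model states alone do not fit
    return free // act_bytes_at_mbs1


upper = estimate_max_micro_batch(
    gpu_bytes=40 * GIB,
    state_bytes=16 * GIB,
    act_bytes_at_mbs1=3 * GIB // 2,  # 1.5 GiB measured at micro-batch size 1
)
print(upper, activation_bytes(3 * GIB // 2, upper) / GIB)
```

The result serves as the upper bound for the micro-batch search rather than as a final answer, since the linear model is only an approximation.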
Throughput vs. batch size curve: Training throughput (samples/second) typically follows a concave curve as micro-batch size increases: initially rising steeply as GPU utilization improves, then flattening as compute becomes the bottleneck, and eventually declining if memory pressure causes swapping or reduced occupancy.
Grid search vs. model-based tuning: Grid search guarantees finding the optimum within the search grid but scales exponentially with the number of parameters. Model-based tuning (e.g., Bayesian optimization) uses a surrogate model to predict performance and guide the search toward promising regions, reducing the number of required experiments at the cost of potential suboptimality.
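A small grid search with pruning shows both the exhaustive guarantee and the combinatorial growth. This is a sketch: reduce_bucket_size and overlap_comm are DeepSpeed configuration fields, but the pruning rule, the toy_measure scores, and the helper names are hypothetical stand-ins for real experiments.

```python
from itertools import product


def grid_candidates(space, is_known_bad):
    """Yield every combination in `space` that the pruning rule does not reject."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        if not is_known_bad(cfg):
            yield cfg


def grid_search(space, is_known_bad, measure):
    """Evaluate all surviving candidates and return the best (config, score)."""
    best_cfg, best_score = None, float("-inf")
    for cfg in grid_candidates(space, is_known_bad):
        score = measure(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


space = {
    "reduce_bucket_size": [10**7, 10**8, 10**9],
    "overlap_comm": [True, False],
}


def is_known_bad(cfg):
    # Hypothetical pruning rule, for illustration only.
    return cfg["reduce_bucket_size"] == 10**9 and not cfg["overlap_comm"]


def toy_measure(cfg):
    # Made-up throughput scores standing in for a real training run.
    base = {10**7: 1.0, 10**8: 3.0, 10**9: 2.0}[cfg["reduce_bucket_size"]]
    return base + (0.5 if cfg["overlap_comm"] else 0.0)


survivors = list(grid_candidates(space, is_known_bad))
best_cfg, best_score = grid_search(space, is_known_bad, toy_measure)
print(len(survivors), best_cfg, best_score)
```

Adding a third parameter with k values multiplies the number of runs by k, which is the point at which a surrogate-guided search becomes attractive.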
Resource management coordinates experiment scheduling across multiple nodes and GPUs, enabling parallel execution of independent experiments to reduce total tuning time.
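As a rough sketch of parallel experiment scheduling, independent experiments can be dispatched to a fixed number of resource slots. The example uses a thread pool from the Python standard library as a stand-in for a real multi-node launcher; run_one, the slot count, and the returned scores are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiments(configs, run_one, num_slots):
    """Run independent experiments with at most `num_slots` in flight at once."""
    with ThreadPoolExecutor(max_workers=num_slots) as pool:
        results = list(pool.map(run_one, configs))  # map preserves input order
    return list(zip(configs, results))


def run_one(cfg):
    # Hypothetical experiment: a real tuner would launch a training job here.
    return cfg["micro_batch"] * 10.0


configs = [{"micro_batch": m} for m in (1, 2, 4, 8)]
outcomes = run_experiments(configs, run_one, num_slots=2)
best_cfg, best_score = max(outcomes, key=lambda pair: pair[1])
print(best_cfg, best_score)
```

With two slots, four independent experiments finish in roughly the time of two sequential ones, provided each experiment only uses the resources of its own slot.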