Principle:Microsoft DeepSpeedExamples DeepSpeed Training Loop
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | DeepSpeed_Training_Loop |
| Repository | Microsoft/DeepSpeedExamples |
| Domains | Deep_Learning, Training, Performance |
| Status | Active |
| Related Implementation | Implementation:Microsoft_DeepSpeedExamples_Main_Training_Loop_SuperOffload |
Overview
A training pattern that uses the DeepSpeed engine's managed backward pass and optimizer step for memory-efficient distributed training.
Description
The DeepSpeed training loop replaces PyTorch's manual training pattern with the DeepSpeed engine's managed operations. In standard PyTorch, a training step consists of:
# Standard PyTorch training step
optimizer.zero_grad()
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
With DeepSpeed, this becomes:
# DeepSpeed training step
outputs = model_engine(**batch)
loss = outputs.loss
model_engine.backward(loss)
model_engine.step()
The key differences are:
- No explicit optimizer.zero_grad() -- The DeepSpeed engine handles gradient zeroing internally as part of the step() call.
- model_engine.backward(loss) instead of loss.backward() -- The engine manages gradient scaling for mixed precision, gradient accumulation across micro-batches, and gradient reduce-scatter for distributed training.
- model_engine.step() instead of optimizer.step() -- The engine coordinates CPU optimizer updates, parameter gathering/scattering, and learning rate scheduling.
Theoretical Basis
The DeepSpeed engine manages four critical aspects of the training loop transparently:
1. Gradient Accumulation
When gradient_accumulation_steps > 1, the engine accumulates gradients across multiple micro-batches before performing an optimizer step. This increases the effective batch size without increasing peak activation memory, because only one micro-batch is processed at a time:
effective_batch_size = train_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
The engine tracks the micro-batch index internally and only triggers the optimizer step after the final micro-batch in each accumulation cycle.
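The bookkeeping described above can be sketched in plain Python. This is an illustration only: the helper names `effective_batch_size` and `is_accumulation_boundary` are hypothetical and the numbers are made up. The engine does the equivalent tracking internally and exposes the boundary check as `model_engine.is_gradient_accumulation_boundary()`.

```python
def effective_batch_size(micro_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Number of samples that contribute to one optimizer step."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


def is_accumulation_boundary(micro_step: int,
                             gradient_accumulation_steps: int) -> bool:
    """True when the zero-based micro_step is the last one of its cycle."""
    return (micro_step + 1) % gradient_accumulation_steps == 0


# Hypothetical run: 4 samples per GPU, 8 accumulation steps, 2 GPUs.
batch = effective_batch_size(4, 8, 2)

# Zero-based micro-steps (out of the first 16) that would trigger an update.
boundaries = [s for s in range(16) if is_accumulation_boundary(s, 8)]
```

With these values, one optimizer update happens for every 8 calls to model_engine.step(); the other 7 calls only advance the micro-batch counter.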
2. Loss Scaling for Mixed Precision
When training in BF16 (or FP16), the engine manages loss scaling to prevent gradient underflow:
- For BF16: Static loss scaling (BF16 has the same 8-bit exponent range as FP32, so dynamic scaling is unnecessary)
- For FP16: Dynamic loss scaling with automatic scale adjustment
The model_engine.backward(loss) call applies the appropriate scaling before calling loss.backward().
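To make the FP16 case concrete, here is a minimal sketch of dynamic loss scaling. It is a simplified toy, not DeepSpeed's scaler: the class name `ToyDynamicLossScaler`, its parameters, and the default values are assumptions chosen for illustration. The idea is to shrink the scale and skip the step when gradients overflow, and to grow the scale again after a window of clean steps.

```python
import math


class ToyDynamicLossScaler:
    """Simplified dynamic loss scaler (illustrative; not DeepSpeed's class)."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0, scale_window=3):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window  # clean steps required before growing
        self._good_steps = 0

    def scale_loss(self, loss):
        # Multiplying the loss scales every gradient by the same factor.
        return loss * self.scale

    def update(self, grads):
        """Adjust the scale; return True if the optimizer step should run."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Shrink the scale and skip this optimizer step.
            self.scale /= self.scale_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps == self.scale_window:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.scale_factor
            self._good_steps = 0
        return True


scaler = ToyDynamicLossScaler(init_scale=1024.0, scale_window=2)
applied = [
    scaler.update([0.5, float("inf")]),  # overflow: scale halves, step skipped
    scaler.update([0.5, 0.25]),          # first clean step
    scaler.update([0.1, 0.2]),           # second clean step: scale doubles
]
```

A real scaler also divides the gradients by the scale before the optimizer update, which this sketch leaves out.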
3. Distributed Gradient Communication
As part of the backward pass, the engine performs gradient communication across GPUs:
- ZeRO Stage 3: Reduce-scatter operation -- each GPU sends its gradient contributions and receives only the 1/N partition it owns.
- Communication is bucketed (controlled by reduce_bucket_size) for efficiency.
- With CPU offloading, gradients are transferred to CPU for the optimizer step.
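The reduce-scatter pattern can be simulated on plain Python lists to show which values each rank ends up holding. This is a single-process sketch under stated assumptions: the function name `reduce_scatter` is hypothetical, the gradient length is assumed to divide evenly by the number of ranks, and no bucketing or real communication takes place.

```python
def reduce_scatter(per_rank_grads):
    """Simulate reduce-scatter over a list of per-rank gradient lists.

    Every rank contributes a full-length gradient. Rank r receives only the
    summed values of shard r, that is, its 1/N partition.
    """
    world_size = len(per_rank_grads)
    length = len(per_rank_grads[0])
    assert length % world_size == 0, "gradient length must divide evenly"
    shard = length // world_size

    # Element-wise sum across ranks (the "reduce" part).
    summed = [sum(values) for values in zip(*per_rank_grads)]
    # Each rank keeps only its own contiguous slice (the "scatter" part).
    return [summed[r * shard:(r + 1) * shard] for r in range(world_size)]


# Two simulated ranks with four gradient elements each (made-up values).
grads = [
    [1.0, 2.0, 3.0, 4.0],      # rank 0's local gradients
    [10.0, 20.0, 30.0, 40.0],  # rank 1's local gradients
]
shards = reduce_scatter(grads)
```

Here rank 0 ends up with the sums for the first two elements and rank 1 with the sums for the last two. A real implementation may also average the result over the number of ranks.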
4. CPU Offload Coordination
With ZeRO-3 + CPU offloading, the model_engine.step() call coordinates:
- Transfer gradients from GPU to CPU (for the owned 1/N partition)
- Execute DeepSpeedCPUAdam on CPU (using SIMD-optimized kernels)
- Transfer updated parameters back from CPU to GPU
- All-gather updated parameters when needed for the next forward pass
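A configuration that enables the offload path described above might look like the following sketch. The keys follow DeepSpeed's JSON config schema, but every value is a placeholder chosen for illustration, not a setting taken from the example scripts.

```python
import json

# Placeholder values (assumptions); only the key names come from DeepSpeed.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                        # partition params, grads, optimizer state
        "reduce_bucket_size": 500_000_000, # gradient communication bucket size
        "offload_optimizer": {             # run the optimizer step on CPU
            "device": "cpu",
            "pin_memory": True,
        },
    },
}

# Serialize to the JSON form that can be saved as a DeepSpeed config file.
ds_config_json = json.dumps(ds_config, indent=2)
```

The same dictionary can typically be passed to deepspeed.initialize through its config argument instead of being written to disk.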
Performance Metrics
The training loop tracks several performance metrics:
TFLOPS Estimation
For dense (non-MoE) transformer models, TFLOPS is estimated using the formula:
coefficient = 4 if activation_checkpointing else 3
tflops_per_sample = (2 * coefficient * model_size * seq_len
                     + 2 * 2 * coefficient * num_layers * hidden_size * seq_len**2) / 1e12
step_tflops = batch_size * tflops_per_sample / step_time
The coefficient is 4 with activation checkpointing (extra forward pass for recomputation) and 3 without (forward + backward = 3x forward compute).
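As a worked example, the estimate can be wrapped in a small function. The function name `estimate_step_tflops` and the model dimensions below are hypothetical, picked so the arithmetic is easy to follow; model_size is the parameter count and step_time_s is the step time in seconds.

```python
def estimate_step_tflops(model_size, num_layers, hidden_size, seq_len,
                         batch_size, step_time_s,
                         activation_checkpointing=False):
    """Estimate achieved TFLOPS for a dense transformer training step."""
    # 3x forward compute for forward + backward, 4x with recomputation.
    coefficient = 4 if activation_checkpointing else 3
    flops_per_sample = (
        2 * coefficient * model_size * seq_len                           # parameter matmuls
        + 2 * 2 * coefficient * num_layers * hidden_size * seq_len ** 2  # attention
    )
    tflops_per_sample = flops_per_sample / 1e12
    return batch_size * tflops_per_sample / step_time_s


# Made-up model: 1B parameters, 10 layers, hidden size 1000, 1000 tokens.
plain = estimate_step_tflops(10**9, 10, 1000, 1000,
                             batch_size=1, step_time_s=1.0)
checkpointed = estimate_step_tflops(10**9, 10, 1000, 1000,
                                    batch_size=1, step_time_s=1.0,
                                    activation_checkpointing=True)
```

For this made-up model the parameter term is 6e12 FLOPs and the attention term is 1.2e11 FLOPs per sample, so the estimate is about 6.12 TFLOPS without checkpointing and about 8.16 TFLOPS with it.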
Tokens Per Second
tokens_per_second = (batch_size * sequence_length) / step_time
Warmup Steps
The first N steps (warmup_steps) are excluded from performance timing because they include JIT compilation, memory allocation, and cache warming overhead that would skew measurements.
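The warmup exclusion and the tokens-per-second formula can be combined in one small helper. This is a sketch under stated assumptions: `summarize_throughput` is a hypothetical name, and the step timings are invented to show the effect of dropping slow initial steps.

```python
def summarize_throughput(step_times_s, batch_size, sequence_length,
                         warmup_steps):
    """Return (average step time, tokens/s) over the steps after warmup."""
    measured = step_times_s[warmup_steps:]
    if not measured:
        raise ValueError("no steps left to measure after warmup")
    avg_step_time = sum(measured) / len(measured)
    tokens_per_second = (batch_size * sequence_length) / avg_step_time
    return avg_step_time, tokens_per_second


# Invented timings: the first two steps are slow because of one-time setup.
step_times = [4.0, 2.0, 0.5, 0.5, 0.5, 0.5]
avg_time, tok_per_s = summarize_throughput(
    step_times, batch_size=8, sequence_length=1024, warmup_steps=2
)
```

Including the two warmup steps would raise the average step time from 0.5 s to about 1.33 s and understate throughput accordingly.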
Training Loop Pattern
The canonical training loop pattern with DeepSpeed:
model_engine.train()
for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        # Move batch to device
        batch = {k: v.to(model_engine.device) for k, v in batch.items()}
        # Forward pass
        outputs = model_engine(**batch)
        loss = outputs.loss
        # Backward pass (managed by DeepSpeed)
        model_engine.backward(loss)
        # Optimizer step (managed by DeepSpeed)
        model_engine.step()
Logging and Monitoring
The training loop supports two logging mechanisms:
- Console logging -- Periodic log messages (controlled by log_interval) showing loss, step time, TFLOPS, and tokens/second.
- WandB logging -- When enabled, metrics are logged to Weights & Biases for visualization and experiment tracking.
Logged metrics include:
| Metric | Key | Description |
|---|---|---|
| Training loss | train/loss | Average loss over the log interval |
| Epoch | train/epoch | Current epoch number |
| Global step | train/global_step | Total training steps completed |
| Learning rate | train/learning_rate | Current learning rate |
| Step time | perf/step_time_ms | Time per training step in milliseconds |
| Tokens/s | perf/tokens_per_second | Tokens processed per second |
| TFLOPS | perf/tflops | Achieved TFLOPS (dense models only) |
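One way to assemble these metrics is a helper that returns a dictionary keyed by the names in the table. The function `build_log_metrics` and its argument names are hypothetical. The sketch omits perf/tflops when no estimate is available, which would be the case for MoE models.

```python
def build_log_metrics(avg_loss, epoch, global_step, learning_rate,
                      step_time_s, tokens_per_second, tflops=None):
    """Build a metrics dict using the keys listed in the table."""
    metrics = {
        "train/loss": avg_loss,
        "train/epoch": epoch,
        "train/global_step": global_step,
        "train/learning_rate": learning_rate,
        "perf/step_time_ms": step_time_s * 1000.0,  # seconds to milliseconds
        "perf/tokens_per_second": tokens_per_second,
    }
    if tflops is not None:
        # Only dense models have a TFLOPS estimate.
        metrics["perf/tflops"] = tflops
    return metrics


# Example values are invented for illustration.
dense_metrics = build_log_metrics(2.31, 0, 100, 1e-4, 0.5, 16384.0, tflops=6.12)
moe_metrics = build_log_metrics(2.31, 0, 100, 1e-4, 0.5, 16384.0)
```

A dictionary of this shape can be handed to wandb.log(metrics, step=global_step) when WandB logging is enabled, or formatted into a console message.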
Benchmarking Mode
The training loop supports a benchmarking mode controlled by bench_steps. When set, training stops after the specified number of steps regardless of epochs. This is useful for performance measurement without running full training.
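The early-exit control flow can be sketched with a stand-in engine that only counts calls. `StubEngine` and `run_training` are hypothetical names invented for this illustration. A real run would use the engine returned by deepspeed.initialize, and the actual script may structure the check differently.

```python
class StubEngine:
    """Stand-in for a DeepSpeed engine that only counts calls (hypothetical)."""

    def __init__(self):
        self.backward_calls = 0
        self.step_calls = 0

    def __call__(self, batch):
        # Pretend the loss is the mean of the batch values.
        return sum(batch) / len(batch)

    def backward(self, loss):
        self.backward_calls += 1

    def step(self):
        self.step_calls += 1


def run_training(engine, dataloader, num_epochs, bench_steps=None):
    """Run the loop, stopping early once bench_steps steps have completed."""
    global_step = 0
    for epoch in range(num_epochs):
        for batch in dataloader:
            loss = engine(batch)
            engine.backward(loss)
            engine.step()
            global_step += 1
            # Benchmark mode: stop regardless of remaining epochs.
            if bench_steps is not None and global_step >= bench_steps:
                return global_step
    return global_step


engine = StubEngine()
batches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
steps_run = run_training(engine, batches, num_epochs=10, bench_steps=5)
```

With three batches per epoch and bench_steps=5, the loop stops partway through the second epoch instead of running all ten.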