Principle:Microsoft DeepSpeedExamples DeepSpeed Training Loop


Metadata

Field                   Value
Page Type               Principle
Title                   DeepSpeed_Training_Loop
Repository              Microsoft/DeepSpeedExamples
Domains                 Deep_Learning, Training, Performance
Status                  Active
Related Implementation  Implementation:Microsoft_DeepSpeedExamples_Main_Training_Loop_SuperOffload

Overview

A training pattern that uses the DeepSpeed engine's managed backward pass and optimizer step for memory-efficient distributed training.

Description

The DeepSpeed training loop replaces PyTorch's manual training pattern with the DeepSpeed engine's managed operations. In standard PyTorch, a training step consists of:

# Standard PyTorch training step
optimizer.zero_grad()
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()

With DeepSpeed, this becomes:

# DeepSpeed training step
outputs = model_engine(**batch)
loss = outputs.loss
model_engine.backward(loss)
model_engine.step()

The key differences are:

  • No explicit optimizer.zero_grad() -- The DeepSpeed engine handles gradient zeroing internally as part of the step() call.
  • model_engine.backward(loss) instead of loss.backward() -- The engine manages gradient scaling for mixed precision, gradient accumulation across micro-batches, and gradient reduce-scatter for distributed training.
  • model_engine.step() instead of optimizer.step() -- The engine coordinates CPU optimizer updates, parameter gathering/scattering, and learning rate scheduling.

Theoretical Basis

The DeepSpeed engine manages four critical aspects of the training loop transparently:

1. Gradient Accumulation

When gradient_accumulation_steps > 1, the engine accumulates gradients across multiple micro-batches before performing an optimizer step. This increases the effective batch size while keeping peak activation memory at the level of a single micro-batch:

effective_batch_size = train_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus

The engine tracks the micro-batch index internally and only triggers the optimizer step after the final micro-batch in each accumulation cycle.
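The arithmetic above can be illustrated with a small pure-Python sketch. The function names here are hypothetical and are not DeepSpeed APIs; the boundary rule (step after every gradient_accumulation_steps-th micro-batch) is a simplified model of the behaviour described above.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus):
    # Mirrors the formula above: samples contributing to one optimizer step.
    return micro_batch_size * gradient_accumulation_steps * num_gpus


def optimizer_step_indices(num_micro_batches, gradient_accumulation_steps):
    # Hypothetical helper: 0-based micro-batch indices at which an
    # accumulation cycle ends and an optimizer step would fire.
    return [
        i for i in range(num_micro_batches)
        if (i + 1) % gradient_accumulation_steps == 0
    ]


print(effective_batch_size(4, 8, 2))   # 4 * 8 * 2 = 64
print(optimizer_step_indices(8, 4))    # [3, 7]
```

With 8 micro-batches and an accumulation length of 4, only the 4th and 8th micro-batches end with a parameter update; the others only add to the accumulated gradients.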

2. Loss Scaling for Mixed Precision

When training in BF16 (or FP16), the engine manages loss scaling to prevent gradient underflow:

  • For BF16: No dynamic scaling; the loss scale is effectively fixed at 1, because BF16 has the same exponent range as FP32 and gradient underflow is far less likely
  • For FP16: Dynamic loss scaling with automatic scale adjustment

The model_engine.backward(loss) call applies the appropriate scaling before calling loss.backward().
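To make the FP16 case concrete, here is a minimal sketch of a dynamic loss scaler. It is a simplified illustration, not DeepSpeed's implementation: the class name, parameter names, and default values are assumptions, and real scalers may add further safeguards.

```python
class ToyDynamicLossScaler:
    """Hypothetical, simplified dynamic loss scaler (not the DeepSpeed class)."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0,
                 scale_window=1000, min_scale=1.0):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window  # overflow-free steps before growing
        self.min_scale = min_scale
        self.good_steps = 0

    def scale_loss(self, loss):
        # The scaled loss is what gets backpropagated; gradients are
        # divided by the same scale before the optimizer update.
        return loss * self.scale

    def update(self, overflow):
        if overflow:
            # Inf/NaN gradients: shrink the scale and restart the window.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                self.scale *= self.scale_factor


scaler = ToyDynamicLossScaler(init_scale=8.0, scale_window=2)
scaler.update(overflow=True)    # scale drops to 4.0
scaler.update(overflow=False)
scaler.update(overflow=False)   # two clean steps: scale grows back to 8.0
print(scaler.scale)
```

The design choice is the usual one for FP16: back off quickly on overflow, grow slowly after a run of clean steps.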

3. Distributed Gradient Communication

After the backward pass, the engine performs gradient communication across GPUs:

  • ZeRO Stage 3: Reduce-scatter operation -- each GPU sends its gradient contributions and receives only the 1/N partition it owns.
  • Communication is bucketed (controlled by reduce_bucket_size) for efficiency.
  • With CPU offloading, gradients are transferred to CPU for the optimizer step.
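The reduce-scatter step can be pictured with a single-process simulation. This sketch uses plain Python lists instead of a real communication backend, and the function name is hypothetical. It sums contributions; whether gradients are summed or averaged in practice depends on configuration, so treat the summation as an assumption.

```python
def simulate_reduce_scatter(per_rank_grads):
    """Hypothetical single-process model of reduce-scatter.

    per_rank_grads[r] is rank r's full local gradient (a list of floats).
    Returns shards, where shards[r] is the reduced 1/N partition owned by rank r.
    """
    world_size = len(per_rank_grads)
    numel = len(per_rank_grads[0])
    assert numel % world_size == 0, "sketch assumes an evenly divisible gradient"
    shard_len = numel // world_size

    # Reduce: element-wise sum across ranks.
    reduced = [sum(g[i] for g in per_rank_grads) for i in range(numel)]

    # Scatter: each rank keeps only the slice it owns.
    return [reduced[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


grads = [
    [1.0, 2.0, 3.0, 4.0],      # rank 0's local gradient
    [10.0, 20.0, 30.0, 40.0],  # rank 1's local gradient
]
print(simulate_reduce_scatter(grads))  # [[11.0, 22.0], [33.0, 44.0]]
```

After the operation each rank holds only half of the reduced gradient, which is why the optimizer state and update can also be partitioned.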

4. CPU Offload Coordination

With ZeRO-3 + CPU offloading, the model_engine.step() call coordinates:

  1. Transfer gradients from GPU to CPU (for the owned 1/N partition)
  2. Execute DeepSpeedCPUAdam on CPU (using SIMD-optimized kernels)
  3. Transfer updated parameters back from CPU to GPU
  4. All-gather updated parameters when needed for the next forward pass
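A configuration that enables this path might look like the following sketch, written as a Python dict of the kind that can be passed as the config argument of deepspeed.initialize. The keys are standard DeepSpeed configuration options, but every value here is an illustrative assumption rather than a setting taken from the repository.

```python
# Illustrative values only; tune for the actual model and hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                         # partition params, grads, optimizer state
        "reduce_bucket_size": 500_000_000,  # elements per gradient-reduction bucket
        "offload_optimizer": {
            "device": "cpu",                # run the optimizer step on the CPU
            "pin_memory": True,             # pinned host memory for faster transfers
        },
    },
}

print(ds_config["zero_optimization"]["offload_optimizer"]["device"])  # cpu
```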

Performance Metrics

The training loop tracks several performance metrics:

TFLOPS Estimation

For dense (non-MoE) transformer models, TFLOPS is estimated using the formula:

coefficient = 4 if activation_checkpointing else 3
tflops_per_sample = (2 * coefficient * model_size * seq_len
                     + 2 * 2 * coefficient * num_layers * hidden_size * seq_len**2) / 1e12
step_tflops = batch_size * tflops_per_sample / step_time

The coefficient is 4 with activation checkpointing (one extra forward pass for recomputation) and 3 without (the backward pass costs roughly twice the forward pass, so forward + backward is about 3x forward compute).
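The estimate can be wrapped in two small functions. This is a sketch that transcribes the formula above; the function names and the example model dimensions are assumptions chosen for easy arithmetic, not values from the repository.

```python
def estimate_tflops_per_sample(model_size, seq_len, num_layers, hidden_size,
                               activation_checkpointing=False):
    # model_size is the parameter count; the result is in units of 1e12 FLOPs.
    coefficient = 4 if activation_checkpointing else 3
    return (2 * coefficient * model_size * seq_len
            + 2 * 2 * coefficient * num_layers * hidden_size * seq_len ** 2) / 1e12


def step_tflops(batch_size, tflops_per_sample, step_time_s):
    # Achieved throughput for one step, in TFLOPS.
    return batch_size * tflops_per_sample / step_time_s


# Hypothetical model: 1e9 parameters, 10 layers, hidden size 1000, 1000 tokens.
per_sample = estimate_tflops_per_sample(1_000_000_000, 1000, 10, 1000)
print(per_sample)                        # parameter term 6.0 + attention term 0.12
print(step_tflops(8, per_sample, 2.0))
```

Turning on activation_checkpointing scales both terms by 4/3, reflecting the extra forward pass.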

Tokens Per Second

tokens_per_second = (batch_size * sequence_length) / step_time

Warmup Steps

The first N steps (warmup_steps) are excluded from performance timing because they include JIT compilation, memory allocation, and cache warming overhead that would skew measurements.
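One way to exclude warmup steps from throughput numbers is sketched below. The ThroughputMeter class is hypothetical and only illustrates the bookkeeping; it is not taken from the training script.

```python
class ThroughputMeter:
    """Hypothetical helper: accumulates timing only after warmup_steps."""

    def __init__(self, warmup_steps):
        self.warmup_steps = warmup_steps
        self.steps_seen = 0
        self.total_time_s = 0.0
        self.total_tokens = 0

    def record(self, step_time_s, batch_size, sequence_length):
        self.steps_seen += 1
        if self.steps_seen <= self.warmup_steps:
            return  # warmup step: ignore its (typically inflated) time
        self.total_time_s += step_time_s
        self.total_tokens += batch_size * sequence_length

    def tokens_per_second(self):
        if self.total_time_s == 0.0:
            return 0.0
        return self.total_tokens / self.total_time_s


meter = ThroughputMeter(warmup_steps=2)
for step_time in (10.0, 9.0, 0.5, 0.5):   # first two steps are slow warmup
    meter.record(step_time, batch_size=4, sequence_length=1024)
print(meter.tokens_per_second())           # 8192 tokens / 1.0 s = 8192.0
```

Including the two warmup steps would have reported roughly 819 tokens/s, an order of magnitude below the steady-state rate.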

Training Loop Pattern

The canonical training loop pattern with DeepSpeed:

model_engine.train()

for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        # Move batch to device
        batch = {k: v.to(model_engine.device) for k, v in batch.items()}

        # Forward pass
        outputs = model_engine(**batch)
        loss = outputs.loss

        # Backward pass (managed by DeepSpeed)
        model_engine.backward(loss)

        # Optimizer step (managed by DeepSpeed)
        model_engine.step()

Logging and Monitoring

The training loop supports two logging mechanisms:

  • Console logging -- Periodic log messages (controlled by log_interval) showing loss, step time, TFLOPS, and tokens/second.
  • WandB logging -- When enabled, metrics are logged to Weights & Biases for visualization and experiment tracking.

Logged metrics include:

Metric         Key                     Description
Training loss  train/loss              Average loss over the log interval
Epoch          train/epoch             Current epoch number
Global step    train/global_step       Total training steps completed
Learning rate  train/learning_rate     Current learning rate
Step time      perf/step_time_ms       Time per training step in milliseconds
Tokens/s       perf/tokens_per_second  Tokens processed per second
TFLOPS         perf/tflops             Achieved TFLOPS (dense models only)
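The metric keys above could be assembled as in the following sketch. The function name and signature are hypothetical; the resulting dict is the kind of payload that could be printed to the console or passed to wandb.log when WandB is enabled.

```python
def build_log_metrics(interval_losses, epoch, global_step, learning_rate,
                      step_time_s, tokens_per_second, tflops=None):
    # Hypothetical helper that maps raw values onto the metric keys listed above.
    metrics = {
        "train/loss": sum(interval_losses) / len(interval_losses),
        "train/epoch": epoch,
        "train/global_step": global_step,
        "train/learning_rate": learning_rate,
        "perf/step_time_ms": step_time_s * 1000.0,
        "perf/tokens_per_second": tokens_per_second,
    }
    if tflops is not None:
        # Only reported for dense models, where the estimate applies.
        metrics["perf/tflops"] = tflops
    return metrics


metrics = build_log_metrics([1.0, 3.0], epoch=0, global_step=10,
                            learning_rate=1e-5, step_time_s=0.25,
                            tokens_per_second=1000.0)
print(metrics["train/loss"], metrics["perf/step_time_ms"])  # 2.0 250.0
```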

Benchmarking Mode

The training loop supports a benchmarking mode controlled by bench_steps. When set, training stops after the specified number of steps regardless of epochs. This is useful for performance measurement without running full training.
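The early-exit behaviour can be sketched with a stand-in engine so the control flow runs without DeepSpeed installed. StubEngine and run_training are hypothetical names; only the backward()/step() call pattern and the bench_steps cutoff come from the description above.

```python
class StubEngine:
    """Hypothetical stand-in exposing the engine methods used on this page."""

    def __init__(self):
        self.steps = 0

    def __call__(self, batch):
        return sum(batch) / len(batch)  # pretend this is the loss

    def backward(self, loss):
        pass  # a real engine would scale the loss and run backprop here

    def step(self):
        self.steps += 1  # a real engine would update parameters here


def run_training(engine, dataloader, num_epochs, bench_steps=None):
    global_step = 0
    for _epoch in range(num_epochs):
        for batch in dataloader:
            loss = engine(batch)
            engine.backward(loss)
            engine.step()
            global_step += 1
            # Benchmarking mode: stop after bench_steps regardless of epochs.
            if bench_steps is not None and global_step >= bench_steps:
                return global_step
    return global_step


data = [[1.0, 2.0]] * 10
print(run_training(StubEngine(), data, num_epochs=3, bench_steps=5))  # 5
print(run_training(StubEngine(), data, num_epochs=3))                 # 30
```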
