Principle:Microsoft DeepSpeedExamples DeepSpeed Training Loop

Metadata

Field	Value
Page Type	Principle
Title	DeepSpeed_Training_Loop
Repository	Microsoft/DeepSpeedExamples
Domains	Deep_Learning, Training, Performance
Status	Active
Related Implementation	Implementation:Microsoft_DeepSpeedExamples_Main_Training_Loop_SuperOffload

Overview

A training pattern that uses the DeepSpeed engine's managed backward pass and optimizer step for memory-efficient distributed training.

Description

The DeepSpeed training loop replaces PyTorch's manual training pattern with the DeepSpeed engine's managed operations. In standard PyTorch, a training step consists of:

# Standard PyTorch training step
optimizer.zero_grad()
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()

With DeepSpeed, this becomes:

# DeepSpeed training step
outputs = model_engine(**batch)
loss = outputs.loss
model_engine.backward(loss)
model_engine.step()

The key differences are:

No explicit optimizer.zero_grad() -- The DeepSpeed engine zeroes gradients internally as part of the step() call, once an optimizer update has been applied at a gradient accumulation boundary.
model_engine.backward(loss) instead of loss.backward() -- The engine manages loss scaling for mixed precision, gradient accumulation across micro-batches, and gradient reduce-scatter for distributed training.
model_engine.step() instead of optimizer.step() -- The engine coordinates the optimizer update (on CPU when offloading is enabled), parameter gathering/scattering, and learning rate scheduling.

Theoretical Basis

The DeepSpeed engine manages four critical aspects of the training loop transparently:

1. Gradient Accumulation

When gradient_accumulation_steps > 1, the engine accumulates gradients across multiple micro-batches before performing an optimizer step. This increases the effective batch size without increasing activation memory, because only one micro-batch is processed at a time:

effective_batch_size = train_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus

The engine tracks the micro-batch index internally and only triggers the optimizer step after the final micro-batch in each accumulation cycle.
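The arithmetic and the boundary check described above can be sketched in plain Python. The helper names below are hypothetical and exist only for illustration; this is a simplified model of the bookkeeping, not the engine's code.

```python
def effective_batch_size(micro_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Samples contributing to a single optimizer update."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


def is_accumulation_boundary(micro_step: int,
                             gradient_accumulation_steps: int) -> bool:
    """True when the 0-based micro_step is the last one in its accumulation cycle."""
    return (micro_step + 1) % gradient_accumulation_steps == 0


# With 4 accumulation steps, 8 micro-batches give two optimizer updates.
boundaries = [s for s in range(8) if is_accumulation_boundary(s, 4)]
```

Here `boundaries` holds the micro-batch indices after which an optimizer update would fire; every other call to step() only advances the counter.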

2. Loss Scaling for Mixed Precision

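When training in reduced precision, the engine manages loss scaling to prevent gradient underflow:

For BF16: A static loss scale, effectively 1 (BF16 has the same exponent range as FP32, so dynamic scaling is unnecessary)
For FP16: Dynamic loss scaling with automatic scale adjustment (the scale is lowered when an overflow is detected and raised again after a run of overflow-free steps)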

The model_engine.backward(loss) call applies the appropriate scaling before calling loss.backward().
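To make the FP16 case concrete, the following is a toy dynamic loss scaler in plain Python. The class name, default values, and update rule are illustrative assumptions; DeepSpeed's own scaler differs in detail.

```python
import math


class ToyDynamicLossScaler:
    """Illustrative only: halve the scale on overflow, grow it after a clean window."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0,
                 scale_window=1000, min_scale=1.0):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.good_steps = 0

    def scale_loss(self, loss):
        # Multiply the loss so that small gradients stay representable in FP16.
        return loss * self.scale

    def unscale(self, grads):
        # Undo the scaling before the optimizer consumes the gradients.
        return [g / self.scale for g in grads]

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: skip this step and back off the scale.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.scale_window == 0:
            self.scale *= self.scale_factor
        return True
```

A step whose gradients contain inf or NaN is skipped rather than applied, which is why a few skipped steps early in FP16 training are normal.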

3. Distributed Gradient Communication

After the backward pass, the engine performs gradient communication across GPUs:

ZeRO Stage 3: Reduce-scatter operation -- each GPU sends its gradient contributions and receives only the 1/N partition it owns.
Communication is bucketed (controlled by reduce_bucket_size) for efficiency.
With CPU offloading, gradients are transferred to CPU for the optimizer step.
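The reduce-scatter step can be simulated on a single process with plain lists. This is a conceptual sketch only: the function name is hypothetical, the reduction is a plain sum (real training typically also averages over ranks), and no actual communication or bucketing takes place.

```python
def simulate_reduce_scatter(per_rank_grads):
    """per_rank_grads[r] is rank r's full local gradient as a flat list.

    Returns one shard per rank: rank r receives the element-wise sum
    over all ranks of the 1/N slice it owns.
    """
    world_size = len(per_rank_grads)
    numel = len(per_rank_grads[0])
    if numel % world_size != 0:
        raise ValueError("gradient length must be divisible by world size")
    shard = numel // world_size
    # Reduce: element-wise sum across ranks.
    summed = [sum(column) for column in zip(*per_rank_grads)]
    # Scatter: each rank keeps only its own contiguous slice.
    return [summed[r * shard:(r + 1) * shard] for r in range(world_size)]


shards = simulate_reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]])
```

With two ranks, rank 0 ends up holding the reduced first half of the gradient and rank 1 the second half, so no rank stores the full reduced gradient.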

4. CPU Offload Coordination

With ZeRO-3 + CPU offloading, the model_engine.step() call coordinates:

Transfer gradients from GPU to CPU (for the owned 1/N partition)
Execute DeepSpeedCPUAdam on CPU (using SIMD-optimized kernels)
Transfer updated parameters back from CPU to GPU
All-gather updated parameters when needed for the next forward pass
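A configuration along these lines would enable the behavior described above. The keys are standard DeepSpeed config fields, but the values are illustrative assumptions and not the settings used by the referenced implementation.

```python
# Illustrative DeepSpeed config: ZeRO Stage 3 with the optimizer offloaded to CPU.
# All numeric values are placeholders chosen for the example.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Keep optimizer state on the host and run the update there.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Number of elements reduced per communication bucket.
        "reduce_bucket_size": 500_000_000,
    },
}
```

A dictionary like this is typically passed to deepspeed.initialize through its config argument, together with the model and its parameters.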

Performance Metrics

The training loop tracks several performance metrics:

TFLOPS Estimation

For dense (non-MoE) transformer models, TFLOPS is estimated using the formula:

coefficient = 4 if activation_checkpointing else 3
tflops_per_sample = (2 * coefficient * model_size * seq_len
                     + 2 * 2 * coefficient * num_layers * hidden_size * seq_len**2) / 1e12
step_tflops = batch_size * tflops_per_sample / step_time

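The coefficient is 3 without activation checkpointing (one forward pass plus a backward pass that costs roughly twice the forward, i.e. 3x forward compute) and 4 with it (one extra forward pass for recomputation). Here model_size is the parameter count and step_time is in seconds.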
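The estimate can be wrapped in a small function. The function name and argument order are hypothetical; the arithmetic follows the formula above term by term.

```python
def estimate_step_tflops(model_size, num_layers, hidden_size, seq_len,
                         batch_size, step_time,
                         activation_checkpointing=False):
    """Estimated achieved TFLOPS for one training step of a dense transformer.

    model_size is the parameter count; step_time is in seconds.
    """
    coefficient = 4 if activation_checkpointing else 3
    # Parameter (matmul) term plus the attention term quadratic in seq_len.
    flops_per_sample = (
        2 * coefficient * model_size * seq_len
        + 2 * 2 * coefficient * num_layers * hidden_size * seq_len ** 2
    )
    tflops_per_sample = flops_per_sample / 1e12
    return batch_size * tflops_per_sample / step_time
```

Because the coefficient multiplies both terms, enabling activation checkpointing raises the estimate by exactly 4/3 for the same measured step time.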

Tokens Per Second

tokens_per_second = (batch_size * sequence_length) / step_time

Warmup Steps

The first N steps (warmup_steps) are excluded from performance timing because they include JIT compilation, memory allocation, and cache warming overhead that would skew measurements.
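One way to exclude warmup steps from throughput figures is a small accumulator that ignores the first warmup_steps measurements. The class below is a hypothetical sketch, not the referenced implementation's timing code.

```python
class ThroughputMeter:
    """Accumulates step timings, skipping the first warmup_steps steps."""

    def __init__(self, warmup_steps):
        self.warmup_steps = warmup_steps
        self.steps_seen = 0
        self.timed_steps = 0
        self.total_time = 0.0
        self.total_tokens = 0

    def record(self, step_time, batch_size, sequence_length):
        self.steps_seen += 1
        if self.steps_seen <= self.warmup_steps:
            return  # Warmup step: counted, but not timed.
        self.timed_steps += 1
        self.total_time += step_time
        self.total_tokens += batch_size * sequence_length

    def tokens_per_second(self):
        if self.total_time == 0:
            return 0.0
        return self.total_tokens / self.total_time
```

Calling record() once per step and reading tokens_per_second() at each log interval gives a figure that is not distorted by slow initial steps.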

Training Loop Pattern

The canonical training loop pattern with DeepSpeed:

model_engine.train()

for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        # Move batch to device
        batch = {k: v.to(model_engine.device) for k, v in batch.items()}

        # Forward pass
        outputs = model_engine(**batch)
        loss = outputs.loss

        # Backward pass (managed by DeepSpeed)
        model_engine.backward(loss)

        # Optimizer step (managed by DeepSpeed)
        model_engine.step()

Logging and Monitoring

The training loop supports two logging mechanisms:

Console logging -- Periodic log messages (controlled by log_interval) showing loss, step time, TFLOPS, and tokens/second.
WandB logging -- When enabled, metrics are logged to Weights & Biases for visualization and experiment tracking.

Logged metrics include:

Metric	Key	Description
Training loss	`train/loss`	Average loss over the log interval
Epoch	`train/epoch`	Current epoch number
Global step	`train/global_step`	Total training steps completed
Learning rate	`train/learning_rate`	Current learning rate
Step time	`perf/step_time_ms`	Time per training step in milliseconds
Tokens/s	`perf/tokens_per_second`	Tokens processed per second
TFLOPS	`perf/tflops`	Achieved TFLOPS (dense models only)
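The metric keys in the table can be assembled into a single dictionary per log interval. The helper name and its parameters are hypothetical; only the keys come from the table above.

```python
def build_log_metrics(avg_loss, epoch, global_step, learning_rate,
                      step_time_s, tokens_per_second, tflops=None):
    """Build the metrics dictionary for one logging interval."""
    metrics = {
        "train/loss": avg_loss,
        "train/epoch": epoch,
        "train/global_step": global_step,
        "train/learning_rate": learning_rate,
        "perf/step_time_ms": step_time_s * 1000.0,  # seconds -> milliseconds
        "perf/tokens_per_second": tokens_per_second,
    }
    if tflops is not None:
        # Reported for dense models only; omitted otherwise.
        metrics["perf/tflops"] = tflops
    return metrics
```

The resulting dictionary can be printed for console logging or handed to wandb.log when WandB logging is enabled.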

Benchmarking Mode

The training loop supports a benchmarking mode controlled by bench_steps. When set, training stops after the specified number of steps regardless of epochs. This is useful for performance measurement without running full training.
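A minimal sketch of how bench_steps might cut training short is shown below. The function name is hypothetical, each loop iteration is counted as one step (an assumption), and the batch is assumed to be on the correct device already.

```python
def run_training(model_engine, train_dataloader, num_epochs, bench_steps=None):
    """Run the DeepSpeed-style loop; stop early once bench_steps steps are done.

    Returns the number of steps executed.
    """
    global_step = 0
    model_engine.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            outputs = model_engine(**batch)      # forward pass
            model_engine.backward(outputs.loss)  # managed backward pass
            model_engine.step()                  # managed optimizer step
            global_step += 1
            if bench_steps is not None and global_step >= bench_steps:
                return global_step  # benchmarking mode: ignore remaining epochs
    return global_step
```

The function relies only on the engine being callable and exposing train(), backward() and step(), so it can be exercised with a stub object before running on real hardware.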
