
Principle:Deepspeedai DeepSpeed Pipeline Training Schedule

From Leeroopedia


Overview

The 1F1B (one-forward-one-backward) micro-batch scheduling algorithm overlaps computation across pipeline stages to minimize idle time (the pipeline bubble).

Detailed Description

Pipeline training uses a schedule to orchestrate micro-batch execution across stages. The 1F1B schedule first fills the pipeline with forward passes (warmup phase), then alternates forward and backward passes (steady state), and finally drains remaining backward passes (cooldown). Each micro-batch's activations are sent to the next stage via point-to-point communication. The train_batch() method executes one complete schedule across all micro-batches and returns the aggregated loss.

Schedule Phases

| Phase | Description | Operations |
|---|---|---|
| Warmup | Fill the pipeline with forward passes | Forward micro-batches enter the pipeline; activations flow from stage 0 to stage S-1 |
| Steady state | Alternate one forward and one backward per step | Each step processes one new forward micro-batch and one backward micro-batch, overlapping compute |
| Cooldown | Drain remaining backward passes | No new forward passes; remaining micro-batches complete their backward passes |
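The three phases can be sketched per stage. The warmup-count rule used here (each stage runs S - stage_id - 1 warmup forwards, the standard 1F1B fill) is an assumption for illustration; DeepSpeed computes the equivalent counts internally when generating the schedule:

```python
def one_f_one_b_phases(num_stages: int, num_micro_batches: int, stage_id: int):
    """Illustrative sketch: step counts per phase for one stage under 1F1B."""
    # Warmup: this stage runs forward passes until the pipeline is full.
    # Later stages need fewer warmup forwards; the last stage needs none.
    warmup = min(num_stages - stage_id - 1, num_micro_batches)
    # Steady state: alternate one forward and one backward per pair of steps.
    steady = num_micro_batches - warmup
    # Cooldown: drain the backward passes still pending from warmup forwards.
    cooldown = warmup
    return warmup, steady, cooldown
```

With 4 stages and 8 micro-batches, stage 0 runs 3 warmup forwards, 5 steady-state pairs, and 3 cooldown backwards, while the last stage spends all 8 micro-batches in steady state.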

Instruction Types

The schedule generates sequences of PipeInstruction objects at each step:

  • LoadMicroBatch: Load the next micro-batch from the data iterator into a pipeline buffer (first and last stages only).
  • ForwardPass: Execute the local stage's forward computation on a buffer.
  • BackwardPass: Execute the local stage's backward computation on a buffer.
  • SendActivation: Send activations from a buffer to the next stage.
  • RecvActivation: Receive activations from the previous stage into a buffer.
  • SendGrad: Send gradients from a buffer to the previous stage.
  • RecvGrad: Receive gradients from the next stage into a buffer.
  • ReduceGrads: Allreduce gradients across data-parallel ranks (end of batch only).
  • ReduceTiedGrads: Allreduce gradients of tied weights across pipeline stages (end of batch only).
  • OptimizerStep: Execute the optimizer step (end of batch only).
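As an illustration, a single steady-state step on a middle stage might dispatch a sequence like the following. The instruction names mirror the list above, but the buffer ids and exact ordering are illustrative, not taken from DeepSpeed's generated schedule:

```python
# Hypothetical steady-state step for a middle stage: one new forward
# micro-batch (buffer 0) and one pending backward micro-batch (buffer 1).
steady_state_step = [
    ("RecvActivation", {"buffer_id": 0}),  # inputs for the new forward
    ("ForwardPass",    {"buffer_id": 0}),
    ("SendActivation", {"buffer_id": 0}),  # outputs to the next stage
    ("RecvGrad",       {"buffer_id": 1}),  # grads for the pending backward
    ("BackwardPass",   {"buffer_id": 1}),
    ("SendGrad",       {"buffer_id": 1}),  # grads to the previous stage
]
```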

Buffer Management

The schedule uses a cyclic buffer allocation strategy. Each stage needs at most min(S - stage_id, M) buffers (where S is total stages, M is micro-batches), with a minimum of 2. Earlier stages need more buffers because they have more in-flight micro-batches. The _buffer_idx() method maps micro-batch IDs to buffer indices using modular arithmetic.
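The buffer rules above can be sketched directly. The function names here are illustrative; only the modular-arithmetic mapping mirrors what the article attributes to _buffer_idx():

```python
def num_pipe_buffers(num_stages: int, stage_id: int, num_micro_batches: int) -> int:
    """Sketch of the buffer-count rule: at most min(S - stage_id, M),
    with a floor of 2."""
    return max(2, min(num_stages - stage_id, num_micro_batches))

def buffer_idx(micro_batch_id: int, num_buffers: int) -> int:
    """Cyclic mapping from micro-batch id to a reusable buffer slot."""
    return micro_batch_id % num_buffers
```

Stage 0 of a 4-stage pipeline with 8 micro-batches holds 4 buffers, so micro-batch 5 reuses the slot vacated by micro-batch 1.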

Communication Ordering

To avoid deadlocks, even-numbered and odd-numbered stages alternate the order of send/recv operations:

  • Even stages: Send first, then receive (for forward communication).
  • Odd stages: Receive first, then send (for forward communication).

This ensures that each send on one stage is matched by a corresponding recv on the adjacent stage.
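A minimal sketch of the parity rule (the function name is illustrative):

```python
def forward_comm_order(stage_id: int) -> list:
    """Deadlock-free ordering for forward-direction communication:
    even stages send before receiving; odd stages receive before sending."""
    if stage_id % 2 == 0:
        return ["SendActivation", "RecvActivation"]
    return ["RecvActivation", "SendActivation"]
```

Because adjacent stages always have opposite parity, a stage that starts with a send is always paired with a neighbor that starts with a receive.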

train_batch() Execution Flow

  1. Set the model to training mode.
  2. Create a TrainSchedule with the current micro-batch count and stage configuration.
  3. Execute the schedule via _exec_schedule(), which iterates over schedule steps and dispatches each PipeInstruction to the corresponding handler method.
  4. Aggregate the total loss: scale by gradient accumulation steps, average across data-parallel ranks, and broadcast to all pipeline stages.
  5. Return the aggregated loss.
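Step 3 above can be sketched as a dispatch loop. This is a minimal stand-in for _exec_schedule(): here a step is a list of (name, kwargs) pairs and handlers is a hypothetical lookup table, whereas the real engine dispatches PipeInstruction objects to handler methods:

```python
def exec_schedule(schedule, handlers):
    """Iterate over schedule steps and route each instruction to its
    handler by name, passing the instruction's keyword arguments."""
    for step in schedule:
        for name, kwargs in step:
            handlers[name](**kwargs)
```

For example, a two-step schedule that runs a warmup forward and then a 1F1B pair executes the instructions strictly in order.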

Theoretical Basis

1F1B Schedule

The 1F1B schedule for S stages and M micro-batches operates as follows:

For the training schedule, total steps = 2 * (M + S - 1), covering both forward and backward passes. The schedule maps each step index to a (micro_batch_id, is_forward) pair; the stage's parity (even/odd) then determines the communication ordering within that step.
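The step count can be written directly (the function name is illustrative):

```python
def total_train_steps(num_micro_batches: int, num_stages: int) -> int:
    """Total 1F1B schedule steps: a forward and a backward slot for each
    of M micro-batches, plus fill and drain slots across S stages."""
    return 2 * (num_micro_batches + num_stages - 1)
```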

Pipeline Bubble Analysis

  • Total compute slots per stage: 2 * M (M forward + M backward)
  • Idle slots per stage: 2 * (S - 1)
  • Bubble ratio: (S - 1) / (M + S - 1)

To keep the bubble overhead at or below 10%, one needs M >= 9 * (S - 1). For example, with 4 stages, at least 27 micro-batches are needed.
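These formulas can be checked numerically. The function names are illustrative; exact rational arithmetic avoids float rounding in the threshold calculation:

```python
import math
from fractions import Fraction

def bubble_ratio(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of a stage's schedule slots spent idle under 1F1B."""
    s, m = num_stages, num_micro_batches
    return (s - 1) / (m + s - 1)

def micro_batches_for_bubble(num_stages: int, max_ratio=Fraction(1, 10)) -> int:
    """Smallest M with bubble_ratio <= max_ratio, solved from
    (S-1)/(M+S-1) <= r  =>  M >= (S-1)(1-r)/r."""
    r = Fraction(max_ratio)
    s = num_stages
    return math.ceil(Fraction(s - 1) * (1 - r) / r)
```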

Gradient Accumulation Equivalence

Pipeline parallelism with M micro-batches is mathematically equivalent to data-parallel training with gradient accumulation over M steps, followed by an optimizer step. The gradients from individual micro-batches are accumulated, and the optimizer step occurs only after all M micro-batches have completed their forward and backward passes.
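The equivalence can be demonstrated on a toy scalar model. Everything here is illustrative, and the batch size is assumed divisible by the number of micro-batches:

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the scalar model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, num_micro_batches):
    """Split the batch into micro-batches and accumulate each micro-batch's
    mean gradient, weighted by its share of the full batch, as gradient
    accumulation does before the single optimizer step."""
    n = len(xs)
    size = n // num_micro_batches  # assumes n is divisible by M
    total = 0.0
    for i in range(num_micro_batches):
        mb_x = xs[i * size:(i + 1) * size]
        mb_y = ys[i * size:(i + 1) * size]
        total += grad_mse(w, mb_x, mb_y) * (size / n)
    return total
```

The accumulated gradient matches the full-batch gradient, so deferring the optimizer step until all M micro-batches finish leaves the update unchanged.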


Last updated: 2026-02-09 00:00 GMT
