Principle:Deepspeedai DeepSpeed Pipeline Training Schedule
Overview
The 1F1B (one-forward-one-backward) micro-batch scheduling algorithm overlaps computation across pipeline stages to minimize idle time (the pipeline bubble).
Detailed Description
Pipeline training uses a schedule to orchestrate micro-batch execution across stages. The 1F1B schedule first fills the pipeline with forward passes (warmup phase), then alternates forward and backward passes (steady state), and finally drains remaining backward passes (cooldown). Each micro-batch's activations are sent to the next stage via point-to-point communication. The train_batch() method executes one complete schedule across all micro-batches and returns the aggregated loss.
Schedule Phases
| Phase | Description | Operations |
|---|---|---|
| Warmup | Fill the pipeline with forward passes | Forward micro-batches enter the pipeline, activations flow from stage 0 to stage S-1 |
| Steady State | Alternate 1 forward and 1 backward per step | Each step processes one new forward micro-batch and one backward micro-batch, overlapping compute |
| Cooldown | Drain remaining backward passes | No new forward passes; remaining micro-batches complete their backward passes |
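The three phases above can be sketched as a small generator that emits the forward/backward event stream for one stage. This is a hypothetical illustration of the generic 1F1B phase structure, not DeepSpeed's actual TrainSchedule code; the function name and warmup formula are assumptions.

```python
def one_f_one_b(stage_id, num_stages, num_micro):
    """Yield ('F', mb_id) / ('B', mb_id) events for one stage under generic 1F1B."""
    # Warmup: earlier stages run more forwards before their first backward.
    warmup = min(num_stages - stage_id - 1, num_micro)
    fwd = bwd = 0
    for _ in range(warmup):
        yield ('F', fwd)
        fwd += 1
    # Steady state: alternate one forward and one backward.
    while fwd < num_micro:
        yield ('F', fwd)
        fwd += 1
        yield ('B', bwd)
        bwd += 1
    # Cooldown: no new forwards; drain the remaining backwards.
    while bwd < num_micro:
        yield ('B', bwd)
        bwd += 1
```

For example, stage 0 of a 4-stage pipeline with 8 micro-batches starts with 3 warmup forwards, while the last stage has no warmup and alternates from its first step.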
Instruction Types
The schedule generates sequences of PipeInstruction objects at each step:
- LoadMicroBatch: Load the next micro-batch from the data iterator into a pipeline buffer (first and last stages only).
- ForwardPass: Execute the local stage's forward computation on a buffer.
- BackwardPass: Execute the local stage's backward computation on a buffer.
- SendActivation: Send activations from a buffer to the next stage.
- RecvActivation: Receive activations from the previous stage into a buffer.
- SendGrad: Send gradients from a buffer to the previous stage.
- RecvGrad: Receive gradients from the next stage into a buffer.
- ReduceGrads: Allreduce gradients across data-parallel ranks (end of batch only).
- ReduceTiedGrads: Allreduce gradients of tied weights across pipeline stages (end of batch only).
- OptimizerStep: Execute the optimizer step (end of batch only).
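The instruction-dispatch pattern behind these types can be sketched as follows. The class and handler names here are illustrative stand-ins, not DeepSpeed's exact API; only the pattern (a schedule of instruction lists dispatched by type) reflects the description above.

```python
class PipeInstruction:
    def __init__(self, buffer_id=None):
        self.buffer_id = buffer_id

class ForwardPass(PipeInstruction): pass
class BackwardPass(PipeInstruction): pass
class OptimizerStep(PipeInstruction): pass

log = []
# Map each instruction type to a handler, mimicking the engine's dispatch table.
handlers = {
    ForwardPass: lambda i: log.append(f"fwd[{i.buffer_id}]"),
    BackwardPass: lambda i: log.append(f"bwd[{i.buffer_id}]"),
    OptimizerStep: lambda i: log.append("step"),
}

def exec_schedule(schedule):
    # Iterate over schedule steps; dispatch each instruction to its handler.
    for step in schedule:
        for instr in step:
            handlers[type(instr)](instr)

exec_schedule([[ForwardPass(0)],
               [ForwardPass(1), BackwardPass(0)],
               [BackwardPass(1), OptimizerStep()]])
```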
Buffer Management
The schedule uses a cyclic buffer allocation strategy. Each stage needs at most min(S - stage_id, M) buffers (where S is total stages, M is micro-batches), with a minimum of 2. Earlier stages need more buffers because they have more in-flight micro-batches. The _buffer_idx() method maps micro-batch IDs to buffer indices using modular arithmetic.
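The buffer-count and cyclic-index rules stated above can be written directly; these helper names are hypothetical, but the formulas follow the text:

```python
def num_pipe_buffers(num_stages, stage_id, micro_batches):
    # At most min(S - stage_id, M) in-flight micro-batches, never fewer than 2.
    return max(2, min(num_stages - stage_id, micro_batches))

def buffer_idx(micro_batch_id, num_buffers):
    # Cyclic reuse: micro-batch IDs map onto buffers by modular arithmetic.
    return micro_batch_id % num_buffers
```

With 4 stages and 8 micro-batches, stage 0 holds 4 buffers while the last stage needs only the minimum of 2.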
Communication Ordering
To avoid deadlocks, even-numbered and odd-numbered stages alternate the order of send/recv operations:
- Even stages: Send first, then receive (for forward communication).
- Odd stages: Receive first, then send (for forward communication).
This ensures that each send on one stage is matched by a corresponding recv on the adjacent stage.
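The parity rule can be sketched as a tiny helper, plus a check that any two adjacent stages interlock (one sends first while its neighbor receives first), which is what prevents the deadlock. The function name is illustrative:

```python
def p2p_order(stage_id):
    # Even stages send activations before receiving; odd stages do the reverse.
    return ('send', 'recv') if stage_id % 2 == 0 else ('recv', 'send')

# Adjacent stages always start with opposite operations, so every blocking
# send is matched by a concurrent recv on the neighboring stage.
for s in range(7):
    assert p2p_order(s)[0] != p2p_order(s + 1)[0]
```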
train_batch() Execution Flow
- Set the model to training mode.
- Create a TrainSchedule with the current micro-batch count and stage configuration.
- Execute the schedule via _exec_schedule(), which iterates over schedule steps and dispatches each PipeInstruction to the corresponding handler method.
- Aggregate the total loss: scale by gradient accumulation steps, average across data-parallel ranks, and broadcast to all pipeline stages.
- Return the aggregated loss.
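The loss-aggregation step at the end of the flow can be sketched with two plain functions. These helpers are hypothetical: the division stands in for scaling by gradient accumulation steps, and the second function stands in for the allreduce-average across data-parallel ranks (the real engine then broadcasts the result to all pipeline stages).

```python
def local_batch_loss(micro_losses):
    # Scale the summed loss by the number of gradient accumulation steps,
    # which equals the number of micro-batches in the batch.
    return sum(micro_losses) / len(micro_losses)

def dp_average(rank_losses):
    # Stand-in for the allreduce-average across data-parallel ranks.
    return sum(rank_losses) / len(rank_losses)
```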
Theoretical Basis
1F1B Schedule
The 1F1B schedule for S stages and M micro-batches operates as follows:
For the training schedule, total steps = 2 * (M + S - 1), covering both forward and backward passes. The schedule maps each step to a (micro_batch_id, is_forward) pair, using the parity of the step and stage indices to determine whether a step runs a forward or backward pass and to order communication.
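One plausible encoding of the parity rule is sketched below; this exact condition is an assumption for illustration, not verified against DeepSpeed's source:

```python
def is_forward_step(step_id, stage_id):
    # Assumed parity rule: a step is a forward pass when the step index and
    # stage index share parity, and a backward pass otherwise, so each stage
    # alternates forward and backward across the 2 * (M + S - 1) steps.
    return (step_id % 2) == (stage_id % 2)
```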
Pipeline Bubble Analysis
- Total compute slots per stage: 2 * M (M forward + M backward)
- Idle slots per stage: 2 * (S - 1)
- Bubble ratio: (S - 1) / (M + S - 1)
To keep the bubble ratio at or below 10%, one needs M >= 9 * (S - 1). For example, with 4 stages, at least 27 micro-batches are needed.
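These formulas are easy to check numerically. The helper names below are illustrative; the second function finds the smallest M meeting a target bubble ratio by direct search rather than closed form:

```python
def bubble_ratio(num_stages, micro_batches):
    # Idle fraction of the 1F1B schedule: (S - 1) / (M + S - 1).
    return (num_stages - 1) / (micro_batches + num_stages - 1)

def min_micro_batches(num_stages, max_bubble=0.10):
    # Smallest M with bubble_ratio <= max_bubble; equivalent to M >= 9 * (S - 1)
    # when max_bubble is 10%.
    m = 1
    while bubble_ratio(num_stages, m) > max_bubble:
        m += 1
    return m
```

For S = 4, min_micro_batches returns 27, matching the 9 * (S - 1) rule of thumb above.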
Gradient Accumulation Equivalence
Pipeline parallelism with M micro-batches is mathematically equivalent to data-parallel training with gradient accumulation over M steps, followed by an optimizer step. The gradients from individual micro-batches are accumulated, and the optimizer step occurs only after all M micro-batches have completed their forward and backward passes.
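The equivalence can be verified on a toy model: for a scalar linear model with summed squared error, accumulating per-micro-batch gradients before the optimizer step yields the same gradient as one full-batch pass. The data and helper here are made up for illustration:

```python
def grad_w(w, batch):
    # Gradient of the summed squared error sum((w*x - y)^2) w.r.t. w.
    return sum(2 * x * (w * x - y) for x, y in batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]

# Full-batch gradient in one shot.
full = grad_w(w, data)

# Gradients accumulated over M = 2 micro-batches before a single optimizer
# step, as in the pipeline schedule.
accum = grad_w(w, data[:2]) + grad_w(w, data[2:])
```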
References
- GPipe: https://arxiv.org/abs/1811.06965
- PipeDream: https://arxiv.org/abs/1806.03377
Related Pages
- Implementation:Deepspeedai_DeepSpeed_PipelineEngine_Train_Batch
- Principle:Deepspeedai_DeepSpeed_Pipeline_Engine_Init
- Principle:Deepspeedai_DeepSpeed_Pipeline_Evaluation
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/pipeline/
- https://arxiv.org/abs/1811.06965
- https://arxiv.org/abs/1806.03377
Last updated: 2026-02-09 00:00 GMT