# Heuristic: Deepspeedai DeepSpeed ZeRO Pipeline Incompatibility
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
## Overview
ZeRO Stage 2 and Stage 3 are incompatible with Pipeline Parallelism; only ZeRO Stage 0 or 1 can be used with `PipelineEngine`.
## Description
DeepSpeed's Pipeline Parallelism (`PipelineEngine`) and ZeRO optimizer stages 2 and 3 are fundamentally incompatible. Pipeline Parallelism manages its own gradient communication through a 1F1B (one forward, one backward) micro-batch schedule with point-to-point (P2P) inter-stage communication. ZeRO Stage 2 (gradient partitioning) and Stage 3 (parameter partitioning) both require control over gradient all-reduce operations, which conflicts with Pipeline Parallelism's custom communication schedule. The `PipelineEngine` explicitly disables backward all-reduce (`self.enable_backward_allreduce = False`) and manages its own gradient reduction.
## Usage
Use this heuristic when designing a training configuration that combines model partitioning across stages (pipeline parallelism) with memory optimization. If you need both pipeline stages and ZeRO memory savings, you are limited to ZeRO Stage 0 (disabled) or ZeRO Stage 1 (optimizer state partitioning). For deeper memory optimization, consider using ZeRO Stage 3 with tensor parallelism instead of pipeline parallelism.
## The Insight (Rule of Thumb)
- Action: When using `PipelineEngine`, set `"zero_optimization": {"stage": 0}` or `"zero_optimization": {"stage": 1}` in the DeepSpeed config.
- Value: ZeRO Stage 0 or 1 only.
- Trade-off: You lose ZeRO-2/3 memory savings when using pipeline parallelism. For very large models, consider tensor parallelism + ZeRO-3 as an alternative.
- Compatibility: Elasticity is also not supported with pipeline parallelism.
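As a concrete illustration, here is a minimal config sketch that is safe to use with `PipelineEngine`. The field names follow the standard DeepSpeed JSON config schema; the batch-size values are illustrative placeholders, not recommendations.

```python
# Minimal sketch of a DeepSpeed config compatible with PipelineEngine:
# ZeRO is capped at stage 1 (optimizer state partitioning only).
ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,  # 4 micro * 2 accum * 4 DP ranks = 32
    "zero_optimization": {
        "stage": 1,  # stage 2 or 3 would trip the PipelineEngine assertion
    },
}

# Guard mirroring the engine's own check: stages >= 2 are rejected.
assert ds_config["zero_optimization"]["stage"] < 2, \
    "ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism"
```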
## Reasoning
Pipeline Parallelism implements a custom communication schedule (1F1B) in which micro-batches are pipelined across stages. Each stage communicates activations and gradients only with adjacent stages via P2P operations. ZeRO Stage 2 requires all-reduce of gradient partitions across data-parallel ranks, and ZeRO Stage 3 requires all-gather of parameters before forward/backward passes; these collective operations conflict with Pipeline's P2P-only communication pattern. Additionally, DeepSpeed warns that BF16 gradient accumulation under pipeline parallelism is numerically unsafe with a large number of accumulation steps.
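The 1F1B schedule described above can be sketched as a pure scheduling function. This is a simplified model (it ignores P2P communication and assumes a warmup of `num_stages - stage_id - 1` forwards), not DeepSpeed's actual scheduler:

```python
def one_f_one_b(num_stages: int, stage_id: int, num_microbatches: int):
    """Simplified 1F1B step sequence for one pipeline stage.

    Earlier stages run extra warmup forwards before their first backward;
    the steady state then alternates one forward with one backward.
    """
    warmup = min(num_stages - stage_id - 1, num_microbatches)
    steps, fwd, bwd = [], 0, 0
    for _ in range(warmup):                 # warmup: forwards only
        steps.append(("F", fwd)); fwd += 1
    while fwd < num_microbatches:           # steady state: 1F then 1B
        steps.append(("F", fwd)); fwd += 1
        steps.append(("B", bwd)); bwd += 1
    while bwd < num_microbatches:           # drain the remaining backwards
        steps.append(("B", bwd)); bwd += 1
    return steps

# The last stage has no warmup, so it strictly alternates F0 B0 F1 B1 ...
print(one_f_one_b(num_stages=4, stage_id=3, num_microbatches=3))
```

Because every gradient produced by a `("B", i)` step is reduced by the schedule itself, a ZeRO-2/3 hook that all-reduces gradients during `backward()` would fire at the wrong points, which is why the engine disables it.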
The batch size relationship in pipeline parallelism is strictly enforced: `train_batch_size == micro_batch_size * gradient_accumulation_steps * data_parallel_size`
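A quick sanity check of that identity (the function name is mine, not DeepSpeed's):

```python
def check_pipeline_batch_config(train_batch_size: int,
                                micro_batch_size: int,
                                grad_accum_steps: int,
                                data_parallel_size: int) -> None:
    """Raise if the strict pipeline batch-size identity does not hold."""
    expected = micro_batch_size * grad_accum_steps * data_parallel_size
    if train_batch_size != expected:
        raise ValueError(
            f"train_batch_size={train_batch_size} but "
            f"{micro_batch_size} * {grad_accum_steps} * "
            f"{data_parallel_size} = {expected}"
        )

check_pipeline_batch_config(32, 4, 2, 4)  # consistent: 4 * 2 * 4 == 32
```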
## Code Evidence
Hard assertion from `deepspeed/runtime/pipe/engine.py:76-77`:
```python
assert self.zero_optimization_stage(
) < ZeroStageEnum.gradients, "ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism"
```
Backward all-reduce disabled from `deepspeed/runtime/pipe/engine.py:80`:
```python
# We schedule the all-reduces, so disable it in super().backward()
self.enable_backward_allreduce = False
```
Elasticity incompatibility from `deepspeed/runtime/pipe/engine.py:90-93`:
```python
if self.elasticity_enabled():
    if not self.is_elastic_model_parallel_supported():
        assert not self.elasticity_enabled(), "Elasticity is not currently supported" \
            " with pipeline parallelism."
```
BF16 warning from `deepspeed/runtime/engine.py:1542-1546`:
```python
if model_dtype == torch.bfloat16 and self.pipeline_parallelism:
    logger.warning(
        "**** BF16 gradient accumulation is not safe numerically with large number "
        "of accumulation steps, proceed with caution *****"
    )
```