# Heuristic: Deepspeedai DeepSpeed ZeRO Pipeline Incompatibility
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
## Overview
ZeRO Stage 2 and Stage 3 are incompatible with Pipeline Parallelism; only ZeRO Stage 0 or 1 can be used with `PipelineEngine`.
## Description
DeepSpeed's Pipeline Parallelism (`PipelineEngine`) and ZeRO optimizer stages 2 and 3 are fundamentally incompatible. Pipeline Parallelism manages its own gradient communication through a 1F1B (one forward, one backward) micro-batch schedule with point-to-point (P2P) inter-stage communication. ZeRO Stage 2 (gradient partitioning) and Stage 3 (parameter partitioning) both require control over gradient all-reduce operations, which conflicts with Pipeline Parallelism's custom communication schedule. The `PipelineEngine` explicitly disables backward all-reduce (`self.enable_backward_allreduce = False`) and manages its own gradient reduction.
## Usage
Use this heuristic when designing a training configuration that combines model partitioning across stages (pipeline parallelism) with memory optimization. If you need both pipeline stages and ZeRO memory savings, you are limited to ZeRO Stage 0 (disabled) or ZeRO Stage 1 (optimizer state partitioning). For deeper memory optimization, consider using ZeRO Stage 3 with tensor parallelism instead of pipeline parallelism.
## The Insight (Rule of Thumb)
- Action: When using `PipelineEngine`, set `"zero_optimization": {"stage": 0}` or `"zero_optimization": {"stage": 1}` in the DeepSpeed config.
- Value: ZeRO Stage 0 or 1 only.
- Trade-off: You lose ZeRO-2/3 memory savings when using pipeline parallelism. For very large models, consider tensor parallelism + ZeRO-3 as an alternative.
- Compatibility: Elasticity is also not supported with pipeline parallelism.
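As a concrete illustration, here is a minimal config sketch that is safe to use with `PipelineEngine`. The field names follow the standard DeepSpeed JSON config schema; the batch-size values are illustrative placeholders, not recommendations.

```python
# Minimal sketch of a DeepSpeed config compatible with PipelineEngine:
# ZeRO is capped at stage 1 (optimizer state partitioning only).
ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,  # 4 micro * 2 accum * 4 DP ranks = 32
    "zero_optimization": {
        "stage": 1,  # stage 2 or 3 would trip the PipelineEngine assertion
    },
}

# Guard mirroring the engine's own check: stages >= 2 are rejected.
assert ds_config["zero_optimization"]["stage"] < 2, \
    "ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism"
```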
## Reasoning
Pipeline Parallelism implements a custom communication schedule (1F1B) in which micro-batches are pipelined across stages. Each stage communicates activations and gradients only with adjacent stages via P2P operations. ZeRO Stage 2 requires all-reduce of gradient partitions across data-parallel ranks, and ZeRO Stage 3 requires all-gather of parameters before forward/backward passes; these collective operations conflict with Pipeline's P2P-only communication pattern. Additionally, DeepSpeed warns that BF16 gradient accumulation under pipeline parallelism is numerically unsafe with a large number of accumulation steps.
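The 1F1B schedule described above can be sketched as a pure scheduling function. This is a simplified model (it ignores P2P communication and assumes a warmup of `num_stages - stage_id - 1` forwards), not DeepSpeed's actual scheduler:

```python
def one_f_one_b(num_stages: int, stage_id: int, num_microbatches: int):
    """Simplified 1F1B step sequence for one pipeline stage.

    Earlier stages run extra warmup forwards before their first backward;
    the steady state then alternates one forward with one backward.
    """
    warmup = min(num_stages - stage_id - 1, num_microbatches)
    steps, fwd, bwd = [], 0, 0
    for _ in range(warmup):                 # warmup: forwards only
        steps.append(("F", fwd)); fwd += 1
    while fwd < num_microbatches:           # steady state: 1F then 1B
        steps.append(("F", fwd)); fwd += 1
        steps.append(("B", bwd)); bwd += 1
    while bwd < num_microbatches:           # drain the remaining backwards
        steps.append(("B", bwd)); bwd += 1
    return steps

# The last stage has no warmup, so it strictly alternates F0 B0 F1 B1 ...
print(one_f_one_b(num_stages=4, stage_id=3, num_microbatches=3))
```

Because every gradient produced by a `("B", i)` step is reduced by the schedule itself, a ZeRO-2/3 hook that all-reduces gradients during `backward()` would fire at the wrong points, which is why the engine disables it.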
The batch size relationship in pipeline parallelism is strictly enforced: `train_batch_size == micro_batch_size * gradient_accumulation_steps * data_parallel_size`
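A quick sanity check of that identity (the function name is mine, not DeepSpeed's):

```python
def check_pipeline_batch_config(train_batch_size: int,
                                micro_batch_size: int,
                                grad_accum_steps: int,
                                data_parallel_size: int) -> None:
    """Raise if the strict pipeline batch-size identity does not hold."""
    expected = micro_batch_size * grad_accum_steps * data_parallel_size
    if train_batch_size != expected:
        raise ValueError(
            f"train_batch_size={train_batch_size} but "
            f"{micro_batch_size} * {grad_accum_steps} * "
            f"{data_parallel_size} = {expected}"
        )

check_pipeline_batch_config(32, 4, 2, 4)  # consistent: 4 * 2 * 4 == 32
```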
## Code Evidence
Hard assertion from `deepspeed/runtime/pipe/engine.py:76-77`:
```python
assert self.zero_optimization_stage(
) < ZeroStageEnum.gradients, "ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism"
```
Backward all-reduce disabled from `deepspeed/runtime/pipe/engine.py:80`:
```python
# We schedule the all-reduces, so disable it in super().backward()
self.enable_backward_allreduce = False
```
Elasticity incompatibility from `deepspeed/runtime/pipe/engine.py:90-93`:
```python
if self.elasticity_enabled():
    if not self.is_elastic_model_parallel_supported():
        assert not self.elasticity_enabled(), "Elasticity is not currently supported" \
            " with pipeline parallelism."
```
BF16 warning from `deepspeed/runtime/engine.py:1542-1546`:
```python
if model_dtype == torch.bfloat16 and self.pipeline_parallelism:
    logger.warning(
        "**** BF16 gradient accumulation is not safe numerically with large number "
        "of accumulation steps, proceed with caution *****"
    )
```