Principle:Deepspeedai DeepSpeed Pipeline Evaluation
Overview
Evaluating pipeline-parallel models using an inference schedule that executes forward passes without backward computation, with optional logit return and loss reduction.
Detailed Description
Pipeline evaluation runs the model in inference mode across all pipeline stages. Unlike training, no backward passes are executed. The InferenceSchedule sends forward micro-batches through the pipeline and collects outputs (loss and/or logits) from the last stage. Results are optionally broadcast to all ranks and reduced (averaged) across micro-batches.
Evaluation vs. Training
| Aspect | Training (train_batch) | Evaluation (eval_batch) |
|---|---|---|
| Schedule | TrainSchedule (1F1B) | InferenceSchedule (forward-only) |
| Backward passes | Yes | No |
| Gradient computation | Enabled | Disabled (torch.no_grad()) |
| Optimizer step | Yes | No |
| Gradient reduction | Yes (ReduceGrads, ReduceTiedGrads) | No |
| Total steps | 2 * (M + S - 1) | M + S - 1 |
| Return value | Aggregated loss | Loss (and optional logits) |
| Pipeline buffers | min(S - stage_id, M) | 2 (alternating) |
InferenceSchedule Details
The InferenceSchedule executes M + S - 1 total steps (where M is micro-batches and S is stages). At each step:
- The first and last stages load micro-batches from the data iterator.
- Even and odd stages alternate send/recv ordering to avoid deadlocks.
- Only forward passes are executed — no backward passes, no gradient communication.
- The schedule uses only 2 pipeline buffers (alternating), since there are no concurrent forward and backward passes to manage.
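The step structure above can be illustrated with a small sketch. This is not DeepSpeed's actual InferenceSchedule implementation, just a simulation of which micro-batch each stage forwards at every step of a forward-only pipeline; the function name and tuple layout are assumptions for illustration.

```python
# Illustrative sketch (not DeepSpeed's InferenceSchedule): enumerate which
# micro-batch each stage forwards at every step of a forward-only pipeline.
def inference_schedule(num_micro_batches, num_stages):
    """Yield (step_id, stage_id, micro_batch_id) for a forward-only schedule."""
    total_steps = num_micro_batches + num_stages - 1
    for step_id in range(total_steps):
        for stage_id in range(num_stages):
            mb = step_id - stage_id  # micro-batch entering this stage, if any
            if 0 <= mb < num_micro_batches:
                yield step_id, stage_id, mb

# With M=4 micro-batches and S=3 stages, the schedule spans M + S - 1 = 6 steps
# (step ids 0..5) and performs M * S = 12 forward passes in total.
schedule = list(inference_schedule(4, 3))
```

Note how the first micro-batch needs S steps to reach the last stage, after which one micro-batch exits the pipeline per step.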
Output Handling
After the schedule completes, the last stage holds the forward outputs for all micro-batches. The engine can:
- Reduce outputs: Average the loss across micro-batches using _reduce_outputs() with configurable reduction ('avg' or None).
- Average across data-parallel ranks: Allreduce the reduced loss across data-parallel groups.
- Broadcast to all pipeline stages: Send the final loss from the last stage to all other stages via _bcast_pipe_scalar().
- Return logits: Optionally return raw model outputs (logits) alongside the loss.
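A minimal sketch of the 'avg' reduction over per-micro-batch losses follows. The real _reduce_outputs() operates on tensors and handles data-parallel groups; this pure-Python stand-in only mirrors the reduction-mode behavior ('avg' averages, None returns the raw per-micro-batch outputs).

```python
# Hedged sketch of the 'avg' / None reduction modes over micro-batch losses;
# DeepSpeed's _reduce_outputs() works on tensors and data-parallel groups.
def reduce_outputs(losses, reduction='avg'):
    if reduction is None:
        return losses                    # unreduced per-micro-batch outputs
    if reduction == 'avg':
        return sum(losses) / len(losses)  # mean over micro-batches
    raise ValueError(f"unsupported reduction: {reduction}")

avg_loss = reduce_outputs([2.0, 4.0, 6.0])  # -> 4.0
```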
Checkpoint Saving for Pipeline Models
Pipeline evaluation is closely related to model checkpointing. The PipelineEngine overrides module_state_dict() to save per-stage layer state dicts rather than a single flat state dict. Each layer is saved as a separate file using save_state_dict(), enabling parallel writes across data-parallel ranks. Loading uses load_state_dir() to read per-layer checkpoint files.
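The per-layer checkpoint layout can be sketched as below. The file-naming pattern and use of JSON are illustrative assumptions; DeepSpeed's save_state_dict()/load_state_dir() serialize real tensor state (e.g. via torch.save) with their own naming scheme.

```python
import json
import os
import tempfile

# Illustrative sketch of per-layer checkpointing: each layer's state is
# written to its own file, enabling parallel writes across ranks.
def save_layers(layer_states, ckpt_dir):
    for idx, state in layer_states.items():
        path = os.path.join(ckpt_dir, f"layer_{idx:02d}-model_states.json")
        with open(path, "w") as f:
            json.dump(state, f)  # stand-in for torch.save(state, path)

def load_layers(ckpt_dir):
    states = {}
    for name in sorted(os.listdir(ckpt_dir)):
        idx = int(name.split("_")[1].split("-")[0])  # recover layer index
        with open(os.path.join(ckpt_dir, name)) as f:
            states[idx] = json.load(f)
    return states

ckpt_dir = tempfile.mkdtemp()
save_layers({0: {"w": [1.0]}, 1: {"w": [2.0]}}, ckpt_dir)
restored = load_layers(ckpt_dir)
```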
Theoretical Basis
Pipeline inference schedule executes only forward passes through the pipeline stages. Without backward passes, the schedule is simpler — each micro-batch passes through all stages sequentially.
Comparison of Schedule Complexity
- Training schedule: 2 * (M + S - 1) steps, interleaving forward and backward.
- Inference schedule: M + S - 1 steps, forward-only. The first micro-batch takes S steps to propagate through all stages, and each subsequent micro-batch adds 1 step.
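The two step counts above are straightforward to compute side by side:

```python
def training_steps(M, S):
    return 2 * (M + S - 1)  # forward + backward interleaved (1F1B)

def inference_steps(M, S):
    return M + S - 1        # forward-only

# e.g. 8 micro-batches over 4 stages: 22 training steps vs. 11 inference steps
train = training_steps(8, 4)
infer = inference_steps(8, 4)
```

Dropping backward passes exactly halves the step count, since each micro-batch otherwise contributes one forward and one backward step per stage slot.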
Buffer Efficiency
The inference schedule needs only 2 buffers because at any given time, each stage processes at most one micro-batch in the current step and has at most one pending from the previous step. The alternating buffer strategy (step_id % 2 and (step_id + 1) % 2) ensures no buffer conflicts.
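The alternating buffer indexing described above can be sketched as a tiny helper (the function name is an assumption; the modulo pattern matches the step_id % 2 and (step_id + 1) % 2 indices quoted in the text):

```python
# Sketch of the alternating two-buffer strategy: at each step, one buffer
# receives the incoming activation while the other holds the previous output.
def buffer_ids(step_id):
    recv_buf = step_id % 2        # buffer receiving this step's input
    send_buf = (step_id + 1) % 2  # buffer holding the previous step's output
    return recv_buf, send_buf
```

Consecutive steps always use opposite receive buffers, so a buffer is never overwritten before its contents have been consumed.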
Loss Reduction
The default reduction ('avg') computes:
- Sum losses across M micro-batches.
- Divide by M (average over micro-batches).
- Allreduce across data-parallel ranks and divide by data-parallel world size.
This yields the same expected loss as if the entire batch were processed on a single device.
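A small worked example of the three steps above, with the data-parallel allreduce simulated as a plain average over ranks (the nesting of the input list is an assumption for illustration):

```python
# Worked example of the default 'avg' reduction: average over M micro-batches
# per rank, then average across data-parallel ranks (allreduce simulated).
def eval_loss(per_rank_micro_losses):
    # Steps 1-2: each rank sums and averages its M micro-batch losses.
    rank_means = [sum(ls) / len(ls) for ls in per_rank_micro_losses]
    # Step 3: allreduce-average across data-parallel ranks (simulated locally).
    return sum(rank_means) / len(rank_means)

# Two ranks, two micro-batches each: rank means are 2.0 and 3.0.
loss = eval_loss([[1.0, 3.0], [2.0, 4.0]])  # -> 2.5
```

This matches the loss that a single device would report for the full batch, since every sample contributes with equal weight.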
Related Pages
- Implementation:Deepspeedai_DeepSpeed_PipelineEngine_Eval_Batch
- Principle:Deepspeedai_DeepSpeed_Pipeline_Training_Schedule
- Principle:Deepspeedai_DeepSpeed_Pipeline_Engine_Init
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/pipeline/
- https://arxiv.org/abs/1811.06965
Last updated: 2026-02-09 00:00 GMT