
Principle:Deepspeedai DeepSpeed Pipeline Evaluation

From Leeroopedia


Overview

Pipeline evaluation runs a pipeline-parallel model with an inference schedule that executes forward passes only, with no backward computation, optionally returning logits and reducing the loss across micro-batches.

Detailed Description

Pipeline evaluation runs the model in inference mode across all pipeline stages. Unlike training, no backward passes are executed. The InferenceSchedule sends forward micro-batches through the pipeline and collects outputs (loss and/or logits) from the last stage. Results are optionally broadcast to all ranks and reduced (averaged) across micro-batches.
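As a minimal sketch (pure Python, not the DeepSpeed API; `eval_batch_sketch` and `model_fn` are hypothetical names), forward-only evaluation with an 'avg' reduction over micro-batches can be modeled as:

```python
def eval_batch_sketch(model_fn, micro_batches, reduction="avg"):
    """Run a forward pass per micro-batch (no gradients, no backward)
    and reduce the resulting losses. In DeepSpeed this happens under
    torch.no_grad() across pipeline stages; here it is a flat loop."""
    losses = [model_fn(mb) for mb in micro_batches]  # forward only
    if reduction == "avg":
        return sum(losses) / len(losses)
    return losses  # reduction=None: return per-micro-batch losses
```

With a toy `model_fn` such as `lambda x: x * 2.0`, three micro-batches `[1.0, 2.0, 3.0]` yield losses `[2.0, 4.0, 6.0]` and an averaged loss of `4.0`.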

Evaluation vs. Training

Aspect               | Training (train_batch)              | Evaluation (eval_batch)
---------------------|-------------------------------------|---------------------------------
Schedule             | TrainSchedule (1F1B)                | InferenceSchedule (forward-only)
Backward passes      | Yes                                 | No
Gradient computation | Enabled                             | Disabled (torch.no_grad())
Optimizer step       | Yes                                 | No
Gradient reduction   | Yes (ReduceGrads, ReduceTiedGrads)  | No
Total steps          | 2 * (M + S - 1)                     | M + S - 1
Return value         | Aggregated loss                     | Loss (and optional logits)
Pipeline buffers     | min(S - stage_id, M)                | 2 (alternating)

InferenceSchedule Details

The InferenceSchedule executes M + S - 1 total steps (where M is micro-batches and S is stages). At each step:

  1. The first stage loads micro-batch inputs, and the last stage loads labels, from the data iterator.
  2. Even and odd stages alternate send/recv ordering to avoid deadlocks.
  3. Only forward passes are executed — no backward passes, no gradient communication.
  4. The schedule uses only 2 pipeline buffers (alternating), since there are no concurrent forward and backward passes to manage.
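The steps above can be traced with a small sketch (hypothetical helper, not DeepSpeed code): at step `t`, stage `s` forwards micro-batch `t - s` when that index is valid, and is otherwise idle during pipeline fill or drain.

```python
def inference_trace(num_micro_batches, num_stages):
    """Map each (step, stage) pair to the micro-batch forwarded there,
    or None when the stage is idle (pipeline fill/drain)."""
    total_steps = num_micro_batches + num_stages - 1
    trace = {}
    for step in range(total_steps):
        for stage in range(num_stages):
            mb = step - stage  # micro-batch reaching this stage now
            trace[(step, stage)] = mb if 0 <= mb < num_micro_batches else None
    return trace
```

For M = 4 micro-batches and S = 3 stages this gives 6 steps in total, and every micro-batch visits every stage exactly once (12 active (step, stage) slots).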

Output Handling

After the schedule completes, the last stage holds the forward outputs for all micro-batches. The engine can:

  • Reduce outputs: Average the loss across micro-batches using _reduce_outputs() with configurable reduction ('avg' or None).
  • Average across data-parallel ranks: Allreduce the reduced loss across data-parallel groups.
  • Broadcast to all pipeline stages: Send the final loss from the last stage to all other stages via _bcast_pipe_scalar().
  • Return logits: Optionally return raw model outputs (logits) alongside the loss.
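A hedged sketch of the reduce-and-broadcast path (function names mirror the engine's `_reduce_outputs` and `_bcast_pipe_scalar` but these stand-ins are pure Python; the real broadcast is a torch.distributed collective):

```python
def reduce_outputs(micro_losses, reduction="avg"):
    """Reduce last-stage outputs across micro-batches.
    reduction='avg' averages; reduction=None returns them unreduced."""
    if reduction is None:
        return list(micro_losses)
    return sum(micro_losses) / len(micro_losses)

def bcast_pipe_scalar(loss, num_stages):
    """Model broadcasting the final loss from the last stage so every
    pipeline stage ends up holding the same scalar."""
    return [loss] * num_stages
```

For example, micro-batch losses `[1.0, 3.0]` reduce to `2.0`, and broadcasting that over 4 stages leaves each stage with `2.0`.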

Checkpoint Saving for Pipeline Models

Pipeline evaluation is closely related to model checkpointing. The PipelineEngine overrides module_state_dict() to save per-stage layer state dicts rather than a single flat state dict. Each layer is saved as a separate file using save_state_dict(), enabling parallel writes across data-parallel ranks. Loading uses load_state_dir() to read per-layer checkpoint files.
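The per-layer layout can be sketched with an in-memory "store" standing in for the filesystem (the file-name pattern below is illustrative, not the exact names DeepSpeed writes):

```python
def save_layer_state_dicts(layer_states, store):
    """Save each pipeline layer's state dict under its own key,
    mimicking one checkpoint file per layer rather than one flat
    state dict for the whole module."""
    for idx, state in enumerate(layer_states):
        store[f"layer_{idx:02d}-model_states.pt"] = dict(state)
    return store

def load_state_dir(store):
    """Read the per-layer entries back in layer order."""
    return [store[key] for key in sorted(store)]
```

Because each layer lands in its own file, different ranks can write (and later read) their layers in parallel instead of serializing one monolithic checkpoint.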

Theoretical Basis

The pipeline inference schedule executes only forward passes through the pipeline stages. With no backward passes, the schedule is simpler: each micro-batch flows through the stages in order.

Comparison of Schedule Complexity

  • Training schedule: 2 * (M + S - 1) steps, interleaving forward and backward.
  • Inference schedule: M + S - 1 steps, forward-only. The first micro-batch takes S steps to propagate through all stages, and each subsequent micro-batch adds 1 step.
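The two step counts above follow directly from the formulas; a tiny sketch (hypothetical helper names) makes the comparison concrete:

```python
def train_schedule_steps(micro_batches, stages):
    """Training interleaves one forward and one backward per slot."""
    return 2 * (micro_batches + stages - 1)

def inference_schedule_steps(micro_batches, stages):
    """Forward-only: S steps for the first micro-batch to drain
    through the pipeline, plus 1 per remaining micro-batch."""
    return micro_batches + stages - 1
```

With M = 8 micro-batches and S = 4 stages, training takes 22 steps while inference takes 11, exactly half.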

Buffer Efficiency

The inference schedule needs only 2 buffers because at any given time, each stage processes at most one micro-batch in the current step and has at most one pending from the previous step. The alternating buffer strategy (step_id % 2 and (step_id + 1) % 2) ensures no buffer conflicts.
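The alternating indices can be enumerated directly (a sketch of the indexing pattern, not the engine's buffer code):

```python
def alternating_buffers(total_steps):
    """Per-step (current, previous) buffer ids: recv into buffer
    step % 2 while buffer (step + 1) % 2 still holds the activation
    from the previous step."""
    return [(step % 2, (step + 1) % 2) for step in range(total_steps)]
```

Over four steps this yields `[(0, 1), (1, 0), (0, 1), (1, 0)]`: consecutive steps never receive into the same buffer, so two buffers suffice.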

Loss Reduction

The default reduction ('avg') computes:

  1. Sum losses across M micro-batches.
  2. Divide by M (average over micro-batches).
  3. Allreduce across data-parallel ranks and divide by data-parallel world size.

This yields the same expected loss as if the entire batch were processed on a single device.
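The three reduction steps can be checked numerically (a pure-Python stand-in for the allreduce; `pipeline_eval_loss` is a hypothetical name):

```python
def pipeline_eval_loss(per_rank_micro_losses):
    """'avg' reduction: mean over micro-batches on each data-parallel
    rank, then mean over ranks (modeling allreduce followed by
    division by the data-parallel world size)."""
    per_rank = [sum(losses) / len(losses) for losses in per_rank_micro_losses]
    return sum(per_rank) / len(per_rank)
```

With two ranks holding micro-batch losses `[1.0, 2.0]` and `[3.0, 4.0]`, the result is `2.5`, identical to averaging all four losses on a single device, which is the equivalence claimed above (it holds when every rank processes the same number of micro-batches).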


Last updated: 2026-02-09 00:00 GMT
