Principle:Deepspeedai DeepSpeed SP Evaluation Deployment

Overview

How to evaluate models trained with sequence parallelism (SP), optionally bypassing SP communication in eval mode, and how to save checkpoints that can be deployed on a different GPU configuration.

Detailed Description

For evaluation, setting disable_in_eval=True on UlyssesSPAttentionHF skips the all-to-all communication whenever the model is in eval mode, so evaluation runs through the original attention function without SP overhead. For deployment, the standard save_checkpoint() call saves the model weights, which are not sequence-partitioned: SP partitions only activations. SP therefore adds no checkpoint conversion step of its own.

Evaluation Bypass

When disable_in_eval=True is set during register_with_transformers() or direct UlyssesSPAttentionHF construction, the forward method checks module.training at the start of each call. If the module is not in training mode, the original attention function is called directly without any all-to-all communication, rearrangement, or shape assertions. This is particularly important for frameworks like HuggingFace Trainer that may run evaluation with different data distribution assumptions than what SP expects.

The bypass is a simple conditional check:

    if not module.training and self.disable_in_eval:
        return self.attn(module, query, key, value, attention_mask, *args, **kwargs)
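To make the control flow concrete, here is a minimal, self-contained sketch of the same check. ToyModule, ToySPAttention, plain_attention, and the sp_calls counter are hypothetical stand-ins invented for illustration; this is not DeepSpeed's implementation, and the SP path is reduced to a placeholder.

```python
class ToyModule:
    """Stand-in for an attention module; only the `training` flag matters here."""

    def __init__(self, training=True):
        self.training = training


class ToySPAttention:
    """Toy wrapper that mirrors the bypass check (hypothetical, not DeepSpeed code)."""

    def __init__(self, attn, disable_in_eval=False):
        self.attn = attn                        # original attention function
        self.disable_in_eval = disable_in_eval
        self.sp_calls = 0                       # counts trips through the SP path

    def forward(self, module, query, key, value, attention_mask=None, *args, **kwargs):
        # Bypass: in eval mode with the flag set, call the original attention directly.
        if not module.training and self.disable_in_eval:
            return self.attn(module, query, key, value, attention_mask, *args, **kwargs)
        # Placeholder for the SP path (all-to-all, attention, all-to-all back).
        self.sp_calls += 1
        return self.attn(module, query, key, value, attention_mask, *args, **kwargs)


def plain_attention(module, query, key, value, attention_mask=None, *args, **kwargs):
    # Dummy attention that just echoes its inputs.
    return ("attn", query, key, value)


wrapper = ToySPAttention(plain_attention, disable_in_eval=True)
wrapper.forward(ToyModule(training=True), "q", "k", "v")    # takes the SP path
wrapper.forward(ToyModule(training=False), "q", "k", "v")   # bypassed in eval mode
```

With disable_in_eval=True only the training-mode call increments sp_calls; with the flag left at False, eval-mode calls would still go through the SP path.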

Checkpoint Compatibility

Unlike tensor parallelism (where model weights are sharded across GPUs), sequence parallelism only partitions activations during the forward and backward passes. The model weights remain complete and identical on every rank within a data-parallel group. This means:

Checkpoints saved during SP training contain full model weights
No checkpoint conversion or consolidation step is required for deployment
A checkpoint from SP training can be loaded directly for single-GPU inference
The checkpoint can be loaded into a different SP configuration (different sp_size) without modification
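The points above can be illustrated with a small simulation, assuming a single token-wise linear layer and plain Python lists in place of tensors. The helper names (linear, shard_sequence, run_with_sp) are hypothetical. The sketch only shows that one full copy of the weights serves any sp_size; it does not model attention, which is where SP needs all-to-all communication.

```python
def linear(tokens, weight):
    """Apply y = W x to every token (each token is a list of floats)."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def shard_sequence(tokens, sp_size):
    """Split the sequence dimension into sp_size contiguous, equal shards."""
    assert len(tokens) % sp_size == 0, "sequence length must divide evenly"
    step = len(tokens) // sp_size
    return [tokens[i * step:(i + 1) * step] for i in range(sp_size)]


def run_with_sp(tokens, weight, sp_size):
    """Each simulated rank applies the same full weight to its own shard."""
    outputs = []
    for shard in shard_sequence(tokens, sp_size):
        outputs.extend(linear(shard, weight))
    return outputs


# The "checkpoint": one complete weight matrix, never split by SP.
weight = [[1.0, 2.0], [0.5, -1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]

reference = linear(tokens, weight)   # single device, no SP
same_for_all = all(run_with_sp(tokens, weight, n) == reference for n in (1, 2, 4))
```

Because every simulated rank uses the identical weight matrix, changing sp_size changes only how the tokens are split, not what would be stored in a checkpoint.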

Theoretical Basis

In sequence parallelism, the weight tensors W are not partitioned:

Every GPU in an SP group holds the same complete copy of W
Only the activations (input tensors, hidden states, attention QKV, etc.) are partitioned on the sequence dimension
Gradient allreduce across the data-parallel dimension ensures weight updates are synchronized

This is in contrast to tensor parallelism, where W is split across GPUs (e.g., column-parallel or row-parallel), requiring special checkpoint handling to reconstruct the full weights.
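As a rough illustration of the contrast, the sketch below computes what a single rank would hold for one linear layer under each scheme. The function per_rank_shapes and its scheme names are hypothetical, and the sizes are arbitrary example values that are assumed to divide evenly.

```python
def per_rank_shapes(scheme, world, seq_len, hidden, out_features):
    """Return (weight_shape, input_activation_shape) held by one rank."""
    if scheme == "sp":
        # Sequence parallelism: full weight, sequence dimension split.
        return (out_features, hidden), (seq_len // world, hidden)
    if scheme == "tp_column":
        # Column-parallel tensor parallelism: output features split, full sequence.
        return (out_features // world, hidden), (seq_len, hidden)
    raise ValueError(f"unknown scheme: {scheme}")


sp_weight, sp_act = per_rank_shapes("sp", world=4, seq_len=8192, hidden=1024, out_features=4096)
tp_weight, tp_act = per_rank_shapes("tp_column", world=4, seq_len=8192, hidden=1024, out_features=4096)
```

Under the "sp" scheme the weight shape is independent of the number of ranks, which is why a saved state dict does not need to be reassembled; under "tp_column" each rank holds only a slice of the weight.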

Therefore:

save_checkpoint() saves the standard model state dict, which contains complete weights
load_checkpoint() or HuggingFace's from_pretrained() can load these weights directly
SP itself does not make a zero_to_fp32.py conversion necessary for the model weights; whether consolidation is needed depends on the ZeRO stage (optimizer states are sharded under ZeRO, and stage 3 also partitions the parameters themselves)

Related Pages

Implementation:Deepspeedai_DeepSpeed_DeepSpeedEngine_Save_For_SP

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
