Principle:Deepspeedai DeepSpeed SP Evaluation Deployment
Overview
How to evaluate sequence-parallel (SP) models, optionally bypassing SP communication in eval mode, and how to save checkpoints that can be deployed on different GPU configurations.
Detailed Description
For evaluation, setting disable_in_eval=True on UlyssesSPAttentionHF skips the all-to-all communication whenever the module is in eval mode, so evaluation goes through the original attention function without SP overhead; this also allows simple single-GPU inference. For deployment, the standard save_checkpoint() call saves the model weights, which SP never partitions (only activations are split along the sequence dimension during SP). SP checkpoints are therefore directly usable without any SP-specific conversion step.
Evaluation Bypass
When disable_in_eval=True is set, either through register_with_transformers() or by constructing UlyssesSPAttentionHF directly, the forward method checks module.training at the start of each call. If the module is not in training mode, the original attention function is called directly, with no all-to-all communication, rearrangement, or shape assertions. This matters most for frameworks such as HuggingFace Trainer, which may run evaluation with data distribution assumptions that differ from what SP expects.
The bypass is a simple conditional check:
    if not module.training and self.disable_in_eval:
        return self.attn(module, query, key, value, attention_mask, *args, **kwargs)
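The bypass condition can be tried out in isolation with a small toy. This is a minimal sketch, not DeepSpeed's implementation: ToyModule, ToySPAttention, plain_attention, and the sp_calls counter are hypothetical names introduced only for illustration, and the SP path is reduced to a counter increment in place of the real all-to-all logic.

```python
class ToyModule:
    """Hypothetical stand-in for an nn.Module; only the `training` flag matters."""

    def __init__(self):
        self.training = True

    def eval(self):
        self.training = False
        return self

    def train(self):
        self.training = True
        return self


class ToySPAttention:
    """Hypothetical wrapper that mimics the bypass check (not DeepSpeed's class)."""

    def __init__(self, attn, disable_in_eval=False):
        self.attn = attn
        self.disable_in_eval = disable_in_eval
        self.sp_calls = 0  # how many times the simulated SP path ran

    def forward(self, module, query, key, value, attention_mask=None, *args, **kwargs):
        # Same condition as in the text: eval mode plus the opt-in flag.
        if not module.training and self.disable_in_eval:
            return self.attn(module, query, key, value, attention_mask, *args, **kwargs)
        # Placeholder for the SP path (all-to-all, rearrangement, shape checks).
        self.sp_calls += 1
        return self.attn(module, query, key, value, attention_mask, *args, **kwargs)


def plain_attention(module, query, key, value, attention_mask, *args, **kwargs):
    # Dummy attention function; a real one would return attention outputs.
    return "attn-output"


module = ToyModule()
wrapper = ToySPAttention(plain_attention, disable_in_eval=True)

wrapper.forward(module, "q", "k", "v")  # training mode: simulated SP path runs
module.eval()
wrapper.forward(module, "q", "k", "v")  # eval mode: bypassed, counter unchanged
```

With disable_in_eval left at False, the same eval-mode call would still take the simulated SP path, which is why the flag has to be set explicitly.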
Checkpoint Compatibility
Unlike tensor parallelism (where model weights are sharded across GPUs), sequence parallelism only partitions activations during the forward and backward passes. The model weights remain complete and identical on every rank within a data-parallel group. This means:
- Checkpoints saved during SP training contain full model weights
- No checkpoint conversion or consolidation step is required for deployment
- A checkpoint from SP training can be loaded directly for single-GPU inference
- The checkpoint can be loaded into a different SP configuration (different sp_size) without modification
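The points above can be illustrated with a pure-Python simulation. It is a sketch under simplifying assumptions: the "model" is a single token-wise linear layer, ranks are simulated with a loop, and the attention all-to-all is not modeled. The names linear, run_with_sp, checkpoint, and sequence are hypothetical. The same full weight matrix is reused unchanged for several sp_size values, and only the sequence is split.

```python
def linear(tokens, weight):
    """Apply y = x @ W to every token independently (a token-wise layer)."""
    return [
        [sum(x * w for x, w in zip(token, column)) for column in zip(*weight)]
        for token in tokens
    ]


def run_with_sp(tokens, weight, sp_size):
    """Split the sequence into sp_size contiguous shards, one per simulated rank."""
    assert len(tokens) % sp_size == 0, "sequence length must divide evenly"
    shard_len = len(tokens) // sp_size
    outputs = []
    for rank in range(sp_size):
        shard = tokens[rank * shard_len:(rank + 1) * shard_len]
        # Every simulated rank uses the same complete weight matrix.
        outputs.extend(linear(shard, weight))
    return outputs


# A toy "checkpoint": one full 3x2 weight matrix, never sharded.
checkpoint = {"weight": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]}

# A toy sequence of 8 tokens with hidden size 3.
sequence = [[float(i), float(i + 1), float(i + 2)] for i in range(8)]

reference = linear(sequence, checkpoint["weight"])  # single-device result
results = {sp: run_with_sp(sequence, checkpoint["weight"], sp) for sp in (1, 2, 4)}
```

Because the weights are never split, the checkpoint dictionary is passed as-is to every configuration, and each one reproduces the single-device output.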
Theoretical Basis
In sequence parallelism, the weight tensors W are not partitioned:
- Every GPU in an SP group holds the same complete copy of W
- Only the activations (input tensors, hidden states, attention QKV, etc.) are partitioned on the sequence dimension
- Gradient allreduce across the data-parallel dimension ensures weight updates are synchronized
This is in contrast to tensor parallelism, where W is split across GPUs (e.g., column-parallel or row-parallel), requiring special checkpoint handling to reconstruct the full weights.
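To make the contrast concrete, the following sketch compares what each simulated rank would hold under the two schemes. It is an illustration only: split_columns and merge_columns are hypothetical helpers for a column-parallel split, not functions from DeepSpeed or any tensor-parallel library.

```python
def split_columns(weight, tp_size):
    """Column-parallel split: each simulated rank keeps a slice of the output columns."""
    out_dim = len(weight[0])
    assert out_dim % tp_size == 0, "output dimension must divide evenly"
    width = out_dim // tp_size
    return [
        [row[rank * width:(rank + 1) * width] for row in weight]
        for rank in range(tp_size)
    ]


def merge_columns(shards):
    """Rebuild the full weight by concatenating each row's column slices."""
    num_rows = len(shards[0])
    return [sum((shard[i] for shard in shards), []) for i in range(num_rows)]


full_weight = [[1, 2, 3, 4], [5, 6, 7, 8]]  # toy 2x4 weight matrix W

# Tensor parallelism: every rank holds only part of W.
tp_shards = split_columns(full_weight, tp_size=2)

# Sequence parallelism: every rank holds the complete W.
sp_copies = [full_weight for _ in range(2)]

rebuilt = merge_columns(tp_shards)  # the extra step a TP checkpoint needs
```

Under the tensor-parallel split, no single rank's shard equals W and a merge step is required, whereas any one of the sequence-parallel copies can be saved as the full weight directly.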
Therefore:
- save_checkpoint() saves the standard model state dict, which contains complete weights
- load_checkpoint() or HuggingFace's from_pretrained() can load these weights directly
- SP itself never makes a zero_to_fp32.py conversion necessary for the model weights; any consolidation that is still needed comes from ZeRO and depends on the ZeRO stage (optimizer states, and also the parameters under stage 3, which partitions them)
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/
- https://arxiv.org/abs/2309.14509
Last updated: 2026-02-09 00:00 GMT