Implementation:Deepspeedai DeepSpeed DeepSpeedEngine Save For SP

Overview

A concrete tool, provided by the DeepSpeed library, for saving checkpoints of models trained with sequence parallelism (SP) and for configuring the evaluation bypass.

Description

DeepSpeedEngine.save_checkpoint() saves the model through the standard path, because sequence parallelism (SP) does not partition weights. The model weights in the checkpoint are therefore not SP-sharded and can be reused for inference or further training regardless of the SP degree. The disable_in_eval flag on UlyssesSPAttentionHF bypasses the SP all-to-all communication while the module is in evaluation mode, which simplifies inference.

The save process follows the standard DeepSpeed checkpoint flow:

1. Validates the checkpoint tag across all ranks for consistency
2. Creates the checkpoint directory structure
3. Saves the model state dict (complete weights, not SP-partitioned)
4. Saves the optimizer state (partitioned by ZeRO stage across DP ranks)
5. Saves ZeRO checkpoint files if applicable
6. Writes a latest file pointing to the most recent checkpoint tag
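The tag validation step can be illustrated without a distributed runtime. The sketch below is a single-process simulation, not DeepSpeed's code: it assumes each rank hashes its tag and that the hashes are compared through min/max reductions. The helper names and the list-of-tags input are hypothetical stand-ins for per-rank state and collective communication.

```python
import hashlib
from typing import Sequence


def tag_digest(tag: str) -> bytes:
    """Hash a checkpoint tag into fixed-size bytes that are easy to compare."""
    return hashlib.sha1(tag.encode("utf-8")).digest()


def simulate_tag_validation(tags_by_rank: Sequence[str]) -> bool:
    """Return True when every simulated rank proposes the same tag.

    min() and max() over the digests stand in for MIN/MAX all-reduce
    collectives: if the smallest and largest digests match, all match.
    """
    digests = [tag_digest(tag) for tag in tags_by_rank]
    return min(digests) == max(digests)


# Four simulated ranks that agree, then four where one rank disagrees.
consistent = simulate_tag_validation(["step_10000"] * 4)
inconsistent = simulate_tag_validation(
    ["step_10000", "step_10000", "step_9999", "step_10000"]
)
print(consistent, inconsistent)  # True False
```

Comparing fixed-size digests rather than raw strings keeps the exchanged payload the same length on every rank, which is convenient for collective operations.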

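Because SP partitions only activations, not weights, the model state dict saved by any rank within a DP group is identical and complete as far as SP is concerned. No SP-specific checkpoint merging or conversion is therefore needed for the model portion. Parameter partitioning under ZeRO stage 3 is a separate mechanism and is unaffected by SP.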

The disable_in_eval behavior is configured when UlyssesSPAttentionHF is set up, either via register_with_transformers(disable_in_eval=True) or by constructing the class directly. At forward time, the check at lines L255-257 of ulysses_sp.py determines whether to bypass SP:

if not module.training and self.disable_in_eval:
    return self.attn(module, query, key, value, attention_mask, *args, **kwargs)
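To make the control flow concrete, here is a minimal, dependency-free sketch of the same condition. It is a toy model, not the DeepSpeed class: `SPAttentionSketch`, `dummy_attn`, and the `path_taken` log are hypothetical, and the "sp" branch only marks where the all-to-all exchange would occur.

```python
from types import SimpleNamespace


class SPAttentionSketch:
    """Toy stand-in for an SP attention wrapper; not the DeepSpeed class."""

    def __init__(self, attn, disable_in_eval=False):
        self.attn = attn                      # core attention callable
        self.disable_in_eval = disable_in_eval
        self.path_taken = []                  # records which path each call used

    def forward(self, module, query, key, value, attention_mask=None):
        # Same condition as the check quoted above.
        if not module.training and self.disable_in_eval:
            self.path_taken.append("bypass")
            return self.attn(module, query, key, value, attention_mask)
        # Placeholder: a real wrapper would run all-to-all exchanges
        # before and after the core attention call here.
        self.path_taken.append("sp")
        return self.attn(module, query, key, value, attention_mask)


def dummy_attn(module, query, key, value, attention_mask):
    """Hypothetical core attention that simply echoes the query."""
    return query


wrapper = SPAttentionSketch(dummy_attn, disable_in_eval=True)
train_module = SimpleNamespace(training=True)
eval_module = SimpleNamespace(training=False)

wrapper.forward(train_module, [1.0], [1.0], [1.0])
wrapper.forward(eval_module, [1.0], [1.0], [1.0])
print(wrapper.path_taken)  # ['sp', 'bypass']
```

With `disable_in_eval=False`, both calls would take the "sp" path, since the bypass requires the flag and evaluation mode together.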

Code Reference

Repository: https://github.com/deepspeedai/DeepSpeed
File: deepspeed/runtime/engine.py (L3695-3789, save_checkpoint), deepspeed/runtime/sequence_parallel/ulysses_sp.py (L255-258, disable_in_eval logic)

save_checkpoint Signature

def save_checkpoint(
    self,
    save_dir: str,
    tag: str = None,
    client_state: dict = {},
    save_latest: bool = True,
    exclude_frozen_parameters: bool = False,
) -> bool

Import

# Accessed via the DeepSpeed engine object
engine.save_checkpoint(save_dir, tag)

I/O Contract

Inputs

Parameter	Type	Required	Description
save_dir	str	Yes	Directory path for saving checkpoint files
tag	str	No	Unique identifier for the checkpoint; defaults to `global_step{N}`
client_state	dict	No	Additional training state to save (e.g., epoch, custom metrics)
save_latest	bool	No	Whether to write a `latest` file pointing to this checkpoint (default: `True`)
exclude_frozen_parameters	bool	No	Whether to exclude frozen parameters from the saved state (default: `False`)
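As an illustration of the `tag` and `client_state` parameters, the sketch below wraps the call in a small helper. `save_training_state` and `StubEngine` are hypothetical: the stub only records its arguments so the example runs without a GPU or a DeepSpeed install, and a real engine would take its place in practice.

```python
def save_training_state(engine, save_dir, epoch, best_loss):
    """Save a checkpoint and attach resumable training metadata."""
    client_state = {"epoch": epoch, "best_loss": best_loss}
    # tag is left unset, so the engine falls back to its default tag.
    return engine.save_checkpoint(save_dir, client_state=client_state)


class StubEngine:
    """Records save_checkpoint arguments; stands in for a DeepSpeed engine."""

    def __init__(self):
        self.calls = []

    def save_checkpoint(self, save_dir, tag=None, client_state=None,
                        save_latest=True, exclude_frozen_parameters=False):
        self.calls.append({"save_dir": save_dir, "tag": tag,
                           "client_state": client_state})
        return True


engine = StubEngine()
ok = save_training_state(engine, "sp_checkpoints/", epoch=3, best_loss=1.87)
print(ok, engine.calls[0]["client_state"])
```

Keeping `client_state` to small, plain Python values such as counters and scalar metrics makes it easy to read back when the checkpoint is loaded for resumption.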

Outputs

Output	Type	Description
success	bool	Returns `True` on successful save
Checkpoint files	files on disk	Complete model weights (no SP-specific conversion needed), optimizer state, and metadata
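To show how the `latest` file can be consumed, the sketch below resolves the most recent checkpoint directory from it. It assumes `latest` is a plain-text file inside `save_dir` that contains only the tag. `resolve_latest_checkpoint` is a hypothetical helper, and the temporary directory merely imitates the on-disk layout.

```python
import os
import tempfile


def resolve_latest_checkpoint(save_dir):
    """Return the checkpoint directory named by save_dir/latest, or None."""
    latest_path = os.path.join(save_dir, "latest")
    if not os.path.isfile(latest_path):
        return None
    with open(latest_path, "r", encoding="utf-8") as f:
        tag = f.read().strip()
    return os.path.join(save_dir, tag)


# Imitate the checkpoint layout in a temporary directory.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "step_10000"))
with open(os.path.join(demo_dir, "latest"), "w", encoding="utf-8") as f:
    f.write("step_10000")

resolved = resolve_latest_checkpoint(demo_dir)
print(os.path.basename(resolved))  # step_10000
```

Returning `None` when no `latest` file exists lets a training script distinguish a fresh run from a resumed one, for example when checkpoints were saved with `save_latest=False`.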

Evaluation Bypass Configuration

Parameter	Where Set	Description
disable_in_eval	`UlyssesSPAttentionHF.__init__` or `register_with_transformers()`	When `True`, SP all-to-all communication is skipped during `model.eval()`

Usage Example

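import torch
from transformers import AutoModelForCausalLM

# During training: save a checkpoint (weights are not sequence-partitioned)
engine.save_checkpoint("sp_checkpoints/", tag="step_10000")

# For single-GPU inference with HuggingFace: from_pretrained() expects the
# HuggingFace layout (config.json plus weight files), which a DeepSpeed
# checkpoint directory does not contain, so export the wrapped model first.
# Assumes engine.module is a HuggingFace PreTrainedModel and ZeRO stage < 3;
# "sp_hf_export/" is an illustrative path. Run the export on one rank only.
engine.module.save_pretrained("sp_hf_export/")
model = AutoModelForCausalLM.from_pretrained("sp_hf_export/")
# No SP-specific conversion needed

# Or evaluate mid-training with SP bypassed
# (requires disable_in_eval=True in register_with_transformers)
engine.eval()  # Sets module.training = False; SP all-to-all is skipped
with torch.no_grad():
    outputs = engine(eval_batch)

# Resume training
engine.train()  # SP all-to-all resumes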

Related Pages

Principle:Deepspeedai_DeepSpeed_SP_Evaluation_Deployment

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
