Implementation:Deepspeedai DeepSpeed DeepSpeedEngine Save For SP

From Leeroopedia


Overview

A concrete tool, provided by the DeepSpeed library, for saving checkpoints of models trained with sequence parallelism (SP) and for configuring the evaluation bypass.

Description

DeepSpeedEngine.save_checkpoint() saves the model through the standard path, because sequence parallelism (SP) partitions activations rather than weights. As far as SP is concerned, the checkpoint therefore contains complete model weights, and no SP-specific merging is needed before loading them for inference or further training on a different GPU configuration. The disable_in_eval flag on UlyssesSPAttentionHF bypasses the SP all-to-all communication during evaluation, which simplifies inference.

The save process follows the standard DeepSpeed checkpoint flow:

  1. Validates the checkpoint tag across all ranks for consistency
  2. Creates the checkpoint directory structure
  3. Saves model state dict (complete weights, not SP-partitioned)
  4. Saves optimizer state (partitioned by ZeRO stage across DP ranks)
  5. Saves ZeRO checkpoint files if applicable
  6. Writes a latest file pointing to the most recent checkpoint tag
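
The six steps above can be sketched as a single-process toy. This is not DeepSpeed's implementation: the function name toy_save_checkpoint, the JSON file names, and the rank_tags argument (standing in for the cross-rank tag comparison) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def toy_save_checkpoint(save_dir, tag, model_state, optimizer_shards,
                        rank_tags, save_latest=True):
    """Toy imitation of the checkpoint flow; file names are hypothetical."""
    # Step 1: every rank must agree on the tag before anything is written.
    if any(t != tag for t in rank_tags):
        raise ValueError(f"checkpoint tag mismatch across ranks: {sorted(set(rank_tags))}")

    # Step 2: one sub-directory per tag.
    ckpt_dir = os.path.join(save_dir, tag)
    os.makedirs(ckpt_dir, exist_ok=True)

    # Step 3: the model state is written once and is complete.
    with open(os.path.join(ckpt_dir, "model_states.json"), "w") as f:
        json.dump(model_state, f)

    # Steps 4-5: optimizer state is written as one shard per data-parallel rank.
    for dp_rank, shard in enumerate(optimizer_shards):
        with open(os.path.join(ckpt_dir, f"optim_states_dp{dp_rank}.json"), "w") as f:
            json.dump(shard, f)

    # Step 6: the "latest" file records the most recent tag.
    if save_latest:
        with open(os.path.join(save_dir, "latest"), "w") as f:
            f.write(tag)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        toy_save_checkpoint(d, "global_step10", {"w": [1.0, 2.0]},
                            [{"m": [0.1]}, {"m": [0.2]}], ["global_step10"] * 2)
        print(sorted(os.listdir(os.path.join(d, "global_step10"))))
```

The point of the sketch is the asymmetry between step 3 and steps 4-5: the model state is one complete file, while the optimizer state is split per data-parallel rank.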

Because SP only partitions activations and not weights, the model state dict saved by any rank within a DP group is identical and complete. This means no special checkpoint merging or conversion is needed for the model portion.
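
To make the point about activations versus weights concrete, here is a minimal simulation in plain Python. It assumes a toy state dict of Python floats and four simulated SP ranks; the helper names state_fingerprint and shard_sequence are hypothetical and not part of DeepSpeed.

```python
import hashlib
import json


def state_fingerprint(state_dict):
    """Stable hash of a toy state dict made of plain Python numbers."""
    payload = json.dumps(state_dict, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def shard_sequence(tokens, sp_world_size):
    """Split one sequence into equal contiguous chunks, one per SP rank."""
    if len(tokens) % sp_world_size != 0:
        raise ValueError("sequence length must be divisible by sp_world_size")
    chunk = len(tokens) // sp_world_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(sp_world_size)]


weights = {"proj.weight": [0.5, -1.0, 2.0], "proj.bias": [0.1]}
tokens = list(range(8))

# Each simulated SP rank receives a different slice of the sequence...
shards = shard_sequence(tokens, sp_world_size=4)
# ...but holds the same, complete copy of the weights.
per_rank_state = [dict(weights) for _ in shards]

# One distinct fingerprint means every rank would save the same model state.
fingerprints = {state_fingerprint(s) for s in per_rank_state}
```

Only the inputs differ per rank, so whichever rank writes the model state produces the same file.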

The disable_in_eval behavior is configured when UlyssesSPAttentionHF is set up, either via register_with_transformers(disable_in_eval=True) or by constructing the class directly. At forward time, the check at lines 255-257 of ulysses_sp.py decides whether to bypass SP:

if not module.training and self.disable_in_eval:
    return self.attn(module, query, key, value, attention_mask, *args, **kwargs)
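
The effect of this check can be illustrated with a self-contained toy. ToyModule and ToySPAttention are hypothetical stand-ins, not DeepSpeed classes, and the all-to-all communication is only counted, never performed.

```python
class ToyModule:
    """Stand-in for an nn.Module: only the `training` flag matters here."""

    def __init__(self):
        self.training = True

    def train(self):
        self.training = True
        return self

    def eval(self):
        self.training = False
        return self


class ToySPAttention:
    """Hypothetical wrapper mirroring the bypass check shown above."""

    def __init__(self, attn, disable_in_eval=False):
        self.attn = attn
        self.disable_in_eval = disable_in_eval
        self.all_to_all_calls = 0

    def forward(self, module, query, key, value):
        if not module.training and self.disable_in_eval:
            # Eval bypass: call the wrapped attention directly, no communication.
            return self.attn(module, query, key, value)
        # SP path: count placeholder exchanges before and after core attention.
        self.all_to_all_calls += 2
        return self.attn(module, query, key, value)


def core_attn(module, query, key, value):
    """Dummy core attention that just echoes its inputs."""
    return ("attn", query, key, value)


module = ToyModule()
sp_attn = ToySPAttention(core_attn, disable_in_eval=True)

sp_attn.forward(module, "q", "k", "v")   # training: SP path is taken
module.eval()
sp_attn.forward(module, "q", "k", "v")   # eval: bypassed, counter unchanged
```

With disable_in_eval=False, the same eval-mode call would still go through the SP path.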

Code Reference

  • Repository: https://github.com/deepspeedai/DeepSpeed
  • File: deepspeed/runtime/engine.py (L3695-3789, save_checkpoint)
  • File: deepspeed/runtime/sequence_parallel/ulysses_sp.py (L255-258, disable_in_eval logic)

save_checkpoint Signature

def save_checkpoint(
    self,
    save_dir: str,
    tag: str = None,
    client_state: dict = {},
    save_latest: bool = True,
    exclude_frozen_parameters: bool = False,
) -> bool

Import

# Accessed via the DeepSpeed engine object
engine.save_checkpoint(save_dir, tag)

I/O Contract

Inputs

Parameter | Type | Required | Description
save_dir | str | Yes | Directory path for saving checkpoint files
tag | str | No | Unique identifier for the checkpoint; defaults to global_step{N}
client_state | dict | No | Additional training state to save (e.g., epoch, custom metrics)
save_latest | bool | No | Whether to write a latest file pointing to this checkpoint (default: True)
exclude_frozen_parameters | bool | No | Whether to exclude frozen parameters from the saved state (default: False)

Outputs

Output | Type | Description
success | bool | True on successful save
Checkpoint files | files on disk | Complete model weights (no SP-specific conversion needed), optimizer state, and metadata
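
As a sketch of how tag and client_state fit together, the helpers below wrap the save call and the matching engine.load_checkpoint() call, which returns a (load_path, client_state) pair. The helper names save_with_progress and resume_progress are hypothetical, and engine is assumed to be an already-initialized DeepSpeed engine.

```python
def save_with_progress(engine, save_dir, step, epoch):
    """Save a checkpoint tagged by step and stash progress in client_state.

    Like save_checkpoint itself, this should be called on every rank.
    """
    tag = f"step_{step}"
    ok = engine.save_checkpoint(
        save_dir,
        tag=tag,
        client_state={"epoch": epoch, "step": step},
    )
    return tag if ok else None


def resume_progress(engine, load_dir, tag=None):
    """Load a checkpoint and return (epoch, step), or None if nothing loaded."""
    # With tag=None, the tag recorded in the "latest" file is used.
    load_path, client_state = engine.load_checkpoint(load_dir, tag=tag)
    if load_path is None:
        return None
    return client_state.get("epoch"), client_state.get("step")
```

A training loop could call save_with_progress(engine, "sp_checkpoints/", step, epoch) at each save interval and resume_progress(engine, "sp_checkpoints/") once at startup.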

Evaluation Bypass Configuration

Parameter | Where Set | Description
disable_in_eval | UlyssesSPAttentionHF.__init__ or register_with_transformers() | When True, SP all-to-all communication is skipped while the model is in eval mode (after model.eval())

Usage Example

import torch
from transformers import AutoModelForCausalLM

# During training: save a checkpoint (weights are complete, not sequence-partitioned).
# Call this on every rank, not only rank 0.
engine.save_checkpoint("sp_checkpoints/", tag="step_10000")

# For single-GPU inference: load with Hugging Face.
# Note: from_pretrained() expects a Hugging Face-format directory
# (config.json plus weight files), so this assumes the weights under
# sp_checkpoints/step_10000 have been exported to that format.
model = AutoModelForCausalLM.from_pretrained("sp_checkpoints/step_10000")
# No SP-specific conversion is needed.

# Or evaluate inside the training process with SP bypassed
# (requires disable_in_eval=True during register_with_transformers)
engine.eval()  # Sets module.training = False; SP all-to-all is skipped
with torch.no_grad():
    outputs = engine(eval_batch)

# Resume training
engine.train()  # SP all-to-all resumes

Last updated: 2026-02-09 00:00 GMT
