Principle:Microsoft DeepSpeedExamples Distributed Checkpoint Saving

Metadata

Field	Value
Page Type	Principle
Title	Distributed_Checkpoint_Saving
Repository	Microsoft/DeepSpeedExamples
Domains	Distributed_Training, Model_Serialization
Status	Active
Related Implementation	Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Save_Checkpoint

Overview

A technique for saving model checkpoints from distributed training environments where model parameters are partitioned across devices.

Description

In ZeRO Stage 3 training, model parameters are partitioned across all participating GPUs. Each GPU holds only 1/N of the total parameters, gradients, and optimizer states. This partitioning creates a fundamental challenge for checkpoint saving: no single device has the complete model.

DeepSpeed's save_checkpoint method handles this complexity automatically. It provides two approaches:

Partitioned checkpoint -- Each rank saves its own partition of the model state. This is the default behavior and is optimal for training resumption because it avoids the overhead of gathering all parameters.
Consolidated checkpoint -- All parameters are gathered to a single rank (typically rank 0) and saved as a complete model. This is needed for inference or deployment but requires additional memory and communication.

In addition to model weights, the checkpoint saving process should also save:

Tokenizer -- Using tokenizer.save_pretrained() to preserve the tokenizer configuration and vocabulary files alongside the model.
Optimizer states -- Saved automatically by DeepSpeed for training resumption.
Training state -- Step count, learning rate schedule state, and other training metadata.

Theoretical Basis

ZeRO-3 Parameter Distribution

With ZeRO Stage 3, each rank holds 1/N of parameters. The distribution follows this pattern:

Total parameters: P
Number of GPUs: N
Parameters per GPU: P/N (approximately, with padding for alignment)

When saving a checkpoint, the following scenarios arise:

Scenario	Approach	Memory Required	Communication
Training resumption	Partitioned save	P/N per rank	None (each rank saves locally)
Inference/deployment	Consolidated save	Full P on rank 0	All-gather across N ranks
Cross-framework export	Consolidated + conversion	Full P on rank 0	All-gather + format conversion

Rank-0 Coordination

Checkpoint saving is typically coordinated by rank 0:

Rank 0 initiates the save operation and creates the output directory.
All ranks participate in any necessary communication (e.g., parameter gathering).
Rank 0 (or all ranks for partitioned saves) writes the checkpoint files.

Using a rank check (dist.get_rank() == 0) before the save call ensures that only one process creates the directory and saves the tokenizer, while the DeepSpeed engine internally coordinates the model state saving across all ranks.

Checkpoint Contents

A DeepSpeed checkpoint directory typically contains:

File/Directory	Description
`mp_rank_00_model_states.pt`	Model parameter state dict (one per rank in partitioned mode)
`zero_pp_rank_0_mp_rank_00_optim_states.pt`	Optimizer state (partitioned across ranks)
`tokenizer.json`	Tokenizer configuration (from `save_pretrained`)
`tokenizer_config.json`	Tokenizer settings
`special_tokens_map.json`	Special token mappings
`latest`	Pointer to the latest checkpoint tag

Checkpoint Saving Pattern

The recommended pattern for saving checkpoints in ZeRO-3 training:

Check if saving is enabled and the current rank is 0.
Create the output directory if it does not exist.
Call model_engine.save_checkpoint(output_dir) to save model and optimizer states.
Call tokenizer.save_pretrained(output_dir) to save tokenizer files.
Handle any exceptions to prevent training failure from a save error.

if save_checkpoint and dist.get_rank() == 0:
    try:
        os.makedirs(output_dir, exist_ok=True)
        model_engine.save_checkpoint(output_dir)
        tokenizer.save_pretrained(output_dir)
    except Exception as e:
        logger.error(f"Error saving model: {e}")

Error Handling

Checkpoint saving is wrapped in a try-except block to prevent training failures due to I/O errors. Common failure modes include:

Disk space exhaustion -- Large models can produce checkpoints of 100+ GB.
Permission errors -- Output directory may not be writable.
NFS/distributed filesystem issues -- Network filesystem timeouts or locks.

By catching exceptions and logging the error, the training can continue even if the checkpoint save fails. This is particularly important for long-running jobs where losing training progress would be costly.

Usage Pattern

Enable checkpoint saving via the --save_checkpoint command-line flag.
Specify the output directory via --output_dir.
After training completes, the save logic executes on rank 0.
The saved checkpoint can be loaded for:
- Training resumption -- Using model_engine.load_checkpoint()
- Inference -- Using DeepSpeed's checkpoint conversion tools to produce a standard HuggingFace model

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment