Principle:Microsoft DeepSpeedExamples Distributed Checkpoint Saving
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Distributed_Checkpoint_Saving |
| Repository | Microsoft/DeepSpeedExamples |
| Domains | Distributed_Training, Model_Serialization |
| Status | Active |
| Related Implementation | Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Save_Checkpoint |
Overview
A technique for saving model checkpoints from distributed training environments where model parameters are partitioned across devices.
Description
In ZeRO Stage 3 training, model parameters are partitioned across all participating GPUs. Each GPU holds only 1/N of the total parameters, gradients, and optimizer states. This partitioning creates a fundamental challenge for checkpoint saving: no single device has the complete model.
DeepSpeed's save_checkpoint method handles this complexity automatically. It provides two approaches:
- Partitioned checkpoint -- Each rank saves its own partition of the model state. This is the default behavior and is optimal for training resumption because it avoids the overhead of gathering all parameters.
- Consolidated checkpoint -- All parameters are gathered to a single rank (typically rank 0) and saved as a complete model. This is needed for inference or deployment but requires additional memory and communication.
In addition to model weights, the checkpoint saving process should also save:
- Tokenizer -- Using
tokenizer.save_pretrained()to preserve the tokenizer configuration and vocabulary files alongside the model. - Optimizer states -- Saved automatically by DeepSpeed for training resumption.
- Training state -- Step count, learning rate schedule state, and other training metadata.
Theoretical Basis
ZeRO-3 Parameter Distribution
With ZeRO Stage 3, each rank holds 1/N of parameters. The distribution follows this pattern:
Total parameters: P Number of GPUs: N Parameters per GPU: P/N (approximately, with padding for alignment)
When saving a checkpoint, the following scenarios arise:
| Scenario | Approach | Memory Required | Communication |
|---|---|---|---|
| Training resumption | Partitioned save | P/N per rank | None (each rank saves locally) |
| Inference/deployment | Consolidated save | Full P on rank 0 | All-gather across N ranks |
| Cross-framework export | Consolidated + conversion | Full P on rank 0 | All-gather + format conversion |
Rank-0 Coordination
Checkpoint saving is typically coordinated by rank 0:
- Rank 0 initiates the save operation and creates the output directory.
- All ranks participate in any necessary communication (e.g., parameter gathering).
- Rank 0 (or all ranks for partitioned saves) writes the checkpoint files.
Using a rank check (dist.get_rank() == 0) before the save call ensures that only one process creates the directory and saves the tokenizer, while the DeepSpeed engine internally coordinates the model state saving across all ranks.
Checkpoint Contents
A DeepSpeed checkpoint directory typically contains:
| File/Directory | Description |
|---|---|
mp_rank_00_model_states.pt |
Model parameter state dict (one per rank in partitioned mode) |
zero_pp_rank_0_mp_rank_00_optim_states.pt |
Optimizer state (partitioned across ranks) |
tokenizer.json |
Tokenizer configuration (from save_pretrained)
|
tokenizer_config.json |
Tokenizer settings |
special_tokens_map.json |
Special token mappings |
latest |
Pointer to the latest checkpoint tag |
Checkpoint Saving Pattern
The recommended pattern for saving checkpoints in ZeRO-3 training:
- Check if saving is enabled and the current rank is 0.
- Create the output directory if it does not exist.
- Call
model_engine.save_checkpoint(output_dir)to save model and optimizer states. - Call
tokenizer.save_pretrained(output_dir)to save tokenizer files. - Handle any exceptions to prevent training failure from a save error.
if save_checkpoint and dist.get_rank() == 0:
try:
os.makedirs(output_dir, exist_ok=True)
model_engine.save_checkpoint(output_dir)
tokenizer.save_pretrained(output_dir)
except Exception as e:
logger.error(f"Error saving model: {e}")
Error Handling
Checkpoint saving is wrapped in a try-except block to prevent training failures due to I/O errors. Common failure modes include:
- Disk space exhaustion -- Large models can produce checkpoints of 100+ GB.
- Permission errors -- Output directory may not be writable.
- NFS/distributed filesystem issues -- Network filesystem timeouts or locks.
By catching exceptions and logging the error, the training can continue even if the checkpoint save fails. This is particularly important for long-running jobs where losing training progress would be costly.
Usage Pattern
- Enable checkpoint saving via the
--save_checkpointcommand-line flag. - Specify the output directory via
--output_dir. - After training completes, the save logic executes on rank 0.
- The saved checkpoint can be loaded for:
- Training resumption -- Using
model_engine.load_checkpoint() - Inference -- Using DeepSpeed's checkpoint conversion tools to produce a standard HuggingFace model
- Training resumption -- Using