Principle:Microsoft DeepSpeedExamples Distributed Checkpoint Saving


Metadata

Field                   Value
Page Type               Principle
Title                   Distributed_Checkpoint_Saving
Repository              Microsoft/DeepSpeedExamples
Domains                 Distributed_Training, Model_Serialization
Status                  Active
Related Implementation  Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Save_Checkpoint

Overview

A technique for saving model checkpoints from distributed training environments where model parameters are partitioned across devices.

Description

In ZeRO Stage 3 training, model parameters are partitioned across all participating GPUs. Each GPU holds only 1/N of the total parameters, gradients, and optimizer states. This partitioning creates a fundamental challenge for checkpoint saving: no single device has the complete model.

DeepSpeed's save_checkpoint method handles the partitioned case automatically, and DeepSpeed as a whole supports two approaches:

  • Partitioned checkpoint -- Each rank saves its own partition of the model state. This is what save_checkpoint does by default, and it is the best fit for training resumption because it avoids the overhead of gathering all parameters.
  • Consolidated checkpoint -- All parameters are gathered to a single rank (typically rank 0) and saved as a complete model. This is needed for inference or deployment but requires additional memory and communication. Under ZeRO-3 it is produced separately from save_checkpoint, for example with save_16bit_model or the zero_to_fp32.py conversion script.
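
As a rough sketch of how the two approaches might sit side by side, the helper below writes a partitioned checkpoint and then asks the engine for a consolidated 16-bit export. It assumes `engine` is an initialized DeepSpeed engine and that the ZeRO config enables `stage3_gather_16bit_weights_on_model_save`; the helper name and the directory arguments are hypothetical and not taken from the repository.

```python
def save_partitioned_and_consolidated(engine, ckpt_dir, export_dir, tag=None):
    """Write a resumable checkpoint plus a consolidated 16-bit export.

    `engine` is assumed to be a DeepSpeed engine (hypothetical helper).
    Both calls below are collective, so every rank must run this function.
    """
    # Partitioned save: each rank writes its own shard under ckpt_dir/<tag>.
    engine.save_checkpoint(ckpt_dir, tag=tag)

    # Consolidated save: weights are gathered and written by rank 0.
    # Under ZeRO-3 this returns False when the gather option is disabled
    # in the DeepSpeed config, so the result is passed back to the caller.
    exported = engine.save_16bit_model(export_dir)
    return exported
```

Keeping the two outputs in separate directories is a design choice of this sketch: the partitioned checkpoint stays usable for resumption, and the export can be shipped on its own.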

In addition to model weights, the checkpoint saving process should also save:

  • Tokenizer -- Using tokenizer.save_pretrained() to preserve the tokenizer configuration and vocabulary files alongside the model.
  • Optimizer states -- Saved automatically by DeepSpeed for training resumption.
  • Training state -- Step count, learning rate schedule state, and other training metadata.
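
One way to carry the training state along is the `client_state` argument of save_checkpoint. The sketch below is illustrative only: the dictionary keys, the helper names, and the `step<N>` tag format are assumptions, and `engine` is assumed to be a DeepSpeed engine.

```python
def build_client_state(step, epoch):
    """Collect training metadata to store next to the model state.

    The keys are illustrative; DeepSpeed stores the dict with the
    checkpoint and hands the entries back on load.
    """
    return {"step": step, "epoch": epoch}


def save_with_training_state(engine, save_dir, step, epoch):
    """Save a checkpoint tagged by step; must be called on every rank."""
    client_state = build_client_state(step, epoch)
    # tag format "step<N>" is an assumption of this sketch
    engine.save_checkpoint(save_dir, tag=f"step{step}", client_state=client_state)
    return client_state
```

On resumption, `load_path, client_state = engine.load_checkpoint(save_dir)` returns the stored entries, so the training loop can pick up the step counter from `client_state["step"]`.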

Theoretical Basis

ZeRO-3 Parameter Distribution

With ZeRO Stage 3, each rank holds 1/N of parameters. The distribution follows this pattern:

Total parameters: P
Number of GPUs: N
Parameters per GPU: P/N (approximately, with padding for alignment)
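
The arithmetic above can be sketched as follows. This is a simplified whole-model view under stated assumptions: real ZeRO-3 partitions and pads each parameter tensor individually, and the `alignment` argument is a hypothetical knob used only to illustrate padding.

```python
def partition_numel(total_params, world_size, alignment=1):
    """Elements held per rank when P parameters are split across N ranks.

    Rounds up so that world_size * partition covers all parameters, then
    rounds up again to a multiple of `alignment` (illustrative padding).
    """
    per_rank = -(-total_params // world_size)      # ceil(P / N)
    return -(-per_rank // alignment) * alignment   # round up to alignment


def padding_numel(total_params, world_size, alignment=1):
    """Total padding elements added across all ranks."""
    return partition_numel(total_params, world_size, alignment) * world_size - total_params
```

For example, `partition_numel(10, 4)` gives 3 elements per rank, which means 2 padding elements overall.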

When saving a checkpoint, the following scenarios arise:

Scenario                Approach                   Memory Required   Communication
Training resumption     Partitioned save           P/N per rank      None (each rank saves locally)
Inference/deployment    Consolidated save          Full P on rank 0  All-gather across N ranks
Cross-framework export  Consolidated + conversion  Full P on rank 0  All-gather + format conversion
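
To make the memory column concrete, here is a back-of-the-envelope estimate for model weights only. It assumes 2 bytes per parameter (16-bit weights) and ignores optimizer states and padding; the function name and the 7-billion-parameter figure in the usage note are illustrative.

```python
def weight_bytes(total_params, world_size, bytes_per_param=2):
    """Estimate weight memory for the two save modes (weights only)."""
    full = total_params * bytes_per_param
    per_rank = -(-total_params // world_size) * bytes_per_param  # ceil(P/N) * size
    return {"partitioned_per_rank": per_rank, "consolidated_rank0": full}
```

With 7,000,000,000 parameters on 8 GPUs, this gives 1.75 GB per rank for a partitioned save versus 14 GB on rank 0 for a consolidated save.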

Rank-0 Coordination

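Checkpoint saving is typically coordinated by rank 0, but it is not performed by rank 0 alone:

  • Rank 0 creates the output directory and writes files that are identical on every rank, such as the tokenizer.
  • All ranks participate in any necessary communication (e.g., parameter gathering).
  • Rank 0 (for consolidated saves) or every rank (for partitioned saves) writes the checkpoint files.

A rank check (dist.get_rank() == 0) should therefore guard only the single-writer steps: creating the directory and saving the tokenizer. The save_checkpoint call itself must be made by every rank, because the DeepSpeed engine coordinates the model state saving across all ranks internally. If only rank 0 calls it, the other ranks never write their partitions and rank 0 can hang waiting to synchronize with them.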

Checkpoint Contents

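A DeepSpeed checkpoint directory typically contains the files below. The model and optimizer state files live in a tag subdirectory (by default global_step<N>), while latest and the tokenizer files sit at the top level of the output directory.

File/Directory                             Description
mp_rank_00_model_states.pt                 Model state dict; under ZeRO-3 there is one file per rank, named zero_pp_rank_<rank>_mp_rank_00_model_states.pt
zero_pp_rank_0_mp_rank_00_optim_states.pt  Optimizer state (partitioned across ranks, one file per rank)
tokenizer.json                             Tokenizer definition and vocabulary (from save_pretrained)
tokenizer_config.json                      Tokenizer settings
special_tokens_map.json                    Special token mappings
latest                                     Text file holding the tag of the latest checkpoint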
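
As a small illustration of how the latest pointer can be used, the sketch below reads the tag from the file and builds the path of the matching tag subdirectory. The function name is hypothetical, and the layout assumed is the one described above: a plain-text latest file next to the tag directories.

```python
import os


def resolve_latest_tag_dir(save_dir):
    """Return the directory of the most recent checkpoint under save_dir.

    Assumes save_dir contains a text file named "latest" whose content
    is the tag (subdirectory name) of the newest checkpoint.
    """
    latest_path = os.path.join(save_dir, "latest")
    with open(latest_path, "r") as f:
        tag = f.read().strip()  # tolerate a trailing newline
    return os.path.join(save_dir, tag)
```

A FileNotFoundError from this helper is a quick signal that no checkpoint has been written to the directory yet.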

Checkpoint Saving Pattern

The recommended pattern for saving checkpoints in ZeRO-3 training:

  1. Check if saving is enabled.
  2. Create the output directory if it does not exist.
  3. Call model_engine.save_checkpoint(output_dir) on every rank to save model and optimizer states.
  4. On rank 0 only, call tokenizer.save_pretrained(output_dir) to save tokenizer files.
  5. Handle any exceptions to prevent training failure from a save error.

import os
import torch.distributed as dist

if save_checkpoint:
    try:
        # exist_ok=True makes this safe to run on every rank
        os.makedirs(output_dir, exist_ok=True)
        # Collective call: every rank must reach it, not only rank 0
        model_engine.save_checkpoint(output_dir)
        # Tokenizer files are identical on all ranks, so write them once
        if dist.get_rank() == 0:
            tokenizer.save_pretrained(output_dir)
    except Exception as e:
        logger.error(f"Error saving model: {e}")

Error Handling

Checkpoint saving is wrapped in a try-except block to prevent training failures due to I/O errors. Common failure modes include:

  • Disk space exhaustion -- Large models can produce checkpoints of 100+ GB.
  • Permission errors -- Output directory may not be writable.
  • NFS/distributed filesystem issues -- Network filesystem timeouts or locks.

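By catching exceptions and logging the error, training can continue even if the checkpoint save fails. This matters most for long-running jobs, where an unhandled I/O error would abort the run and lose everything since the last good checkpoint. In a multi-rank job the failure should be handled consistently on all ranks, since a rank that skips a collective step can leave the others waiting.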
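
The failure modes above can be partly anticipated before the save starts. The sketch below is a minimal, single-process illustration: it checks free disk space and wraps an arbitrary save callable. The helper names and the `required_bytes` estimate are assumptions; a real multi-rank job would also need all ranks to agree on whether to proceed.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def has_free_space(path, required_bytes):
    """True if the filesystem holding `path` has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


def try_save(save_fn, path, required_bytes):
    """Run save_fn() if there is room; log and return False on failure."""
    if not has_free_space(path, required_bytes):
        logger.error("Not enough free space under %s for checkpoint", path)
        return False
    try:
        save_fn()
    except OSError as e:  # covers permission errors and most filesystem failures
        logger.error("Error saving checkpoint: %s", e)
        return False
    return True
```

Returning a boolean rather than raising lets the training loop decide whether to retry, alert, or carry on.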

Usage Pattern

  1. Enable checkpoint saving via the --save_checkpoint command-line flag.
  2. Specify the output directory via --output_dir.
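  3. After training completes, the save logic runs: every rank calls save_checkpoint, and rank 0 additionally saves the tokenizer.
  4. The saved checkpoint can be loaded for:
    • Training resumption -- Using model_engine.load_checkpoint()
    • Inference -- Using DeepSpeed's checkpoint conversion tools (such as the zero_to_fp32.py script placed in the checkpoint directory) to produce a standard Hugging Face model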
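
The two command-line flags in steps 1 and 2 could be declared along these lines. This is a sketch, not the repository's actual argument parser: treating --save_checkpoint as a boolean switch and the default value of --output_dir are both assumptions.

```python
import argparse


def build_parser():
    """Argument parser exposing the two checkpoint-related flags."""
    parser = argparse.ArgumentParser(description="ZeRO-3 training (sketch)")
    # Assumed to be a boolean switch: present means "save at the end"
    parser.add_argument("--save_checkpoint", action="store_true",
                        help="save model, optimizer, and tokenizer after training")
    # Default value is a placeholder chosen for this example
    parser.add_argument("--output_dir", type=str, default="./output",
                        help="directory that receives the checkpoint files")
    return parser
```

A call such as `build_parser().parse_args(["--save_checkpoint", "--output_dir", "/tmp/ckpt"])` yields `args.save_checkpoint == True` and `args.output_dir == "/tmp/ckpt"`.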
