
Implementation:Microsoft DeepSpeedExamples Fast Torch Serialization

From Leeroopedia
Revision as of 15:41, 16 February 2026 by Admin (Auto-imported from implementations/Microsoft_DeepSpeedExamples_Fast_Torch_Serialization.md)


Knowledge Sources
Domains Deep Learning, Checkpointing
Last Updated 2026-02-07 12:00 GMT

Overview

Patched version of PyTorch 2.6.0 serialization module with FastPersist optimizations for accelerated model checkpoint writing via DeepNVMe.

Description

serialization_fast_v2.6.0.py is a modified copy of PyTorch's torch.serialization module that integrates DeepSpeed FastPersist for high-throughput NVMe and GDS (GPUDirect Storage) writes during model checkpointing. It provides the same save() and load() functions that PyTorch uses for tensor serialization, along with the supporting utilities for endianness control, CRC32 options, memory-mapped loading, and safe globals management.

The key difference from the original PyTorch serialization is in the storage writing path of save(). When the file object has a save_torch_storage_object_list method (indicating a FastFileWriter handle), the module collects all storage objects and hands them over in a single call instead of writing each storage individually. Batching lets the DeepNVMe backend issue direct NVMe writes at high throughput, with a reported speedup of more than 25x over standard filesystem writes. File objects without that method go through the unmodified per-storage path.
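The dispatch described above is a duck-typing check on the file object. The following is a minimal sketch of that pattern using only the standard library; BatchingWriter, write_storages, and the single-argument form of save_torch_storage_object_list are illustrative assumptions, and raw bytes stand in for tensor storages. It is not the patched module's actual code.

```python
import io
from typing import Any, List


class BatchingWriter:
    """Hypothetical stand-in for a FastFileWriter-style handle."""

    def __init__(self) -> None:
        self.batch_calls = 0        # how many batched calls were received
        self.payload = bytearray()  # everything written so far

    def write(self, data: bytes) -> int:
        self.payload += data
        return len(data)

    def save_torch_storage_object_list(self, storages: List[bytes]) -> None:
        # Assumed signature: the handle sees every storage at once,
        # so a real backend could plan one large write.
        self.batch_calls += 1
        for blob in storages:
            self.payload += blob


def write_storages(f: Any, storages: List[bytes]) -> str:
    """Write all storages to f, preferring the batched method when f offers it."""
    if hasattr(f, "save_torch_storage_object_list"):
        f.save_torch_storage_object_list(storages)
        return "batched"
    for blob in storages:  # fallback: one write call per storage
        f.write(blob)
    return "per-storage"


fast_handle = BatchingWriter()
plain_handle = io.BytesIO()
fast_mode = write_storages(fast_handle, [b"ab", b"cd"])
plain_mode = write_storages(plain_handle, [b"ab", b"cd"])
```

Both handles end up with the same bytes; only the number of calls into the handle differs, which is what gives a backend the chance to optimize the write.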

The module also provides thread-local state management via _SerializationLocal for map_location propagation, skip_data support for metadata-only saves, and fake tensor materialization. It supports both the legacy pickle-based format and the zipfile-based serialization format that became the default in PyTorch 1.6.
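Thread-local state of this kind can be illustrated with threading.local from the standard library. The sketch below is an assumption about the general pattern, not the contents of _SerializationLocal; the names _LocalState, use_map_location, current_map_location, and observe_from_other_thread are hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Any, Iterator, List, Optional


class _LocalState(threading.local):
    """Per-thread settings; class attributes give each thread its initial values."""

    map_location: Optional[Any] = None
    skip_data: bool = False


_state = _LocalState()


@contextmanager
def use_map_location(location: Any) -> Iterator[None]:
    """Set map_location for the current thread only and restore it on exit."""
    previous = _state.map_location
    _state.map_location = location
    try:
        yield
    finally:
        _state.map_location = previous


def current_map_location() -> Optional[Any]:
    return _state.map_location


def observe_from_other_thread() -> Optional[Any]:
    """Read map_location from a freshly started thread."""
    seen: List[Optional[Any]] = []
    worker = threading.Thread(target=lambda: seen.append(current_map_location()))
    worker.start()
    worker.join()
    return seen[0]
```

A value set inside use_map_location("cpu") is visible to code running on the same thread, while a concurrently running thread still sees the default, so parallel loads cannot overwrite each other's settings.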

Usage

This module is used as a drop-in replacement for torch.serialization when FastPersist checkpointing is enabled. It is transparently swapped in by the DeepNVMe model checkpoint infrastructure to accelerate checkpoint saves without requiring changes to user code that calls torch.save().
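This page does not describe how the swap is performed. One common way to get this effect is to rebind the save attribute for the duration of a checkpoint, which the sketch below shows with a stand-in namespace in place of the real torch module so that it runs without PyTorch. swapped_save, fake_torch, fast_save, and user_checkpoint are all hypothetical names.

```python
from contextlib import contextmanager
from types import SimpleNamespace
from typing import Any, Callable, Iterator, List, Tuple


@contextmanager
def swapped_save(module: Any, replacement: Callable[..., None]) -> Iterator[None]:
    """Temporarily rebind module.save to replacement, restoring the original on exit."""
    original = module.save
    module.save = replacement
    try:
        yield
    finally:
        module.save = original


calls: List[Tuple[str, str]] = []

# Stand-in for the torch module; its save() only records that it was called.
fake_torch = SimpleNamespace(save=lambda obj, f: calls.append(("standard", f)))


def fast_save(obj: Any, f: str) -> None:
    """Stand-in for the accelerated save(); also only records the call."""
    calls.append(("fast", f))


def user_checkpoint(path: str) -> None:
    # User code is unchanged: it always calls <module>.save(...).
    fake_torch.save({"w": 1}, path)


user_checkpoint("a.pt")
with swapped_save(fake_torch, fast_save):
    user_checkpoint("b.pt")
user_checkpoint("c.pt")
```

Restoring the original in a finally clause keeps the replacement scoped to the checkpoint call, even if the save raises.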

Code Reference

Source Location

Signature

def save(
    obj: object,
    f: FILE_LIKE,
    pickle_module: Any = pickle,
    pickle_protocol: int = DEFAULT_PROTOCOL,
    _use_new_zipfile_serialization: bool = True,
    _disable_byteorder_record: bool = False,
) -> None:
    ...

def load(
    f: FILE_LIKE,
    map_location: MAP_LOCATION = None,
    pickle_module: Any = None,
    *,
    weights_only: Optional[bool] = None,
    mmap: Optional[bool] = None,
    **pickle_load_args: Any,
) -> Any:
    ...

Import

from deepnvme.model_checkpoint.torch.serialization_fast_v2_6_0 import save, load

I/O Contract

Inputs

Name | Type | Required | Description
obj | object | Yes | The Python object to serialize (typically a model state_dict or tensor)
f | FILE_LIKE | Yes | File path (str/PathLike) or file-like object with write capability; may be a FastFileWriter for NVMe acceleration
pickle_module | Any | No | Module used for pickling metadata (default: pickle)
pickle_protocol | int | No | Protocol version for pickle (default: 2)
_use_new_zipfile_serialization | bool | No | Use zipfile-based format (default: True)
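Because f may be either a path or an already open object, a writer has to normalize the two cases and close only the handles it opened itself. The following is a minimal standard-library sketch of that idea; FileLike, open_for_write, and write_payload are illustrative names, not the module's FILE_LIKE alias or its internal helpers.

```python
import io
import os
import tempfile
from contextlib import contextmanager
from typing import BinaryIO, Iterator, Union

# Illustrative alias; the module's own FILE_LIKE type is not reproduced here.
FileLike = Union[str, os.PathLike, BinaryIO]


@contextmanager
def open_for_write(f: FileLike) -> Iterator[BinaryIO]:
    """Yield a writable binary handle, closing it only if this function opened it."""
    if isinstance(f, (str, os.PathLike)):
        with open(f, "wb") as handle:
            yield handle
    else:
        yield f  # the caller owns this object's lifetime


def write_payload(f: FileLike, payload: bytes) -> None:
    with open_for_write(f) as handle:
        handle.write(payload)


# A caller-supplied buffer stays open after the write.
buffer = io.BytesIO()
write_payload(buffer, b"xyz")

# A path is opened, written, and closed by the helper.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.bin")
    write_payload(path, b"xyz")
    with open(path, "rb") as fh:
        from_disk = fh.read()
```

Leaving caller-supplied objects open matters for custom handles, since the caller may still need to flush or finalize them after the save returns.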

Outputs

Name | Type | Description
(save) | None | Writes serialized object to the specified file
(load) | Any | The deserialized Python object (typically dict, Tensor, or Module state)

Usage Examples

Saving a Model Checkpoint with FastPersist

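import torch
# Import path as given on this page. The source file is named
# serialization_fast_v2.6.0.py, so the importable module name is an assumption.
from deepnvme.model_checkpoint.torch import serialization_fast_v2_6_0 as fast_serial

# Placeholder model so the example is self-contained
model = torch.nn.Linear(4, 2)

# A plain path has no save_torch_storage_object_list method, so this call uses
# the standard write path. The batched path applies when f is a handle that
# exposes that method (for example, a FastFileWriter).
model_state = model.state_dict()
fast_serial.save(model_state, "/mnt/nvme/checkpoint.pt")

# Loading remains standard
state_dict = fast_serial.load("/mnt/nvme/checkpoint.pt", map_location="cpu")
model.load_state_dict(state_dict)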
