
Implementation:Deepspeedai DeepSpeed GDS Op

From Leeroopedia


Knowledge Sources
Domains Async_IO, NVMe_Offload, GPUDirect_Storage
Last Updated 2026-02-09 00:00 GMT

Overview

An NVIDIA GPUDirect Storage (GDS) operation descriptor that enables direct data transfers between GPU memory and NVMe storage without CPU involvement.

Description

The gds_op_desc_t class implements I/O operations using NVIDIA's GPUDirect Storage technology, which lets GPU memory exchange data with NVMe storage directly, bypassing the CPU and system memory entirely. This eliminates CPU overhead and intermediate memory copies, yielding significantly higher bandwidth and lower latency than the traditional GPU -> CPU -> NVMe path. The implementation uses NVIDIA's cuFile API to register GPU buffers, manage file handles, and perform direct read/write operations.

Key features include:

  • Buffer registry management: Tracks registered GPU buffers per device with automatic base pointer lookup
  • Zero-copy transfers: Data moves directly between GPU VRAM and NVMe without CPU bounce buffers
  • Multi-GPU support: Maintains separate buffer registries for each GPU device
  • Error handling: Comprehensive error reporting for cuFile operations with meaningful error messages

The implementation extends the base io_op_desc_t interface, allowing it to be used interchangeably with CPU-based operations through the same I/O handle interface.

Usage

Use GDS operations when you have NVIDIA GPUs with GPUDirect Storage support and want maximum I/O performance for large model checkpoints or optimizer states. GDS is particularly beneficial for large-scale training where checkpoint I/O time is a bottleneck. It requires suitable hardware (NVIDIA DGX systems or GDS-certified storage), the NVIDIA GPUDirect Storage driver stack, and a supported NVMe configuration.
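If your DeepSpeed release exposes GDS through the engine config (newer versions add a use_gds flag to the aio section; check your version's documentation), the asynchronous I/O settings might look like the following. The use_gds key is the assumption here; the remaining keys are the standard aio tuning knobs:

```json
{
  "aio": {
    "block_size": 1048576,
    "queue_depth": 8,
    "single_submit": false,
    "overlap_events": true,
    "use_gds": true
  }
}
```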

Code Reference

Source Location

Signature

class gds_op_desc_t : public io_op_desc_t {
    static void add_buffer_to_registry(const torch::Tensor& buffer);
    static void remove_buffer_from_registry(const torch::Tensor& buffer);

    gds_op_desc_t(const bool read_op,
                  const torch::Tensor& buffer,
                  const int fd,
                  const char* filename,
                  const int intra_op_parallelism,
                  const bool validate,
                  const int64_t file_offset);

    char* data_ptr() const override;
    void finish() override;
    void validate() override;
    void run(const int tid,
             std::unique_ptr<aio_context>& aio_ctxt,
             deepspeed_aio_config_t* aio_config) override;
};

Import

#include "deepspeed_gds_op.h"

I/O Contract

Inputs

Name                  Type                           Required  Description
buffer                torch::Tensor                  Yes       GPU tensor to register or use for I/O
read_op               bool                           Yes       True for read, false for write
fd                    int                            Yes       File descriptor for I/O
filename              const char*                    No        Filename for validation
intra_op_parallelism  int                            Yes       Number of parallel threads
validate              bool                           Yes       Whether to validate operation
file_offset           int64_t                        Yes       Starting offset in file
tid                   int                            Yes       Thread ID (0 to parallelism-1)
aio_ctxt              std::unique_ptr<aio_context>&  Yes       AIO context (unused for GDS)
aio_config            deepspeed_aio_config_t*        Yes       AIO config (unused for GDS)

Outputs

Name               Type           Description
data_ptr           char*          Pointer to GPU buffer
buffer             torch::Tensor  Updated with read data (for read operations)
validation_result  void           Assertion failure on validation error
registry_status    void           Exit on registration/deregistration errors

Usage Examples

import torch
from deepspeed.ops.op_builder import GDSBuilder

# Create a GDS-backed I/O handle (arguments shown positionally:
# block_size, queue_depth, single_submit, overlap_events, intra_op_parallelism)
handle = GDSBuilder().load().gds_handle(
    4 * 1024 * 1024,  # 4MB blocks for good GDS throughput
    32,               # queue_depth
    False,            # single_submit
    True,             # overlap_events
    4)                # intra_op_parallelism

# Allocate a large tensor on the GPU and register it with the GDS driver
gpu_tensor = torch.randn(1024, 1024, 1024).cuda()
handle.pin_device_tensor(gpu_tensor)

# Direct GPU-to-NVMe write (no CPU bounce buffer)
handle.async_pwrite(gpu_tensor, "/nvme/gpu_checkpoint.pt")

# Continue GPU computation while I/O happens (model/inputs defined elsewhere)
model.forward(inputs)
loss.backward()

# Wait for I/O completion
handle.wait()

# Direct NVMe-to-GPU read into a registered buffer
gpu_buffer = torch.empty_like(gpu_tensor)
handle.pin_device_tensor(gpu_buffer)
handle.async_pread(gpu_buffer, "/nvme/gpu_checkpoint.pt")
handle.wait()

# Release registrations when done
handle.unpin_device_tensor(gpu_tensor)
handle.unpin_device_tensor(gpu_buffer)

# The tensor is now in GPU memory; no CPU staging was needed
// C++ usage with buffer registration
#include <fcntl.h>
#include <unistd.h>
#include "deepspeed_gds_op.h"

auto gpu_tensor = torch::randn({1024, 1024, 1024},
                               torch::TensorOptions().device(torch::kCUDA));

// Register buffer for GDS
gds_op_desc_t::add_buffer_to_registry(gpu_tensor);

// Open the target file; GDS requires an O_DIRECT file descriptor
const int fd = open("/nvme/checkpoint.pt", O_CREAT | O_WRONLY | O_DIRECT, 0644);

// Create GDS operation descriptor
gds_op_desc_t gds_op(false,            // read_op = false -> write operation
                     gpu_tensor,
                     fd,
                     "/nvme/checkpoint.pt",
                     4,                // intra_op_parallelism
                     false,            // validate
                     0);               // file_offset

// Execute direct GPU-to-NVMe transfer; with 4-way parallelism each worker
// thread calls run() with its own tid (only tid 0 shown here)
std::unique_ptr<aio_context> ctx;  // unused for GDS
deepspeed_aio_config_t cfg;        // unused for GDS
gds_op.run(0, ctx, &cfg);

// Cleanup
gds_op.finish();
gds_op_desc_t::remove_buffer_from_registry(gpu_tensor);
close(fd);
