
Implementation:Deepspeedai DeepSpeed GDS Op

From Leeroopedia


Knowledge Sources
Domains Async_IO, NVMe_Offload, GPUDirect_Storage
Last Updated 2026-02-09 00:00 GMT

Overview

An NVIDIA GPUDirect Storage (GDS) operation descriptor that enables direct data transfers between GPU memory and NVMe storage without CPU involvement.

Description

The gds_op_desc_t class implements I/O operations using NVIDIA's GPUDirect Storage technology, which lets GPU memory exchange data with NVMe storage directly, bypassing the CPU and system memory entirely. This eliminates CPU overhead and intermediate memory copies, yielding significantly higher bandwidth and lower latency than the traditional GPU -> CPU -> NVMe path. The implementation uses NVIDIA's cuFile API to register GPU buffers, manage file handles, and perform direct read/write operations.

Key features include:

  • Buffer registry management: Tracks registered GPU buffers per device with automatic base pointer lookup
  • Zero-copy transfers: Data moves directly between GPU VRAM and NVMe without CPU bounce buffers
  • Multi-GPU support: Maintains separate buffer registries for each GPU device
  • Error handling: Comprehensive error reporting for cuFile operations with meaningful error messages

The implementation extends the base io_op_desc_t interface, allowing it to be used interchangeably with CPU-based operations through the same I/O handle interface.

Usage

Use GDS operations when you have NVIDIA GPUs with GPUDirect Storage support and want maximum I/O performance for large model checkpoints or optimizer states. GDS is particularly beneficial for large-scale training where checkpoint I/O time is a bottleneck. It requires suitable hardware (NVIDIA DGX systems or GDS-certified storage), the NVIDIA GPUDirect Storage driver stack, and a supported NVMe configuration.
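If your DeepSpeed release exposes GDS through the engine config (newer versions add a use_gds flag to the aio section; check your version's documentation), the asynchronous I/O settings might look like the following. The use_gds key is the assumption here; the remaining keys are the standard aio tuning knobs:

```json
{
  "aio": {
    "block_size": 1048576,
    "queue_depth": 8,
    "single_submit": false,
    "overlap_events": true,
    "use_gds": true
  }
}
```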

Code Reference

Source Location

Signature

class gds_op_desc_t : public io_op_desc_t {
    static void add_buffer_to_registry(const torch::Tensor& buffer);
    static void remove_buffer_from_registry(const torch::Tensor& buffer);

    gds_op_desc_t(const bool read_op,
                  const torch::Tensor& buffer,
                  const int fd,
                  const char* filename,
                  const int intra_op_parallelism,
                  const bool validate,
                  const int64_t file_offset);

    char* data_ptr() const override;
    void finish() override;
    void validate() override;
    void run(const int tid,
             std::unique_ptr<aio_context>& aio_ctxt,
             deepspeed_aio_config_t* aio_config) override;
};

Import

#include "deepspeed_gds_op.h"

I/O Contract

Inputs

Name                  Type                           Required  Description
buffer                torch::Tensor                  Yes       GPU tensor to register or use for I/O
read_op               bool                           Yes       True for read, false for write
fd                    int                            Yes       File descriptor for I/O
filename              const char*                    No        Filename for validation
intra_op_parallelism  int                            Yes       Number of parallel threads
validate              bool                           Yes       Whether to validate operation
file_offset           int64_t                        Yes       Starting offset in file
tid                   int                            Yes       Thread ID (0 to parallelism-1)
aio_ctxt              std::unique_ptr<aio_context>&  Yes       AIO context (unused for GDS)
aio_config            deepspeed_aio_config_t*        Yes       AIO config (unused for GDS)

Outputs

Name               Type           Description
data_ptr           char*          Pointer to GPU buffer
buffer             torch::Tensor  Updated with read data (for read operations)
validation_result  void           Assertion failure on validation error
registry_status    void           Exit on registration/deregistration errors

Usage Examples

import torch
from deepspeed.ops.op_builder import GDSBuilder

# Create a GDS-backed I/O handle (arguments shown positionally:
# block_size, queue_depth, single_submit, overlap_events, intra_op_parallelism)
handle = GDSBuilder().load().gds_handle(
    4 * 1024 * 1024,  # 4MB blocks for good GDS throughput
    32,               # queue_depth
    False,            # single_submit
    True,             # overlap_events
    4)                # intra_op_parallelism

# Allocate a large tensor on the GPU and register it with the GDS driver
gpu_tensor = torch.randn(1024, 1024, 1024).cuda()
handle.pin_device_tensor(gpu_tensor)

# Direct GPU-to-NVMe write (no CPU bounce buffer)
handle.async_pwrite(gpu_tensor, "/nvme/gpu_checkpoint.pt")

# Continue GPU computation while I/O happens (model/inputs defined elsewhere)
model.forward(inputs)
loss.backward()

# Wait for I/O completion
handle.wait()

# Direct NVMe-to-GPU read into a registered buffer
gpu_buffer = torch.empty_like(gpu_tensor)
handle.pin_device_tensor(gpu_buffer)
handle.async_pread(gpu_buffer, "/nvme/gpu_checkpoint.pt")
handle.wait()

# Release registrations when done
handle.unpin_device_tensor(gpu_tensor)
handle.unpin_device_tensor(gpu_buffer)

# The tensor is now in GPU memory; no CPU staging was needed
// C++ usage with buffer registration
#include <fcntl.h>
#include <unistd.h>
#include "deepspeed_gds_op.h"

auto gpu_tensor = torch::randn({1024, 1024, 1024},
                               torch::TensorOptions().device(torch::kCUDA));

// Register buffer for GDS
gds_op_desc_t::add_buffer_to_registry(gpu_tensor);

// Open the target file; GDS requires an O_DIRECT file descriptor
const int fd = open("/nvme/checkpoint.pt", O_CREAT | O_WRONLY | O_DIRECT, 0644);

// Create GDS operation descriptor
gds_op_desc_t gds_op(false,            // read_op = false -> write operation
                     gpu_tensor,
                     fd,
                     "/nvme/checkpoint.pt",
                     4,                // intra_op_parallelism
                     false,            // validate
                     0);               // file_offset

// Execute direct GPU-to-NVMe transfer; with 4-way parallelism each worker
// thread calls run() with its own tid (only tid 0 shown here)
std::unique_ptr<aio_context> ctx;  // unused for GDS
deepspeed_aio_config_t cfg;        // unused for GDS
gds_op.run(0, ctx, &cfg);

// Cleanup
gds_op.finish();
gds_op_desc_t::remove_buffer_from_registry(gpu_tensor);
close(fd);
