Implementation: deepspeedai/DeepSpeed GDS Op
| Knowledge Sources | |
|---|---|
| Domains | Async_IO, NVMe_Offload, GPUDirect_Storage |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
NVIDIA GPUDirect Storage (GDS) operation descriptor enabling direct data transfers between GPU memory and NVMe storage without CPU involvement.
Description
The gds_op_desc_t class implements I/O operations using NVIDIA's GPUDirect Storage technology, which allows GPU memory to directly access NVMe storage, bypassing the CPU and system memory entirely. This eliminates CPU overhead and intermediate memory copies, providing significantly higher bandwidth and lower latency than the traditional GPU->CPU->NVMe path. The implementation uses NVIDIA's cuFile API to register GPU buffers, manage file handles, and perform direct read/write operations.
Key features include:
- Buffer registry management: Tracks registered GPU buffers per device with automatic base pointer lookup
- Zero-copy transfers: Data moves directly between GPU VRAM and NVMe without CPU bounce buffers
- Multi-GPU support: Maintains separate buffer registries for each GPU device
- Error handling: Comprehensive error reporting for cuFile operations with meaningful error messages
The implementation extends the base io_op_desc_t interface, allowing it to be used interchangeably with CPU-based operations through the same I/O handle interface.
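The per-device registry with automatic base pointer lookup can be sketched as below. This is an illustrative model, not DeepSpeed's actual data structure: the names (register_buffer, lookup_base, g_registries) are hypothetical, and the real implementation keys registrations through the cuFile API. The point it demonstrates is why a lookup by *any* interior address must map back to the registered base allocation:

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// One registry per GPU device id, keyed by buffer base address.
// Hypothetical sketch only; names are not DeepSpeed symbols.
using device_registry_t = std::map<char*, size_t>;
static std::map<int, device_registry_t> g_registries;

void register_buffer(int device, char* base, size_t length)
{
    g_registries[device][base] = length;
}

// Given any address inside a registered buffer, return the buffer's base
// pointer (registration is per base allocation, but I/O may target a slice).
char* lookup_base(int device, char* addr)
{
    auto& reg = g_registries[device];
    auto it = reg.upper_bound(addr);        // first entry with base > addr
    if (it == reg.begin()) return nullptr;  // no base at or below addr
    --it;                                   // greatest base <= addr
    if (addr < it->first + it->second) return it->first;
    return nullptr;                         // addr outside any registered buffer
}
```

A slice of a registered tensor (e.g. an offset view) would thus still resolve to the registered base allocation before the transfer is issued.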
Usage
Use GDS operations when you have NVIDIA GPUs with GPUDirect Storage support and want maximum I/O performance for large model checkpoints or optimizer states. GDS is particularly beneficial for large-scale training where checkpoint I/O time is a bottleneck. Requires appropriate hardware (NVIDIA DGX systems or certified storage), drivers (NVIDIA GPUDirect Storage), and NVMe configuration.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/gds/py_lib/deepspeed_gds_op.cpp
Signature
class gds_op_desc_t : public io_op_desc_t {
    static void add_buffer_to_registry(const torch::Tensor& buffer);
    static void remove_buffer_from_registry(const torch::Tensor& buffer);

    gds_op_desc_t(const bool read_op,
                  const torch::Tensor& buffer,
                  const int fd,
                  const char* filename,
                  const int intra_op_parallelism,
                  const bool validate,
                  const int64_t file_offset);

    char* data_ptr() const override;
    void finish() override;
    void validate() override;
    void run(const int tid,
             std::unique_ptr<aio_context>& aio_ctxt,
             deepspeed_aio_config_t* aio_config) override;
};
Import
#include "deepspeed_gds_op.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| buffer | torch::Tensor | Yes | GPU tensor to register or use for I/O |
| read_op | bool | Yes | True for read, false for write |
| fd | int | Yes | File descriptor for I/O |
| filename | const char* | No | Filename for validation |
| intra_op_parallelism | int | Yes | Number of parallel threads |
| validate | bool | Yes | Whether to validate operation |
| file_offset | int64_t | Yes | Starting offset in file |
| tid | int | Yes | Thread ID (0 to parallelism-1) |
| aio_ctxt | std::unique_ptr<aio_context>& | Yes | AIO context (unused for GDS) |
| aio_config | deepspeed_aio_config_t* | Yes | AIO config (unused for GDS) |
Outputs
| Name | Type | Description |
|---|---|---|
| data_ptr | char* | Pointer to GPU buffer |
| buffer | torch::Tensor | Updated with read data (for read operations) |
| validation_result | void | Assertion failure on validation error |
| registry_status | void | Exit on registration/deregistration errors |
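The tid and intra_op_parallelism inputs imply that each worker thread transfers one slice of the request. The sketch below shows one plausible partitioning scheme (even split with the remainder spread over the first threads); the exact chunking used by gds_op_desc_t is not specified here, so treat slice_for_thread and io_slice_t as illustrative names:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical slice of one I/O request assigned to a single worker thread.
struct io_slice_t {
    int64_t offset;  // byte offset of this thread's slice
    int64_t length;  // bytes this thread transfers
};

// Assumed partitioning: near-even split, first `remainder` threads take
// one extra byte each so the slices tile the full range exactly.
io_slice_t slice_for_thread(int64_t total_bytes, int parallelism, int tid)
{
    const int64_t base = total_bytes / parallelism;
    const int64_t remainder = total_bytes % parallelism;
    const int64_t length = base + (tid < remainder ? 1 : 0);
    const int64_t offset = tid * base + (tid < remainder ? tid : remainder);
    return {offset, length};
}
```

Under this scheme, each call to run(tid, ...) would issue the transfer for its own contiguous slice, and finish() would be the point where all slices are known complete.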
Usage Examples
import torch
from deepspeed.ops.op_builder import GDSBuilder

# Create a GDS-backed handle (requires GPUDirect Storage support)
handle = GDSBuilder().load().gds_handle(
    block_size=4 * 1024 * 1024,  # 4MB blocks for optimal GDS performance
    queue_depth=32,
    single_submit=False,
    overlap_events=True,
    intra_op_parallelism=4)

# Large tensor on GPU; the handle registers the buffer for GDS internally
gpu_tensor = torch.randn(1024, 1024, 1024).cuda()

# Direct GPU-to-NVMe write (no CPU copy)
handle.async_pwrite(gpu_tensor, "/nvme/gpu_checkpoint.pt")

# Continue GPU computation while I/O happens
model.forward(inputs)
loss.backward()

# Wait for I/O completion
handle.wait()

# Direct NVMe-to-GPU read
gpu_buffer = torch.empty_like(gpu_tensor)
handle.async_pread(gpu_buffer, "/nvme/gpu_checkpoint.pt")
handle.wait()
# The tensor is now in GPU memory, no CPU staging needed
// C++ usage with explicit buffer registration
auto gpu_tensor = torch::randn({1024, 1024, 1024},
                               torch::TensorOptions().device(torch::kCUDA));

// Register the GPU buffer with the GDS registry
gds_op_desc_t::add_buffer_to_registry(gpu_tensor);

// Create GDS operation descriptor (read_op = false -> write)
gds_op_desc_t gds_op(false,
                     gpu_tensor,
                     fd,
                     "/nvme/checkpoint.pt",
                     4,      // 4-way intra-op parallelism
                     false,  // no validation
                     0);     // starting file offset

// Execute the direct GPU-to-NVMe transfer; in practice run() is invoked
// once per worker thread with tid in [0, intra_op_parallelism)
std::unique_ptr<aio_context> ctx;  // not used for GDS
deepspeed_aio_config_t cfg;        // not used for GDS
gds_op.run(0, ctx, &cfg);

// Cleanup
gds_op.finish();
gds_op_desc_t::remove_buffer_from_registry(gpu_tensor);