Implementation: Deepspeedai DeepSpeed CPU IO Op
| Knowledge Sources | |
|---|---|
| Domains | Async_IO, NVMe_Offload |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
CPU-based I/O operation descriptor that handles tensor transfers between device memory (GPU/XPU) and NVMe storage with automatic bounce buffer management.
Description
The cpu_op_desc_t class extends the base io_op_desc_t to implement CPU-based asynchronous I/O operations on PyTorch tensors. A key feature is its intelligent bounce buffer management: it automatically detects whether a tensor is in pinned (page-locked) CPU memory and creates temporary bounce buffers when needed for efficient I/O. For GPU tensors, it copies data to a CPU bounce buffer before writing to NVMe, and after reading from NVMe, it copies data back to the GPU. This ensures optimal performance with O_DIRECT I/O while supporting tensors on any device (CPU, CUDA, XPU, NPU).
The implementation integrates with DeepSpeed's pinned tensor manager to reuse pre-allocated page-locked memory pools, reducing allocation overhead. It supports parallel execution across multiple threads, with each thread handling a portion of the total I/O.
Usage
This operation descriptor is used internally when the I/O handle performs operations on GPU tensors or non-pinned CPU tensors. It's automatically instantiated by the I/O handle's _create_io_op_desc method based on the tensor's device and memory properties.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/aio/py_lib/deepspeed_cpu_op.cpp
Signature
class cpu_op_desc_t : public io_op_desc_t {
    cpu_op_desc_t(const std::unique_ptr<struct deepspeed_pin_tensor_t>& pinned_tensor_mgr,
                  const bool read_op,
                  const torch::Tensor& buffer,
                  const int fd,
                  const char* filename,
                  const int intra_op_parallelism,
                  const bool validate,
                  const int64_t file_offset);

    char* data_ptr() const override;
    void finish() override;
    void validate() override;
    void run(const int tid,
             std::unique_ptr<aio_context>& aio_ctxt,
             deepspeed_aio_config_t* aio_config) override;
};
Import
#include "deepspeed_cpu_op.h"
#include "deepspeed_pin_tensor.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pinned_tensor_mgr | const std::unique_ptr<deepspeed_pin_tensor_t>& | Yes | Manager for pinned memory allocation |
| read_op | bool | Yes | True for read operations, false for write |
| buffer | torch::Tensor | Yes | PyTorch tensor on any device (CPU/GPU/XPU/NPU) |
| fd | int | Yes | File descriptor for I/O |
| filename | const char* | No | Filename for validation (can be null) |
| intra_op_parallelism | int | Yes | Number of parallel threads |
| validate | bool | Yes | Whether to validate operation correctness |
| file_offset | int64_t | Yes | Starting offset in file |
| tid | int | Yes | Thread ID for parallel execution (0 to parallelism-1) |
| aio_ctxt | std::unique_ptr<aio_context>& | Yes | AIO context for this thread |
| aio_config | deepspeed_aio_config_t* | Yes | AIO configuration |
Outputs
| Name | Type | Description |
|---|---|---|
| data_ptr | char* | Pointer to contiguous buffer (CPU memory) for I/O |
| buffer | torch::Tensor | Updated with read data (for read operations) |
| validation_result | void | Asserts (aborts) on validation failure |
Usage Examples
// Create CPU operation descriptor (typically done by the I/O handle)
auto gpu_tensor = torch::randn({1024 * 1024}, torch::TensorOptions().device(torch::kCUDA));
auto pinned_mgr = std::make_unique<deepspeed_pin_tensor_t>();
cpu_op_desc_t op_desc(pinned_mgr,
                      false,             // read_op = false: write operation
                      gpu_tensor,
                      fd,
                      "/nvme/state.pt",
                      8,                 // 8-way intra-op parallelism
                      false,             // no validation
                      0);                // file offset

// Execute the operation in a worker thread; with 8-way parallelism,
// threads 0..7 each call run() with their own tid and AIO context.
std::unique_ptr<aio_context> aio_ctxt(new aio_context(1024 * 1024, 128));
deepspeed_aio_config_t config(1024 * 1024, 128, false, true, false);
op_desc.run(0, aio_ctxt, &config);  // thread 0

// After all threads complete, finalize
op_desc.finish();  // for reads: copies the bounce buffer back to the GPU

// Automatic bounce buffer logic:
// - GPU tensor: creates a CPU bounce buffer; copies GPU->CPU before a write
// - Pinned CPU tensor: used directly, no bounce buffer
// - Pageable (non-pinned) CPU tensor: gets a pinned bounce buffer from the pool