
Implementation:Deepspeedai DeepSpeed CPU IO Op

From Leeroopedia


Knowledge Sources
Domains Async_IO, NVMe_Offload
Last Updated 2026-02-09 00:00 GMT

Overview

CPU-based I/O operation descriptor that handles tensor transfers between device memory (GPU/XPU) and NVMe storage with automatic bounce buffer management.

Description

The cpu_op_desc_t class extends the base io_op_desc_t to implement CPU-based asynchronous I/O operations on PyTorch tensors. Its key feature is automatic bounce buffer management: it detects whether a tensor already resides in pinned (page-locked) CPU memory and creates a temporary bounce buffer when it does not. For GPU tensors, data is copied to a CPU bounce buffer before writing to NVMe, and after reading from NVMe it is copied back to the GPU. Staging every transfer through pinned CPU memory keeps O_DIRECT I/O efficient while supporting tensors on any device (CPU, CUDA, XPU, NPU).
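The device/pinned-memory decision described above can be sketched as a small standalone model. The enum and function names below are illustrative, not DeepSpeed's actual API; they only capture the branching logic.

```cpp
#include <cassert>

// Hypothetical model of the bounce-buffer decision in cpu_op_desc_t.
// Names are illustrative, not the real DeepSpeed types.
enum class Device { CPU, CUDA };

struct TensorInfo {
    Device device;
    bool is_pinned;  // page-locked host memory
};

enum class BufferChoice {
    UseDirectly,         // pinned CPU tensor: usable for O_DIRECT as-is
    PinnedBounceBuffer,  // pageable CPU tensor: stage through pinned pool
    DeviceBounceBuffer   // GPU/XPU tensor: copy to a pinned CPU buffer first
};

BufferChoice choose_buffer(const TensorInfo& t)
{
    if (t.device != Device::CPU) return BufferChoice::DeviceBounceBuffer;
    if (t.is_pinned) return BufferChoice::UseDirectly;
    return BufferChoice::PinnedBounceBuffer;
}
```

Only the pinned CPU case avoids a staging copy; the other two paths pay one extra host-side copy in exchange for O_DIRECT-compatible buffers.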

The implementation integrates with DeepSpeed's pinned tensor manager to reuse pre-allocated page-locked memory pools, reducing allocation overhead. It supports parallel execution across multiple threads, with each thread handling a portion of the total I/O.
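The per-thread split can be illustrated with a hypothetical slice computation: each of the intra_op_parallelism threads gets one contiguous chunk of the buffer. This is a sketch of the partitioning idea, not DeepSpeed's exact code; here any remainder is simply assigned to the last thread.

```cpp
#include <cassert>

// Hypothetical partitioning: thread `tid` of `parallelism` threads
// handles one contiguous slice of a num_bytes buffer.
struct Slice {
    long long offset;
    long long length;
};

Slice thread_slice(long long num_bytes, int parallelism, int tid)
{
    const long long chunk = num_bytes / parallelism;
    Slice s;
    s.offset = static_cast<long long>(tid) * chunk;
    // Last thread absorbs any remainder in this sketch.
    s.length = (tid == parallelism - 1) ? num_bytes - s.offset : chunk;
    return s;
}
```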

Usage

This operation descriptor is used internally when the I/O handle performs operations on GPU tensors or non-pinned CPU tensors. It's automatically instantiated by the I/O handle's _create_io_op_desc method based on the tensor's device and memory properties.
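The dispatch can be modeled as a virtual factory method: the base I/O handle always produces the CPU descriptor, while a specialized handle (for example, a GPUDirect Storage path) may override it for device tensors. The class layout below is an illustrative mock, not DeepSpeed's actual code.

```cpp
#include <cassert>
#include <string>

// Illustrative model of the _create_io_op_desc dispatch; returns a
// descriptor name instead of constructing real descriptor objects.
struct io_handle_t {
    virtual ~io_handle_t() = default;
    // Base handle: every tensor goes through the CPU descriptor,
    // which adds bounce buffers as needed.
    virtual std::string create_io_op_desc(bool /*on_gpu*/) const
    {
        return "cpu_op_desc_t";
    }
};

// Hypothetical specialized handle that keeps GPU tensors on a
// device-direct path and falls back to the CPU path otherwise.
struct gds_io_handle_t : io_handle_t {
    std::string create_io_op_desc(bool on_gpu) const override
    {
        if (on_gpu) return "gds_op_desc_t";
        return io_handle_t::create_io_op_desc(on_gpu);
    }
};
```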

Code Reference

Source Location

Signature

class cpu_op_desc_t : public io_op_desc_t {
    cpu_op_desc_t(const std::unique_ptr<struct deepspeed_pin_tensor_t>& pinned_tensor_mgr,
                  const bool read_op,
                  const torch::Tensor& buffer,
                  const int fd,
                  const char* filename,
                  const int intra_op_parallelism,
                  const bool validate,
                  const int64_t file_offset);

    char* data_ptr() const override;
    void finish() override;
    void validate() override;
    void run(const int tid,
             std::unique_ptr<aio_context>& aio_ctxt,
             deepspeed_aio_config_t* aio_config) override;
};

Import

#include "deepspeed_cpu_op.h"
#include "deepspeed_pin_tensor.h"

I/O Contract

Inputs

Name | Type | Required | Description
pinned_tensor_mgr | const std::unique_ptr<deepspeed_pin_tensor_t>& | Yes | Manager for pinned memory allocation
read_op | bool | Yes | True for read operations, false for write
buffer | torch::Tensor | Yes | PyTorch tensor on any device (CPU/GPU/XPU/NPU)
fd | int | Yes | File descriptor for I/O
filename | const char* | No | Filename for validation (can be null)
intra_op_parallelism | int | Yes | Number of parallel threads
validate | bool | Yes | Whether to validate operation correctness
file_offset | int64_t | Yes | Starting offset in the file
tid | int | Yes | Thread ID (0 to intra_op_parallelism - 1); passed to run(), not the constructor
aio_ctxt | std::unique_ptr<aio_context>& | Yes | Per-thread AIO context; passed to run()
aio_config | deepspeed_aio_config_t* | Yes | AIO configuration; passed to run()

Outputs

Name | Type | Description
data_ptr | char* | Pointer to a contiguous buffer (CPU memory) for I/O
buffer | torch::Tensor | Updated with read data (for read operations)
validation_result | void | validate() throws an assertion on validation failure

Usage Examples

// Create CPU operation descriptor (typically done by I/O handle)
auto gpu_tensor = torch::randn({1024*1024}, torch::TensorOptions().device(torch::kCUDA));
auto pinned_mgr = std::make_unique<deepspeed_pin_tensor_t>();

cpu_op_desc_t op_desc(pinned_mgr,
                      false,  // write operation
                      gpu_tensor,
                      fd,
                      "/nvme/state.pt",
                      8,  // 8-way parallelism
                      false,  // no validation
                      0);  // file offset

// Execute operation in worker thread
std::unique_ptr<aio_context> aio_ctxt(new aio_context(1024*1024, 128));  // block size, queue depth
deepspeed_aio_config_t config(1024*1024,  // block size
                              128,        // queue depth
                              false,      // single submit
                              true,       // overlap events
                              8);         // intra-op parallelism (must match the descriptor)
op_desc.run(0, aio_ctxt, &config);  // Thread 0

// After all threads complete, finalize
op_desc.finish();  // Copies bounce buffer back to GPU for reads

// Example of automatic bounce buffer logic:
// - GPU tensor: Creates CPU bounce buffer, copies GPU->CPU before write
// - Pinned CPU tensor: Uses tensor directly, no bounce buffer
// - Regular CPU tensor: Creates pinned bounce buffer from pool
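The example above drives only thread 0; in practice the I/O handle fans run() out across all intra_op_parallelism worker threads, joins them, and then calls finish() once. The generic pattern can be sketched with a fake descriptor (the struct below is a stand-in for cpu_op_desc_t, not the real class):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Fake descriptor standing in for cpu_op_desc_t: run(tid, ...) records
// the bytes this thread's slice would submit; finish() returns the total.
struct fake_op_desc {
    explicit fake_op_desc(int parallelism) : bytes_done(0), parallelism(parallelism) {}

    void run(int /*tid*/, long long total_bytes)
    {
        const long long chunk = total_bytes / parallelism;
        bytes_done += chunk;  // pretend this thread's slice was submitted
    }

    long long finish() { return bytes_done.load(); }

    std::atomic<long long> bytes_done;
    const int parallelism;
};

// Fan-out pattern: one thread per tid, join all, then finalize once
// (for reads, finish() is where the bounce buffer is copied back).
long long run_parallel(fake_op_desc& op, long long total_bytes)
{
    std::vector<std::thread> workers;
    for (int tid = 0; tid < op.parallelism; ++tid) {
        workers.emplace_back([&op, tid, total_bytes] { op.run(tid, total_bytes); });
    }
    for (auto& t : workers) t.join();
    return op.finish();
}
```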
