
Implementation:Deepspeedai DeepSpeed CPU IO Op

From Leeroopedia


Knowledge Sources
Domains Async_IO, NVMe_Offload
Last Updated 2026-02-09 00:00 GMT

Overview

CPU-based I/O operation descriptor that handles tensor transfers between device memory (GPU/XPU) and NVMe storage with automatic bounce buffer management.

Description

The cpu_op_desc_t class extends the base io_op_desc_t to implement CPU-based asynchronous I/O operations on PyTorch tensors. Its key feature is automatic bounce buffer management: it detects whether a tensor already resides in pinned (page-locked) CPU memory and creates a temporary bounce buffer when it does not. For GPU tensors, data is copied to a CPU bounce buffer before writing to NVMe, and after reading from NVMe it is copied back to the GPU. Staging every transfer through pinned CPU memory keeps O_DIRECT I/O efficient while supporting tensors on any device (CPU, CUDA, XPU, NPU).
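The device/pinned-memory decision described above can be sketched as a small standalone model. The enum and function names below are illustrative, not DeepSpeed's actual API; they only capture the branching logic.

```cpp
#include <cassert>

// Hypothetical model of the bounce-buffer decision in cpu_op_desc_t.
// Names are illustrative, not the real DeepSpeed types.
enum class Device { CPU, CUDA };

struct TensorInfo {
    Device device;
    bool is_pinned;  // page-locked host memory
};

enum class BufferChoice {
    UseDirectly,         // pinned CPU tensor: usable for O_DIRECT as-is
    PinnedBounceBuffer,  // pageable CPU tensor: stage through pinned pool
    DeviceBounceBuffer   // GPU/XPU tensor: copy to a pinned CPU buffer first
};

BufferChoice choose_buffer(const TensorInfo& t)
{
    if (t.device != Device::CPU) return BufferChoice::DeviceBounceBuffer;
    if (t.is_pinned) return BufferChoice::UseDirectly;
    return BufferChoice::PinnedBounceBuffer;
}
```

Only the pinned CPU case avoids a staging copy; the other two paths pay one extra host-side copy in exchange for O_DIRECT-compatible buffers.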

The implementation integrates with DeepSpeed's pinned tensor manager to reuse pre-allocated page-locked memory pools, reducing allocation overhead. It supports parallel execution across multiple threads, with each thread handling a portion of the total I/O.
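The per-thread split can be illustrated with a hypothetical slice computation: each of the intra_op_parallelism threads gets one contiguous chunk of the buffer. This is a sketch of the partitioning idea, not DeepSpeed's exact code; here any remainder is simply assigned to the last thread.

```cpp
#include <cassert>

// Hypothetical partitioning: thread `tid` of `parallelism` threads
// handles one contiguous slice of a num_bytes buffer.
struct Slice {
    long long offset;
    long long length;
};

Slice thread_slice(long long num_bytes, int parallelism, int tid)
{
    const long long chunk = num_bytes / parallelism;
    Slice s;
    s.offset = static_cast<long long>(tid) * chunk;
    // Last thread absorbs any remainder in this sketch.
    s.length = (tid == parallelism - 1) ? num_bytes - s.offset : chunk;
    return s;
}
```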

Usage

This operation descriptor is used internally when the I/O handle performs operations on GPU tensors or non-pinned CPU tensors. It's automatically instantiated by the I/O handle's _create_io_op_desc method based on the tensor's device and memory properties.
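The dispatch can be modeled as a virtual factory method: the base I/O handle always produces the CPU descriptor, while a specialized handle (for example, a GPUDirect Storage path) may override it for device tensors. The class layout below is an illustrative mock, not DeepSpeed's actual code.

```cpp
#include <cassert>
#include <string>

// Illustrative model of the _create_io_op_desc dispatch; returns a
// descriptor name instead of constructing real descriptor objects.
struct io_handle_t {
    virtual ~io_handle_t() = default;
    // Base handle: every tensor goes through the CPU descriptor,
    // which adds bounce buffers as needed.
    virtual std::string create_io_op_desc(bool /*on_gpu*/) const
    {
        return "cpu_op_desc_t";
    }
};

// Hypothetical specialized handle that keeps GPU tensors on a
// device-direct path and falls back to the CPU path otherwise.
struct gds_io_handle_t : io_handle_t {
    std::string create_io_op_desc(bool on_gpu) const override
    {
        if (on_gpu) return "gds_op_desc_t";
        return io_handle_t::create_io_op_desc(on_gpu);
    }
};
```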

Code Reference

Source Location

Signature

class cpu_op_desc_t : public io_op_desc_t {
    cpu_op_desc_t(const std::unique_ptr<struct deepspeed_pin_tensor_t>& pinned_tensor_mgr,
                  const bool read_op,
                  const torch::Tensor& buffer,
                  const int fd,
                  const char* filename,
                  const int intra_op_parallelism,
                  const bool validate,
                  const int64_t file_offset);

    char* data_ptr() const override;
    void finish() override;
    void validate() override;
    void run(const int tid,
             std::unique_ptr<aio_context>& aio_ctxt,
             deepspeed_aio_config_t* aio_config) override;
};

Import

#include "deepspeed_cpu_op.h"
#include "deepspeed_pin_tensor.h"

I/O Contract

Inputs

Name | Type | Required | Description
pinned_tensor_mgr | const std::unique_ptr<deepspeed_pin_tensor_t>& | Yes | Manager for pinned memory allocation
read_op | bool | Yes | True for read operations, false for write
buffer | torch::Tensor | Yes | PyTorch tensor on any device (CPU/GPU/XPU/NPU)
fd | int | Yes | File descriptor for I/O
filename | const char* | No | Filename for validation (can be null)
intra_op_parallelism | int | Yes | Number of parallel threads
validate | bool | Yes | Whether to validate operation correctness
file_offset | int64_t | Yes | Starting offset in the file
tid | int | Yes | Thread ID (0 to intra_op_parallelism - 1); passed to run(), not the constructor
aio_ctxt | std::unique_ptr<aio_context>& | Yes | Per-thread AIO context; passed to run()
aio_config | deepspeed_aio_config_t* | Yes | AIO configuration; passed to run()

Outputs

Name | Type | Description
data_ptr | char* | Pointer to a contiguous buffer (CPU memory) for I/O
buffer | torch::Tensor | Updated with read data (for read operations)
validation_result | void | validate() throws an assertion on validation failure

Usage Examples

// Create CPU operation descriptor (typically done by I/O handle)
auto gpu_tensor = torch::randn({1024*1024}, torch::TensorOptions().device(torch::kCUDA));
auto pinned_mgr = std::make_unique<deepspeed_pin_tensor_t>();

cpu_op_desc_t op_desc(pinned_mgr,
                      false,  // write operation
                      gpu_tensor,
                      fd,
                      "/nvme/state.pt",
                      8,  // 8-way parallelism
                      false,  // no validation
                      0);  // file offset

// Execute operation in worker thread
std::unique_ptr<aio_context> aio_ctxt(new aio_context(1024*1024, 128));  // block size, queue depth
deepspeed_aio_config_t config(1024*1024,  // block size
                              128,        // queue depth
                              false,      // single submit
                              true,       // overlap events
                              8);         // intra-op parallelism (must match the descriptor)
op_desc.run(0, aio_ctxt, &config);  // Thread 0

// After all threads complete, finalize
op_desc.finish();  // Copies bounce buffer back to GPU for reads

// Example of automatic bounce buffer logic:
// - GPU tensor: Creates CPU bounce buffer, copies GPU->CPU before write
// - Pinned CPU tensor: Uses tensor directly, no bounce buffer
// - Regular CPU tensor: Creates pinned bounce buffer from pool
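The example above drives only thread 0; in practice the I/O handle fans run() out across all intra_op_parallelism worker threads, joins them, and then calls finish() once. The generic pattern can be sketched with a fake descriptor (the struct below is a stand-in for cpu_op_desc_t, not the real class):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Fake descriptor standing in for cpu_op_desc_t: run(tid, ...) records
// the bytes this thread's slice would submit; finish() returns the total.
struct fake_op_desc {
    explicit fake_op_desc(int parallelism) : bytes_done(0), parallelism(parallelism) {}

    void run(int /*tid*/, long long total_bytes)
    {
        const long long chunk = total_bytes / parallelism;
        bytes_done += chunk;  // pretend this thread's slice was submitted
    }

    long long finish() { return bytes_done.load(); }

    std::atomic<long long> bytes_done;
    const int parallelism;
};

// Fan-out pattern: one thread per tid, join all, then finalize once
// (for reads, finish() is where the bounce buffer is copied back).
long long run_parallel(fake_op_desc& op, long long total_bytes)
{
    std::vector<std::thread> workers;
    for (int tid = 0; tid < op.parallelism; ++tid) {
        workers.emplace_back([&op, tid, total_bytes] { op.run(tid, total_bytes); });
    }
    for (auto& t : workers) t.join();
    return op.finish();
}
```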
