Implementation: deepspeedai/DeepSpeed deepspeed_py_copy
| Knowledge Sources | |
|---|---|
| Domains | Async_IO, NVMe_Offload |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Optimized memory copy operation for PyTorch tensors using SIMD instructions (AVX-512/AVX2) and OpenMP parallelization.
Description
This module implements deepspeed_py_memcpy, a high-performance tensor copying function that leverages CPU SIMD (Single Instruction Multiple Data) capabilities to achieve faster memory transfers than standard memcpy. The implementation uses a hierarchical approach with three helper functions (helper_memcpy_1, helper_memcpy_4, helper_mempcy_8) that process 1x, 4x, or 8x SIMD width chunks per iteration respectively. It tiles the data into chunks and uses OpenMP to parallelize the copying across multiple CPU cores.
The code is optimized for AVX-512 (16 floats per SIMD register) or AVX2 (8 floats per register) instruction sets. The tiled approach with 8-way SIMD unrolling reduces loop overhead and maximizes memory bandwidth utilization. Any remaining elements that don't fit into SIMD-aligned chunks are handled with scalar operations.
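The tiered 8x/4x/1x structure described above can be sketched in plain C++. This is a simplified scalar analogue, not the actual DeepSpeed kernel: the real helpers use AVX-512/AVX2 intrinsics and an OpenMP `parallel for` over the bulk region, and the `SIMD_WIDTH` constant and `chunked_copy` name here are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in for the per-register float count: 16 for AVX-512, 8 for AVX2.
constexpr std::size_t SIMD_WIDTH = 16;

// Scalar analogue of the hierarchical copy: widest chunks first,
// then progressively narrower loops, then a scalar tail.
void chunked_copy(float* dst, const float* src, std::size_t n) {
    std::size_t i = 0;
    // 8x-unrolled bulk loop (analogue of helper_mempcy_8); the real code
    // parallelizes this region across cores with an OpenMP pragma.
    for (; i + 8 * SIMD_WIDTH <= n; i += 8 * SIMD_WIDTH)
        std::memcpy(dst + i, src + i, 8 * SIMD_WIDTH * sizeof(float));
    // 4x loop for medium remainders (analogue of helper_memcpy_4).
    for (; i + 4 * SIMD_WIDTH <= n; i += 4 * SIMD_WIDTH)
        std::memcpy(dst + i, src + i, 4 * SIMD_WIDTH * sizeof(float));
    // Single-SIMD-width loop (analogue of helper_memcpy_1).
    for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH)
        std::memcpy(dst + i, src + i, SIMD_WIDTH * sizeof(float));
    // Scalar tail for elements that do not fill a full SIMD register.
    for (; i < n; ++i) dst[i] = src[i];
}
```

Descending through chunk sizes keeps the hot loop body small and amortizes loop overhead over the widest possible stride, which is the same reason the real kernel unrolls 8 SIMD registers per iteration before falling back.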
Usage
Use this function when you need to copy large tensors in CPU memory and want better performance than PyTorch's default copy operation. It's particularly useful during checkpoint loading/saving operations or when moving data between pinned and non-pinned CPU buffers before I/O operations.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/aio/py_lib/deepspeed_py_copy.cpp
Signature
int deepspeed_py_memcpy(torch::Tensor& dest, const torch::Tensor& src);
Import
#include "deepspeed_py_copy.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dest | torch::Tensor& | Yes | Destination tensor (will be modified in-place) |
| src | const torch::Tensor& | Yes | Source tensor to copy from |
Outputs
| Name | Type | Description |
|---|---|---|
| return_code | int | 0 on success |
| dest | torch::Tensor | Updated with copied data from src |
Usage Examples
import torch
from deepspeed.ops.aio import deepspeed_memcpy
# Create source and destination tensors
src = torch.randn(1024*1024, dtype=torch.float32)
dest = torch.empty_like(src)
# Fast SIMD-accelerated copy
deepspeed_memcpy(dest, src)
# Typical usage: copying to pinned buffer before I/O
regular_tensor = torch.randn(1024*1024)
pinned_tensor = io_handle.new_cpu_locked_tensor(1024*1024, regular_tensor)
deepspeed_memcpy(pinned_tensor, regular_tensor)
io_handle.async_pwrite(pinned_tensor, "/nvme/state.pt", 0)
// C++ usage
auto src = torch::randn({1024, 1024}, torch::kFloat32);
auto dest = torch::empty_like(src);
// Optimized copy
deepspeed_py_memcpy(dest, src);
// The function automatically:
// 1. Makes tensors contiguous
// 2. Uses 8x SIMD unrolling for bulk data
// 3. Falls back to 4x, then 1x for remaining data
// 4. Uses OpenMP to parallelize across cores
// 5. Handles non-aligned tail with scalar operations