Implementation: deepspeedai/DeepSpeed deepspeed_py_copy
| Knowledge Sources | |
|---|---|
| Domains | Async_IO, NVMe_Offload |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Optimized memory copy operation for PyTorch tensors using SIMD instructions (AVX-512/AVX2) and OpenMP parallelization.
Description
This module implements deepspeed_py_memcpy, a high-performance tensor copying function that leverages CPU SIMD (Single Instruction Multiple Data) capabilities to achieve faster memory transfers than standard memcpy. The implementation uses a hierarchical approach with three helper functions (helper_memcpy_1, helper_memcpy_4, helper_mempcy_8) that process 1x, 4x, or 8x SIMD width chunks per iteration respectively. It tiles the data into chunks and uses OpenMP to parallelize the copying across multiple CPU cores.
The code is optimized for AVX-512 (16 floats per SIMD register) or AVX2 (8 floats per register) instruction sets. The tiled approach with 8-way SIMD unrolling reduces loop overhead and maximizes memory bandwidth utilization. Any remaining elements that don't fit into SIMD-aligned chunks are handled with scalar operations.
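The tiered 8x/4x/1x structure described above can be sketched in plain C++. This is a simplified scalar analogue, not the actual DeepSpeed kernel: the real helpers use AVX-512/AVX2 intrinsics and an OpenMP `parallel for` over the bulk region, and the `SIMD_WIDTH` constant and `chunked_copy` name here are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in for the per-register float count: 16 for AVX-512, 8 for AVX2.
constexpr std::size_t SIMD_WIDTH = 16;

// Scalar analogue of the hierarchical copy: widest chunks first,
// then progressively narrower loops, then a scalar tail.
void chunked_copy(float* dst, const float* src, std::size_t n) {
    std::size_t i = 0;
    // 8x-unrolled bulk loop (analogue of helper_mempcy_8); the real code
    // parallelizes this region across cores with an OpenMP pragma.
    for (; i + 8 * SIMD_WIDTH <= n; i += 8 * SIMD_WIDTH)
        std::memcpy(dst + i, src + i, 8 * SIMD_WIDTH * sizeof(float));
    // 4x loop for medium remainders (analogue of helper_memcpy_4).
    for (; i + 4 * SIMD_WIDTH <= n; i += 4 * SIMD_WIDTH)
        std::memcpy(dst + i, src + i, 4 * SIMD_WIDTH * sizeof(float));
    // Single-SIMD-width loop (analogue of helper_memcpy_1).
    for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH)
        std::memcpy(dst + i, src + i, SIMD_WIDTH * sizeof(float));
    // Scalar tail for elements that do not fill a full SIMD register.
    for (; i < n; ++i) dst[i] = src[i];
}
```

Descending through chunk sizes keeps the hot loop body small and amortizes loop overhead over the widest possible stride, which is the same reason the real kernel unrolls 8 SIMD registers per iteration before falling back.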
Usage
Use this function when you need to copy large tensors in CPU memory and want better performance than PyTorch's default copy operation. It's particularly useful during checkpoint loading/saving operations or when moving data between pinned and non-pinned CPU buffers before I/O operations.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/aio/py_lib/deepspeed_py_copy.cpp
Signature
int deepspeed_py_memcpy(torch::Tensor& dest, const torch::Tensor& src);
Import
#include "deepspeed_py_copy.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dest | torch::Tensor& | Yes | Destination tensor (will be modified in-place) |
| src | const torch::Tensor& | Yes | Source tensor to copy from |
Outputs
| Name | Type | Description |
|---|---|---|
| return_code | int | 0 on success |
| dest | torch::Tensor | Updated with copied data from src |
Usage Examples
import torch
from deepspeed.ops.aio import deepspeed_memcpy
# Create source and destination tensors
src = torch.randn(1024*1024, dtype=torch.float32)
dest = torch.empty_like(src)
# Fast SIMD-accelerated copy
deepspeed_memcpy(dest, src)
# Typical usage: copying to pinned buffer before I/O
regular_tensor = torch.randn(1024*1024)
pinned_tensor = io_handle.new_cpu_locked_tensor(1024*1024, regular_tensor)
deepspeed_memcpy(pinned_tensor, regular_tensor)
io_handle.async_pwrite(pinned_tensor, "/nvme/state.pt", 0)
// C++ usage
auto src = torch::randn({1024, 1024}, torch::kFloat32);
auto dest = torch::empty_like(src);
// Optimized copy
deepspeed_py_memcpy(dest, src);
// The function automatically:
// 1. Makes tensors contiguous
// 2. Uses 8x SIMD unrolling for bulk data
// 3. Falls back to 4x, then 1x for remaining data
// 4. Uses OpenMP to parallelize across cores
// 5. Handles non-aligned tail with scalar operations