
Implementation:Deepspeedai DeepSpeed Py Copy

From Leeroopedia


Knowledge Sources
Domains: Async_IO, NVMe_Offload
Last Updated: 2026-02-09 00:00 GMT

Overview

Optimized memory copy operation for PyTorch tensors using SIMD instructions (AVX-512/AVX2) and OpenMP parallelization.

Description

This module implements deepspeed_py_memcpy, a high-performance tensor copying function that leverages CPU SIMD (Single Instruction Multiple Data) capabilities to achieve faster memory transfers than standard memcpy. The implementation uses a hierarchical approach with three helper functions (helper_memcpy_1, helper_memcpy_4, helper_mempcy_8) that process 1x, 4x, or 8x SIMD width chunks per iteration respectively. It tiles the data into chunks and uses OpenMP to parallelize the copying across multiple CPU cores.

The code is optimized for AVX-512 (16 floats per SIMD register) or AVX2 (8 floats per register) instruction sets. The tiled approach with 8-way SIMD unrolling reduces loop overhead and maximizes memory bandwidth utilization. Any remaining elements that don't fit into SIMD-aligned chunks are handled with scalar operations.
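The tiling scheme described above can be sketched as follows. This is an illustrative reconstruction, not the actual DeepSpeed source: the function and constant names (`tiled_memcpy`, `copy_tile`, `SIMD_WIDTH`) are hypothetical, and a plain `std::copy` stands in for the AVX-512/AVX2 load/store intrinsics a real implementation would use inside each tile.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch of the tiled, unrolled copy scheme described above.
constexpr std::size_t SIMD_WIDTH = 16;  // floats per AVX-512 register (8 for AVX2)

// Copy one tile of `count` floats. A real implementation would issue
// SIMD load/store intrinsics here instead of std::copy.
inline void copy_tile(float* dst, const float* src, std::size_t count)
{
    std::copy(src, src + count, dst);
}

void tiled_memcpy(float* dst, const float* src, std::size_t num_floats)
{
    const std::size_t unroll8 = 8 * SIMD_WIDTH;  // 8x-unrolled bulk chunk
    const std::size_t bulk = (num_floats / unroll8) * unroll8;

    // OpenMP spreads the bulk 8x chunks across CPU cores.
    #pragma omp parallel for
    for (long long i = 0; i < (long long)(bulk / unroll8); ++i) {
        copy_tile(dst + i * unroll8, src + i * unroll8, unroll8);
    }

    std::size_t pos = bulk;
    // Fall back to 4x SIMD-width chunks for what remains...
    for (; pos + 4 * SIMD_WIDTH <= num_floats; pos += 4 * SIMD_WIDTH)
        copy_tile(dst + pos, src + pos, 4 * SIMD_WIDTH);
    // ...then 1x SIMD-width chunks...
    for (; pos + SIMD_WIDTH <= num_floats; pos += SIMD_WIDTH)
        copy_tile(dst + pos, src + pos, SIMD_WIDTH);
    // ...and a scalar loop for the non-SIMD-aligned tail.
    for (; pos < num_floats; ++pos) dst[pos] = src[pos];
}
```

The three loop levels mirror the 8x/4x/1x helper hierarchy: only the large 8x chunks are worth parallelizing, while the 4x, 1x, and scalar passes clean up the remainder sequentially.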

Usage

Use this function when you need to copy large tensors in CPU memory and want better performance than PyTorch's default copy operation. It's particularly useful during checkpoint loading/saving operations or when moving data between pinned and non-pinned CPU buffers before I/O operations.

Code Reference

Source Location

Signature

int deepspeed_py_memcpy(torch::Tensor& dest, const torch::Tensor& src);

Import

#include "deepspeed_py_copy.h"

I/O Contract

Inputs

Name Type Required Description
dest torch::Tensor& Yes Destination tensor (will be modified in-place)
src const torch::Tensor& Yes Source tensor to copy from

Outputs

Name Type Description
return_code int 0 on success
dest torch::Tensor Updated with copied data from src

Usage Examples

import torch
from deepspeed.ops.aio import deepspeed_memcpy

# Create source and destination tensors
src = torch.randn(1024*1024, dtype=torch.float32)
dest = torch.empty_like(src)

# Fast SIMD-accelerated copy
deepspeed_memcpy(dest, src)

# Typical usage: copying to pinned buffer before I/O
regular_tensor = torch.randn(1024*1024)
pinned_tensor = io_handle.new_cpu_locked_tensor(1024*1024, regular_tensor)
deepspeed_memcpy(pinned_tensor, regular_tensor)
io_handle.async_pwrite(pinned_tensor, "/nvme/state.pt", 0)

// C++ usage
auto src = torch::randn({1024, 1024}, torch::kFloat32);
auto dest = torch::empty_like(src);

// Optimized copy
deepspeed_py_memcpy(dest, src);

// The function automatically:
// 1. Makes tensors contiguous
// 2. Uses 8x SIMD unrolling for bulk data
// 3. Falls back to 4x, then 1x for remaining data
// 4. Uses OpenMP to parallelize across cores
// 5. Handles non-aligned tail with scalar operations
