Implementation:InternLM Lmdeploy BatchCopy

Knowledge Sources	InternLM_Lmdeploy
Domains	Memory_Management, GPU_Computing
Last Updated	2026-02-07 15:00 GMT

Overview

Implements a batched memory copy utility that coalesces multiple small copy operations into a single batched CUDA driver call (cuMemcpyBatchAsync), falling back to sequential copies when that API is unavailable.

Description

The BatchCopy class accumulates (source, destination, size) triples from individual copy requests. Within a "group" scope, which is opened with the Group RAII helper, adjacent copies are merged automatically: a copy is adjacent to the previous one when both its source and its destination begin exactly where the previous source and destination end. The class provides both typed-pointer and Buffer-based copy interfaces.

When Run() is called, the class uses cudaGetDriverEntryPoint to check at runtime whether the cuMemcpyBatchAsync CUDA driver API (introduced in CUDA 12.8) is available. If it is, Run() issues a single batched copy with a hint that the copy should overlap with compute; otherwise, it falls back to one core::Copy() call per accumulated entry. The batch is reset after each Run().
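The accumulate, merge, run, and reset cycle described above can be illustrated with a minimal host-only sketch. This is not the lmdeploy implementation: MiniBatchCopy, CopyEntry, and BatchedFn are hypothetical names, std::memcpy stands in for core::Copy(), an optional callback stands in for the batched driver call, and group scoping is omitted so that merging is always enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical stand-in for one accumulated copy request (byte granularity).
struct CopyEntry {
    const char* src;
    char*       dst;
    std::size_t bytes;
};

// Hypothetical "batched backend": handles every pending entry in one call.
using BatchedFn = std::function<void(const std::vector<CopyEntry>&)>;

class MiniBatchCopy {
public:
    // Record a copy. If both src and dst continue the previous entry,
    // extend that entry instead of adding a new one.
    void Add(const char* src, std::size_t bytes, char* dst)
    {
        assert(src != nullptr && dst != nullptr);
        if (bytes == 0) {
            return;
        }
        if (!entries_.empty()) {
            CopyEntry& last = entries_.back();
            if (last.src + last.bytes == src && last.dst + last.bytes == dst) {
                last.bytes += bytes;
                return;
            }
        }
        entries_.push_back(CopyEntry{src, dst, bytes});
    }

    // Flush pending entries: one call to the batched backend when it is
    // provided, otherwise one memcpy per entry. Returns the number of
    // backend calls issued and always resets the pending list.
    int Run(const BatchedFn& batched = nullptr)
    {
        int calls = 0;
        if (!entries_.empty()) {
            if (batched) {
                batched(entries_);
                calls = 1;
            }
            else {
                for (const CopyEntry& e : entries_) {
                    std::memcpy(e.dst, e.src, e.bytes);
                    ++calls;
                }
            }
        }
        entries_.clear();
        return calls;
    }

    std::size_t pending() const
    {
        return entries_.size();
    }

private:
    std::vector<CopyEntry> entries_;
};
```

In this sketch, two requests that are contiguous in both source and destination collapse into one entry, so the fallback path issues fewer individual copies, while the batched path always issues exactly one call regardless of the entry count.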

Usage

Used during KV-cache management and state permutation operations, where many small, potentially contiguous memory copies need to be performed efficiently on the GPU. The grouping mechanism reduces the number of individual copies, and therefore driver API calls, that must be issued.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File (header): src/turbomind/core/copy.h
File (impl): src/turbomind/core/copy.cc
Lines: copy.h 1-140, copy.cc 1-119

Signature

namespace turbomind::core {

class BatchCopy {
public:
    ~BatchCopy();
    BatchCopy();

    BatchCopy(const BatchCopy&) = delete;
    BatchCopy& operator=(const BatchCopy&) = delete;

    class Group {
    public:
        ~Group();
        Group(BatchCopy& parent);
        explicit constexpr operator bool() const noexcept;
    };

    Group group();

    template<class T>
    T* operator()(const T* src, ssize_t size, T* dst);

    void operator()(const Buffer& src, ssize_t size, Ref<Buffer> dst_);

    void Run();

    Buffer_<BatchCopy*> buf();
};

}  // namespace turbomind::core

Import

#include "src/turbomind/core/copy.h"

I/O Contract

Inputs

Name	Type	Required	Description
src	const T* or const Buffer&	Yes	Source data pointer or buffer
size	ssize_t	Yes	Number of elements to copy
dst	T* or Ref<Buffer>	Yes	Destination data pointer or buffer reference

Outputs

Name	Type	Description
return	T*	Pointer past the end of the destination region (typed pointer variant)
(side effect)	void	Data is copied from source to destination on Run()
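Because the typed-pointer variant returns a pointer past the end of the destination region, successive calls can be chained to pack several sources back to back. The sketch below mirrors only that return-value contract: copy_and_advance is a hypothetical host-only helper, and unlike BatchCopy it copies immediately with std::memcpy rather than deferring the work until Run().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper with the same (src, size, dst) argument order as the
// typed-pointer interface. It copies `size` elements and returns dst + size.
template<class T>
T* copy_and_advance(const T* src, std::ptrdiff_t size, T* dst)
{
    static_assert(std::is_trivially_copyable<T>::value, "memcpy requires trivially copyable T");
    assert(size >= 0);
    if (size > 0) {
        std::memcpy(dst, src, sizeof(T) * static_cast<std::size_t>(size));
    }
    return dst + size;  // one past the last element written
}
```

Feeding each returned pointer into the next call as dst produces destinations that are contiguous by construction, which is the pattern under which adjacent entries become candidates for merging.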

Usage Examples

#include "src/turbomind/core/copy.h"

using namespace turbomind::core;

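// Assumed to be defined by the caller: n copy requests, where src_ptrs[i]
// and dst_ptrs[i] each point to at least sizes[i] elements.
BatchCopy batch;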

// Accumulate copies within a group for coalescing
if (auto g = batch.group()) {
    for (int i = 0; i < n; ++i) {
        batch(src_ptrs[i], sizes[i], dst_ptrs[i]);
    }
}

// Execute all accumulated copies in one batched operation
batch.Run();
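The if (auto g = batch.group()) idiom works because Group declares an explicit conversion to bool, which the condition of an if statement is allowed to invoke. A minimal sketch of such a scope guard follows. Accumulator, Scope, and grouping() are hypothetical names, and the depth counter is an assumption about what a guard of this kind might toggle, not a description of what Group does internally. The sketch relies on C++17 guaranteed copy elision to return the non-copyable guard by value.

```cpp
#include <cassert>

class Accumulator {
public:
    // RAII guard: grouping is active for as long as a Scope object is alive.
    class Scope {
    public:
        explicit Scope(Accumulator& parent): parent_(parent)
        {
            ++parent_.depth_;
        }
        ~Scope()
        {
            --parent_.depth_;
        }
        Scope(const Scope&) = delete;
        Scope& operator=(const Scope&) = delete;

        // Always true; it exists only so the guard can be declared in an
        // if-condition, which limits its lifetime to the if body.
        explicit operator bool() const noexcept
        {
            return true;
        }

    private:
        Accumulator& parent_;
    };

    Scope scope()
    {
        return Scope{*this};
    }

    bool grouping() const
    {
        assert(depth_ >= 0);
        return depth_ > 0;
    }

private:
    int depth_ = 0;
};
```

Declaring the guard in the condition keeps it alive for exactly the braces that follow, so the grouped region is visible in the code's structure and ends automatically, including when an exception is thrown inside the body.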

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
