Implementation:InternLM Lmdeploy BatchCopy
| Knowledge Sources | |
|---|---|
| Domains | Memory_Management, GPU_Computing |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Implements a batched memory copy utility that coalesces multiple small copy operations into a single batched CUDA driver call (cuMemcpyBatchAsync), falling back to sequential copies when that API is unavailable.
Description
The BatchCopy class accumulates (source, destination, size) triples from individual copy requests. Inside a "group" scope, opened with the Group RAII helper returned by group(), a copy whose source and destination both begin exactly where the previous entry's source and destination end is merged into that entry. When Run() is called, the class looks up the cuMemcpyBatchAsync CUDA driver API (introduced in CUDA 12.8) through cudaGetDriverEntryPoint. If the entry point is available, it issues a single batched copy with a hint that the copies may overlap with compute; otherwise, it falls back to one core::Copy() call per entry. The class provides both typed-pointer and Buffer-based copy interfaces, and the batch is reset after each Run().
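The merging rule described above can be sketched on the host as follows. This is a minimal illustration, not the lmdeploy implementation: the class name CoalescingCopyList and its methods BeginGroup, EndGroup, Add, and pending are hypothetical, std::memcpy stands in for core::Copy(), and the choice to never merge with an entry recorded before the group was opened is an assumption.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for illustration only; this is not the lmdeploy class.
class CoalescingCopyList {
public:
    // Open a group: entries added from here on may merge with each other.
    void BeginGroup()
    {
        group_begin_ = src_.size();
        grouping_    = true;
    }

    // Close the group: later entries are recorded as-is.
    void EndGroup() { grouping_ = false; }

    // Record a copy of `bytes` bytes. Inside a group, extend the previous
    // entry when both source and destination start exactly where it ended.
    void Add(const void* src, std::size_t bytes, void* dst)
    {
        if (bytes == 0) {
            return;
        }
        const char* s        = static_cast<const char*>(src);
        char*       d        = static_cast<char*>(dst);
        const bool  has_prev = grouping_ && src_.size() > group_begin_;
        if (has_prev && src_.back() + size_.back() == s && dst_.back() + size_.back() == d) {
            size_.back() += bytes;
            return;
        }
        src_.push_back(s);
        dst_.push_back(d);
        size_.push_back(bytes);
    }

    // Number of entries that Run() would issue.
    std::size_t pending() const { return size_.size(); }

    // Sequential fallback path: one copy per recorded entry, then reset.
    void Run()
    {
        for (std::size_t i = 0; i < size_.size(); ++i) {
            std::memcpy(dst_[i], src_[i], size_[i]);
        }
        src_.clear();
        dst_.clear();
        size_.clear();
        group_begin_ = 0;
    }

private:
    std::vector<const char*> src_;
    std::vector<char*>       dst_;
    std::vector<std::size_t> size_;
    std::size_t              group_begin_ = 0;
    bool                     grouping_    = false;
};
```

In this sketch, adding (src, 2, dst) followed by (src + 2, 3, dst + 2) inside one group leaves a single 5-byte entry, while a later copy that starts after a gap is recorded as a separate entry.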
Usage
Used during KV-cache management and state permutation operations where many small, potentially contiguous memory copies need to be performed efficiently on the GPU. Merging adjacent copies reduces the number of entries, and batching reduces the number of driver API calls.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File (header): src/turbomind/core/copy.h
- File (impl): src/turbomind/core/copy.cc
- Lines: copy.h 1-140, copy.cc 1-119
Signature
namespace turbomind::core {
class BatchCopy {
public:
~BatchCopy();
BatchCopy();
BatchCopy(const BatchCopy&) = delete;
BatchCopy& operator=(const BatchCopy&) = delete;
class Group {
public:
~Group();
Group(BatchCopy& parent);
explicit constexpr operator bool() const noexcept;
};
Group group();
template<class T>
T* operator()(const T* src, ssize_t size, T* dst);
void operator()(const Buffer& src, ssize_t size, Ref<Buffer> dst_);
void Run();
Buffer_<BatchCopy*> buf();
};
} // namespace turbomind::core
Import
#include "src/turbomind/core/copy.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| src | const T* or const Buffer& | Yes | Source data pointer or buffer |
| size | ssize_t | Yes | Number of elements to copy |
| dst | T* or Ref<Buffer> | Yes | Destination data pointer or buffer reference |
Outputs
| Name | Type | Description |
|---|---|---|
| return | T* | Pointer past the end of the destination region (typed pointer variant) |
| (side effect) | void | Data is copied from source to destination on Run() |
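Because the typed-pointer variant returns the pointer just past the destination region, successive calls can be chained to pack several sources back to back. The sketch below only mirrors that call shape: QueuedCopy and Enqueue are hypothetical names introduced for illustration, and the queue is a plain std::vector rather than a BatchCopy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record of one queued copy (element count converted to bytes).
struct QueuedCopy {
    const void* src;
    void*       dst;
    std::size_t bytes;
};

// Hypothetical helper mirroring the typed-pointer call shape (src, size, dst):
// it queues the copy and returns the pointer just past the destination region.
template<class T>
T* Enqueue(std::vector<QueuedCopy>& queue, const T* src, std::ptrdiff_t size, T* dst)
{
    if (size > 0) {
        queue.push_back({src, dst, sizeof(T) * static_cast<std::size_t>(size)});
    }
    return dst + size;
}
```

With a destination cursor p, writing p = Enqueue(queue, a, 3, p) followed by p = Enqueue(queue, b, 2, p) places b directly after a in the destination, which is also the layout that makes the two destinations adjacent.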
Usage Examples
#include "src/turbomind/core/copy.h"
using namespace turbomind::core;
BatchCopy batch;
// Accumulate copies inside a group so that adjacent entries are coalesced.
// n, src_ptrs, sizes, and dst_ptrs are assumed to be defined by the caller.
if (auto g = batch.group()) {
    for (int i = 0; i < n; ++i) {
        batch(src_ptrs[i], sizes[i], dst_ptrs[i]);
    }
}
// Execute all accumulated copies in one batched operation
batch.Run();