Implementation:InternLM Lmdeploy BatchCopy

From Leeroopedia
Revision as of 15:13, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/InternLM_Lmdeploy_BatchCopy.md)


Knowledge Sources
Domains Memory_Management, GPU_Computing
Last Updated 2026-02-07 15:00 GMT

Overview

Implements a batched memory copy utility that coalesces multiple small copy operations into a single batched CUDA driver call (cuMemcpyBatchAsync) or, when that API is unavailable, falls back to sequential copies.

Description

The BatchCopy class accumulates (source, destination, size) triples from individual copy requests. Within a "group" scope, opened through the Group RAII helper, adjacent copies are merged automatically: when both the source and the destination of one request end exactly where those of the next request begin, the two requests become a single entry. When Run() is called, the class checks whether the cuMemcpyBatchAsync CUDA driver API (introduced in CUDA 12.8) is available by querying cudaGetDriverEntryPoint. If it is, Run() issues a single batched copy with a hint that the copies may overlap with compute; otherwise, it falls back to one core::Copy() call per entry. The class provides both typed-pointer and Buffer-based copy interfaces. The batch is reset after each Run().
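The dispatch decision made in Run() can be sketched on the host without any CUDA dependency. The snippet below is a minimal illustration, not the lmdeploy implementation: CopyEntry, BatchedCopyFn, RunStats, host_batched_copy, lookup_batched_copy, run_copies, and dispatch_demo are all hypothetical names, std::memcpy stands in for both the batched driver call and core::Copy(), and the boolean passed to lookup_batched_copy stands in for the result of the cudaGetDriverEntryPoint query.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One pending copy: source, destination, and size in bytes (hypothetical type).
struct CopyEntry {
    const void* src;
    void*       dst;
    std::size_t bytes;
};

// Hypothetical signature for a "copy everything in one call" entry point.
using BatchedCopyFn = void (*)(const CopyEntry* entries, std::size_t count);

// Counts how many calls of each kind a run issued.
struct RunStats {
    int batched_calls = 0;
    int single_calls  = 0;
};

// Host stand-in for the batched driver call.
inline void host_batched_copy(const CopyEntry* entries, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        std::memcpy(entries[i].dst, entries[i].src, entries[i].bytes);
    }
}

// Stand-in for the runtime lookup: returns nullptr when the API is missing.
inline BatchedCopyFn lookup_batched_copy(bool driver_supports_it)
{
    return driver_supports_it ? &host_batched_copy : nullptr;
}

// Issues one batched call when possible, otherwise one copy per entry,
// and resets the pending list either way.
inline RunStats run_copies(std::vector<CopyEntry>& pending, BatchedCopyFn batched)
{
    RunStats stats;
    if (pending.empty()) {
        return stats;
    }
    if (batched != nullptr) {
        batched(pending.data(), pending.size());
        stats.batched_calls = 1;
    }
    else {
        for (const CopyEntry& e : pending) {
            std::memcpy(e.dst, e.src, e.bytes);  // sequential fallback
            ++stats.single_calls;
        }
    }
    pending.clear();
    return stats;
}

// Runs two copies through either path and checks that the data arrived.
inline RunStats dispatch_demo(bool driver_supports_it)
{
    const int a[2]     = {1, 2};
    const int b[3]     = {3, 4, 5};
    int       out_a[2] = {};
    int       out_b[3] = {};

    std::vector<CopyEntry> pending = {{a, out_a, sizeof(a)}, {b, out_b, sizeof(b)}};
    RunStats stats = run_copies(pending, lookup_batched_copy(driver_supports_it));

    assert(std::memcmp(a, out_a, sizeof(a)) == 0);
    assert(std::memcmp(b, out_b, sizeof(b)) == 0);
    assert(pending.empty());
    return stats;
}
```

In this sketch both paths leave the destination buffers in the same state; only the number of calls differs, which is the property the fallback relies on.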

Usage

Used during KV-cache management and state permutation operations where many small, potentially contiguous memory copies need to be performed efficiently on the GPU. The grouping mechanism minimizes the number of driver API calls.
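To make the effect of grouping concrete, the following host-only sketch models the merge rule under stated assumptions. BatchCopySketch, its add() and pending() methods, and demo_pending_entries are hypothetical and are not part of lmdeploy; sizes are in bytes, std::memcpy replaces the GPU copy, and the rule that merging only applies to entries recorded inside the same live Group is an assumption drawn from the description rather than from the source code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-only model of BatchCopy (byte-sized copies, no CUDA).
class BatchCopySketch {
    struct Entry {
        const char* src;
        char*       dst;
        std::size_t size;
    };
    std::vector<Entry> entries_;
    std::size_t        group_begin_ = 0;      // index of first entry in the open group
    bool               grouping_    = false;  // true while a Group is alive

public:
    // RAII scope: adjacent copies are merged only while a Group is alive.
    class Group {
        BatchCopySketch& parent_;

    public:
        explicit Group(BatchCopySketch& parent): parent_{parent}
        {
            parent_.group_begin_ = parent_.entries_.size();
            parent_.grouping_    = true;
        }
        ~Group() { parent_.grouping_ = false; }
        Group(const Group&)            = delete;
        Group& operator=(const Group&) = delete;
    };

    // Records a copy; extends the previous entry when both ranges are adjacent.
    void add(const char* src, std::size_t size, char* dst)
    {
        if (size == 0) {
            return;  // nothing to record
        }
        if (grouping_ && entries_.size() > group_begin_) {
            Entry& last = entries_.back();
            if (last.src + last.size == src && last.dst + last.size == dst) {
                last.size += size;
                return;
            }
        }
        entries_.push_back({src, dst, size});
    }

    std::size_t pending() const { return entries_.size(); }

    // Performs every recorded copy, then resets the batch.
    void Run()
    {
        for (const Entry& e : entries_) {
            std::memcpy(e.dst, e.src, e.size);
        }
        entries_.clear();
    }
};

// Enqueues three adjacent 4-byte chunks plus one detached chunk and returns
// the number of entries that were pending just before Run().
inline std::size_t demo_pending_entries(bool use_group)
{
    char src[32];
    char dst[32] = {};
    for (int i = 0; i < 32; ++i) {
        src[i] = static_cast<char>('a' + i % 26);
    }
    BatchCopySketch batch;
    auto enqueue = [&] {
        batch.add(src + 0, 4, dst + 0);
        batch.add(src + 4, 4, dst + 4);    // adjacent to the previous copy
        batch.add(src + 8, 4, dst + 8);    // adjacent to the previous copy
        batch.add(src + 20, 4, dst + 16);  // source is not adjacent
    };
    if (use_group) {
        BatchCopySketch::Group g{batch};
        enqueue();
    }
    else {
        enqueue();
    }
    const std::size_t n = batch.pending();
    batch.Run();
    assert(std::memcmp(dst, src, 12) == 0);
    assert(std::memcmp(dst + 16, src + 20, 4) == 0);
    assert(batch.pending() == 0);
    return n;
}
```

With the group open, the three adjacent chunks collapse into one entry and the detached chunk stays separate, so two entries reach Run() instead of four.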

Code Reference

Source Location

Signature

namespace turbomind::core {

class BatchCopy {
public:
    ~BatchCopy();
    BatchCopy();

    BatchCopy(const BatchCopy&) = delete;
    BatchCopy& operator=(const BatchCopy&) = delete;

    class Group {
    public:
        ~Group();
        Group(BatchCopy& parent);
        explicit constexpr operator bool() const noexcept;
    };

    Group group();

    template<class T>
    T* operator()(const T* src, ssize_t size, T* dst);

    void operator()(const Buffer& src, ssize_t size, Ref<Buffer> dst_);

    void Run();

    Buffer_<BatchCopy*> buf();
};

}  // namespace turbomind::core

Import

#include "src/turbomind/core/copy.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| src | const T* or const Buffer& | Yes | Source data pointer or buffer |
| size | ssize_t | Yes | Number of elements to copy |
| dst | T* or Ref<Buffer> | Yes | Destination data pointer or buffer reference |

Outputs

| Name | Type | Description |
|------|------|-------------|
| return | T* | Pointer past the end of the destination region (typed pointer variant) |
| (side effect) | void | Data is copied from source to destination on Run() |
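Because the typed-pointer variant returns a pointer past the end of the destination region, consecutive requests can be chained to pack several arrays back to back. The sketch below only illustrates that return-value contract: CopyRecorder and pack_demo are hypothetical, std::ptrdiff_t is used as a portable stand-in for ssize_t, and the conversion from element count to bytes via sizeof(T) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical recorder that mimics the typed call operator's contract:
// it stores the request and returns the pointer past the destination region.
struct CopyRecorder {
    struct Entry {
        const void* src;
        void*       dst;
        std::size_t bytes;
    };
    std::vector<Entry> entries;

    template<class T>
    T* operator()(const T* src, std::ptrdiff_t size, T* dst)
    {
        // size counts elements; the byte conversion is an assumption.
        entries.push_back({src, dst, sizeof(T) * static_cast<std::size_t>(size)});
        return dst + size;
    }
};

// Packs two arrays into one output buffer by feeding each returned pointer
// into the next request; returns how many elements were laid out.
inline std::ptrdiff_t pack_demo()
{
    const float a[3]   = {1.f, 2.f, 3.f};
    const float b[2]   = {4.f, 5.f};
    float       out[8] = {};

    CopyRecorder rec;
    float*       cursor = out;
    cursor = rec(a, 3, cursor);  // cursor now points at out + 3
    cursor = rec(b, 2, cursor);  // cursor now points at out + 5

    assert(rec.entries.size() == 2);
    assert(rec.entries[1].dst == out + 3);
    return cursor - out;
}
```

Chaining this way keeps the destination regions adjacent, which is the layout that the grouping mechanism can coalesce when the sources are adjacent as well.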

Usage Examples

#include "src/turbomind/core/copy.h"

using namespace turbomind::core;

BatchCopy batch;

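// Accumulate copies within a group so that adjacent ones are coalesced.
// n, src_ptrs, sizes (element counts), and dst_ptrs are assumed to be
// provided by the caller.
if (auto g = batch.group()) {
    for (int i = 0; i < n; ++i) {
        batch(src_ptrs[i], sizes[i], dst_ptrs[i]);
    }
}  // g is destroyed here, which closes the group scope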

// Execute all accumulated copies in one batched operation
batch.Run();

Related Pages
