Implementation:InternLM Lmdeploy State

Knowledge Sources	InternLM_Lmdeploy
Domains	Tensor_Operations, KV_Cache
Last Updated	2026-02-07 15:00 GMT

Overview

Provides a double-buffered tensor state container and permutation-based warp/append functions for efficient KV-cache and sequence state management during inference.

Description

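The State struct holds two tensors (data_[2]) that act as a double buffer for ping-pong style state management. front() and back() access the two buffers, and Swap() exchanges them. An update reads the current state from the front buffer, writes the new state into the back buffer, and then calls Swap() so that the newly written buffer becomes the front. The swap itself is a constant-time exchange, so publishing an update does not add a copy on top of the writes already performed.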
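The ping-pong pattern can be illustrated with a minimal host-side sketch. DoubleBuffer and Step are hypothetical names introduced for this example, and std::vector stands in for Tensor; the mapping of front() to data_[0] and back() to data_[1] is an assumption, not something taken from state.h.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side analogue of State; this is not the lmdeploy type.
template<class T>
struct DoubleBuffer {
    std::vector<T> data_[2];

    explicit DoubleBuffer(std::size_t n): data_{std::vector<T>(n), std::vector<T>(n)} {}

    std::vector<T>& front() { return data_[0]; }  // buffer being read (assumed index)
    std::vector<T>& back() { return data_[1]; }   // buffer being written (assumed index)

    // Exchanges the two buffers. std::swap on vectors exchanges ownership of
    // the storage and does not copy elements.
    void Swap() { std::swap(data_[0], data_[1]); }
};

// One update step: derive the next state from the current one, then publish it.
inline void Step(DoubleBuffer<int>& s)
{
    for (std::size_t i = 0; i < s.front().size(); ++i) {
        s.back()[i] = s.front()[i] + 1;
    }
    s.Swap();
}
```

After Step returns, front() holds the updated values and back() holds the previous state, which the next step is free to overwrite.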

The file also provides several template Warp function overloads and an Append function that perform permutation-based data rearrangement using a caller-supplied copy functor:

Warp(a0, size0, perm, b1, copy) -- copies rows from source tensor a0 into destination b1 in the order given by permutation perm
Warp(a0, b1, size0, perm, c1, copy) -- fills destination c1 from two sources, choosing between a0 and b1 for each i based on whether perm[i] < size0
Warp(src0, offset0, size0, src1, offset1, perm0, dst, offsetd, copy) -- the variable-size form: rows have varying lengths and are located through offset arrays
Append(a0, a0_size, b0, c1, c1_offset, perm, size0, d1, d1_size, copy) -- merges existing state with new tokens, handling variable-size rows with a stride-based layout

These functions are designed to keep the number of cudaMemcpy calls and kernel launches to a minimum and to operate on a single stream.
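The two fixed-size Warp forms can be approximated on the host as a gather driven by a copy functor. Everything named here (Rows, DeferredCopy, WarpSketch) is hypothetical and exists only for illustration. The sketch assumes that perm[i] names the source row for output row i and that indices at or above size0 address the second source after subtracting size0; neither detail is confirmed by the signatures alone. The functor queues requests and performs them in Run(), which is one way a caller-supplied copy object could keep the number of launches low.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Rows = std::vector<std::vector<float>>;

// Hypothetical copy functor: records each request, performs all of them in Run().
struct DeferredCopy {
    struct Task {
        const float* src;
        std::size_t  n;
        float*       dst;
    };
    std::vector<Task> tasks;

    void operator()(const float* src, std::size_t n, float* dst) { tasks.push_back({src, n, dst}); }

    void Run()
    {
        for (const Task& t : tasks) {
            std::copy(t.src, t.src + t.n, t.dst);
        }
        tasks.clear();
    }
};

// Single-source gather: dst[i] = src[perm[i]]. dst rows must already be sized.
template<class Copy>
void WarpSketch(const Rows& src, const std::vector<int>& perm, Rows& dst, Copy& copy)
{
    for (std::size_t i = 0; i < perm.size(); ++i) {
        const std::vector<float>& row = src[perm[i]];
        copy(row.data(), row.size(), dst[i].data());
    }
}

// Two-source gather: indices below size0 read from a, the rest from b
// (assumed here to be addressed as b[perm[i] - size0]).
template<class Copy>
void WarpSketch(const Rows& a, const Rows& b, int size0, const std::vector<int>& perm, Rows& dst, Copy& copy)
{
    for (std::size_t i = 0; i < perm.size(); ++i) {
        const int                 j   = perm[i];
        const std::vector<float>& row = j < size0 ? a[j] : b[j - size0];
        copy(row.data(), row.size(), dst[i].data());
    }
}
```

Because the functor only records work, the destination is unchanged until Run() is called, which mirrors the Warp-then-Run order used in the usage example on this page.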

Usage

Used for managing per-sequence KV-cache states during continuous batching. When sequences are reordered, added, or removed between iterations, the Warp/Append functions efficiently rearrange state tensors according to the new permutation without redundant copies.
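To make the role of the permutation concrete, the sketch below builds one from the sequence ids of two consecutive iterations. BuildPerm is a hypothetical helper that does not appear in state.h, and the encoding it uses (surviving sequences map to their old row index, newly added sequences map to size0 plus their arrival order) is an assumption chosen to match the perm[i] < size0 test described above.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of lmdeploy.
// prev_ids: sequence ids in the row order of the old state.
// next_ids: sequence ids in the row order wanted for the next iteration.
// Returns perm such that perm[i] < prev_ids.size() refers to an old row and
// larger values refer to new sequences in order of appearance (assumed encoding).
inline std::vector<int> BuildPerm(const std::vector<long>& prev_ids, const std::vector<long>& next_ids)
{
    std::unordered_map<long, int> old_row;
    for (std::size_t i = 0; i < prev_ids.size(); ++i) {
        old_row[prev_ids[i]] = static_cast<int>(i);
    }

    const int size0 = static_cast<int>(prev_ids.size());
    int       n_new = 0;

    std::vector<int> perm;
    perm.reserve(next_ids.size());
    for (long id : next_ids) {
        auto it = old_row.find(id);
        if (it != old_row.end()) {
            perm.push_back(it->second);  // sequence survives: reuse its old row
        }
        else {
            perm.push_back(size0 + n_new);  // sequence is new: index into the new data
            ++n_new;
        }
    }
    return perm;
}
```

A removed sequence simply never appears in perm, so its old row is not copied into the destination.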

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/core/state.h
Lines: 1-152

Signature

namespace turbomind {

struct State {
    Tensor data_[2];

    State() = default;
    State(const Layout& layout, DataType dtype, const core::Device& device);

    Tensor& front();
    Tensor& back();
    void Swap();
};

template<class Copy>
void Warp(const Tensor& a0, int size0, const Buffer_<int>& perm,
          Tensor b1, Copy& copy);

template<class Copy>
void Warp(const Tensor& a0, const Tensor& b1, int size0,
          const Buffer_<int>& perm, Tensor c1, Copy& copy);

template<class Copy>
void Warp(const Tensor& src0, const Buffer_<int>& offset0, int size0,
          const Tensor& src1, const Buffer_<int>& offset1,
          const Buffer_<int>& perm0, Tensor dst, Buffer_<int> offsetd,
          Copy& copy);

template<class Copy>
void Append(const Tensor& a0, const Buffer_<int>& a0_size,
            const Tensor& b0, const Tensor& c1,
            const Buffer_<int>& c1_offset, const Buffer_<int>& perm,
            int size0, Tensor d1, Buffer_<int> d1_size, Copy& copy);

}  // namespace turbomind

Import

#include "src/turbomind/core/state.h"

I/O Contract

Inputs

Name	Type	Used by	Description
layout	const Layout&	State ctor	Shape descriptor for both buffers
dtype	DataType	State ctor	Element data type
device	const core::Device&	State ctor	Device placement
perm	const Buffer_<int>&	Warp/Append	Permutation indices mapping output positions to input positions
size0	int	Warp/Append	Size of the "old" source, used to distinguish old vs new data
copy	Copy&	Warp/Append	Copy functor (e.g., BatchCopy)

Outputs

Name	Type	Description
front()	Tensor&	The current front buffer
back()	Tensor&	The current back buffer
(side effect)	Tensor	Destination tensor populated by Warp/Append

Usage Examples

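#include "src/turbomind/core/state.h"

using namespace turbomind;

// Create double-buffered state for 32 sequences, 128 hidden dim
State kv_state(Layout({32, 128}), kFloat16, core::Device(kDEVICE));

// Access current and next buffers
Tensor& current = kv_state.front();
Tensor& next    = kv_state.back();

// Assumed to be provided by the caller (not defined in this snippet):
//   int          old_size;  // size of the "old" source
//   Buffer_<int> perm;      // maps output positions to input positions

// Rearrange state according to permutation, writing into the back buffer
BatchCopy copy;
Warp(current, old_size, perm, next, copy);
copy.Run();

// Exchange the buffers so the rearranged state is returned by front()
kv_state.Swap();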

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
