Implementation:InternLM Lmdeploy State
| Knowledge Sources | |
|---|---|
| Domains | Tensor_Operations, KV_Cache |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Provides a double-buffered tensor state container and permutation-based warp/append functions for efficient KV-cache and sequence state management during inference.
Description
The State struct holds two tensors (data_[2]) that act as a double buffer for ping-pong style state management. front() and back() return the two buffers, and Swap() exchanges them. Switching buffers therefore costs constant time: the next state is written into the back buffer while the front buffer is still being read, and a swap then makes it current.
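A host-only sketch can make the ping-pong pattern concrete. It is an illustration under stated assumptions, not the turbomind code: HostState is a hypothetical name, std::vector<float> stands in for turbomind's Tensor, and only front(), back(), and Swap() are mirrored.

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side model of State: std::vector<float> stands in for
// turbomind's Tensor so that only the ping-pong logic is shown.
struct HostState {
    std::array<std::vector<float>, 2> data_;

    explicit HostState(std::size_t n)
    {
        data_[0].assign(n, 0.0f);
        data_[1].assign(n, 0.0f);
    }

    std::vector<float>& front() { return data_[0]; }  // read this iteration's state
    std::vector<float>& back() { return data_[1]; }   // write the next state here

    // Constant time: swapping two vectors exchanges their heap buffers,
    // so no element is copied.
    void Swap() { std::swap(data_[0], data_[1]); }
};
```

Because Swap() moves buffer ownership instead of copying elements, whatever was written into back() is visible through front() after the swap, and the old front buffer becomes the scratch space for the next iteration.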
The file also provides three templated Warp overloads and an Append function that rearrange data according to a permutation, using a caller-supplied copy functor:
- Warp(a0, size0, perm, b1, copy): copies rows from source tensor a0 to destination b1 according to permutation perm.
- Warp(a0, b1, size0, perm, c1, copy): fills c1 by selecting, for each row, between sources a0 and b1 depending on whether perm[i] < size0.
- Warp with variable-size offset arrays: handles variable-length data using offset indexing.
- Append: merges existing state with new tokens, handling variable-size rows with a stride-based layout.
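As a reference for what the single-source Warp overload computes, here is a hypothetical host version (WarpRows is an illustrative name). It assumes row-major data with row_len elements per row, and it assumes that output rows whose perm entry is not below size0 are new rows that this call leaves untouched; neither detail is confirmed by the description above. The real function hands its copies to the Copy functor instead of writing elements directly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host reference for Warp(a0, size0, perm, b1, copy).
// Assumed: row-major rows of row_len elements; perm[i] >= size0 marks
// a new row that this pass leaves untouched.
void WarpRows(const std::vector<float>& a0, int size0, const std::vector<int>& perm,
              std::vector<float>& b1, std::size_t row_len)
{
    assert(b1.size() >= perm.size() * row_len);  // room for every output row
    for (std::size_t i = 0; i < perm.size(); ++i) {
        const int src = perm[i];
        if (src < 0 || src >= size0) {
            continue;  // not an old row: skip it
        }
        for (std::size_t k = 0; k < row_len; ++k) {
            b1[i * row_len + k] = a0[static_cast<std::size_t>(src) * row_len + k];
        }
    }
}
```

With three old rows {1, 2}, {3, 4}, {5, 6} and perm = {2, 0, 3}, output row 0 receives {5, 6}, row 1 receives {1, 2}, and row 2 is skipped because 3 is not below size0.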
These functions are designed to keep the number of cudaMemcpy calls and kernel launches to a minimum and to run on a single stream.
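One way a copy functor can keep launches to a minimum is to record copy requests and execute them together. The class below is a hypothetical host analogue in that spirit; its name, the (src, dst, bytes) call signature, and the Run() return value are assumptions, not BatchCopy's actual interface.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical copy functor: calls are only recorded, and Run() executes
// the whole batch at once. This host version loops over the recorded
// descriptors with memcpy.
class HostBatchCopy {
public:
    // Queue one copy of `bytes` bytes from src to dst.
    void operator()(const void* src, void* dst, std::size_t bytes)
    {
        jobs_.push_back({src, dst, bytes});
    }

    // Execute every pending copy, clear the queue, and return how many ran.
    std::size_t Run()
    {
        for (const Job& j : jobs_) {
            std::memcpy(j.dst, j.src, j.bytes);
        }
        const std::size_t n = jobs_.size();
        jobs_.clear();
        return n;
    }

private:
    struct Job {
        const void* src;
        void*       dst;
        std::size_t bytes;
    };
    std::vector<Job> jobs_;
};
```

On a GPU, the recorded descriptors could be serviced by one kernel launch on a single stream instead of one cudaMemcpy per row; deferring execution until Run() is what makes that batching possible.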
Usage
Used for managing per-sequence KV-cache states during continuous batching. When sequences are reordered, added, or removed between iterations, the Warp/Append functions efficiently rearrange state tensors according to the new permutation without redundant copies.
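Before the Warp/Append functions run, the caller needs a permutation describing the new batch. The helper below is a hypothetical sketch of how one could be derived from sequence IDs; BuildPerm and the ID vectors are illustrative, and the encoding (indices below size0 refer to the old state, larger indices to incoming rows in order of arrival) is an assumption based on the size0 description on this page.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: derive a Warp-style permutation when moving from the
// previous batch (old_ids) to the next one (new_ids). Assumed convention:
// perm[i] < size0 points at row perm[i] of the old state; perm[i] >= size0
// points at row perm[i] - size0 of the incoming data. Sequences missing from
// new_ids are dropped simply by never being referenced.
std::vector<int> BuildPerm(const std::vector<int>& old_ids, const std::vector<int>& new_ids)
{
    std::unordered_map<int, int> old_pos;  // sequence id -> row in old batch
    for (std::size_t j = 0; j < old_ids.size(); ++j) {
        old_pos[old_ids[j]] = static_cast<int>(j);
    }
    const int        size0    = static_cast<int>(old_ids.size());
    int              next_new = 0;  // running index among newly arrived rows
    std::vector<int> perm;
    perm.reserve(new_ids.size());
    for (int id : new_ids) {
        auto it = old_pos.find(id);
        perm.push_back(it != old_pos.end() ? it->second : size0 + next_new++);
    }
    return perm;
}
```

For an old batch {10, 11, 12} where sequence 11 has finished and 13 and 14 arrive, BuildPerm({10, 11, 12}, {12, 13, 10, 14}) yields {2, 3, 0, 4}.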
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/core/state.h
- Lines: 1-152
Signature
```cpp
namespace turbomind {

struct State {
    Tensor data_[2];

    State() = default;
    State(const Layout& layout, DataType dtype, const core::Device& device);

    Tensor& front();
    Tensor& back();
    void    Swap();
};

template<class Copy>
void Warp(const Tensor& a0, int size0, const Buffer_<int>& perm,
          Tensor b1, Copy& copy);

template<class Copy>
void Warp(const Tensor& a0, const Tensor& b1, int size0,
          const Buffer_<int>& perm, Tensor c1, Copy& copy);

template<class Copy>
void Warp(const Tensor& src0, const Buffer_<int>& offset0, int size0,
          const Tensor& src1, const Buffer_<int>& offset1,
          const Buffer_<int>& perm0, Tensor dst, Buffer_<int> offsetd,
          Copy& copy);

template<class Copy>
void Append(const Tensor& a0, const Buffer_<int>& a0_size,
            const Tensor& b0, const Tensor& c1,
            const Buffer_<int>& c1_offset, const Buffer_<int>& perm,
            int size0, Tensor d1, Buffer_<int> d1_size, Copy& copy);

}  // namespace turbomind
```
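The offset-based Warp overload is the least self-explanatory signature above. The hypothetical host reference below (WarpVarLen is an illustrative name) shows one plausible reading, under assumptions not confirmed by state.h: offset arrays are exclusive prefix sums with one more entry than rows, perm0[i] < size0 selects row perm0[i] of src0 while other values select row perm0[i] - size0 of src1, and offsetd is rebuilt as the prefix sum of the packed output. Unlike the real function, it grows dst instead of writing into a preallocated tensor.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host reference for the offset-based Warp overload.
// Row r of a source spans [offset[r], offset[r + 1]).
void WarpVarLen(const std::vector<float>& src0, const std::vector<int>& offset0, int size0,
                const std::vector<float>& src1, const std::vector<int>& offset1,
                const std::vector<int>& perm0, std::vector<float>& dst,
                std::vector<int>& offsetd)
{
    dst.clear();
    offsetd.assign(1, 0);  // output prefix sums start at 0
    for (int p : perm0) {
        const bool                old = p < size0;  // old row or incoming row?
        const std::vector<float>& src = old ? src0 : src1;
        const std::vector<int>&   off = old ? offset0 : offset1;
        const int                 r   = old ? p : p - size0;
        // Pack the selected variable-length row right after the previous one.
        dst.insert(dst.end(), src.begin() + off[r], src.begin() + off[r + 1]);
        offsetd.push_back(static_cast<int>(dst.size()));
    }
}
```

With old rows {1} and {2, 3}, one new row {9, 8, 7}, size0 = 2, and perm0 = {1, 2, 0}, the packed output is {2, 3, 9, 8, 7, 1} with offsets {0, 2, 5, 6}.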
Import
```cpp
#include "src/turbomind/core/state.h"
```
I/O Contract
Inputs
| Name | Type | Required By | Description |
|---|---|---|---|
| layout | const Layout& | State ctor | Shape descriptor for both buffers |
| dtype | DataType | State ctor | Element data type |
| device | const core::Device& | State ctor | Device placement |
| perm | const Buffer_<int>& | Warp/Append | Permutation indices mapping output positions to input positions |
| size0 | int | Warp/Append | Size of the "old" source, used to distinguish old vs new data |
| copy | Copy& | Warp/Append | Copy functor (e.g., BatchCopy) |
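The perm and size0 inputs combine most directly in the two-source Warp overload, where each output row is taken from either a0 or b1. A hypothetical host reference (WarpTwoSources is an illustrative name), assuming row-major data with a fixed row length:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host reference for Warp(a0, b1, size0, perm, c1, copy).
// Assumed: row-major rows of row_len elements; perm[i] < size0 selects
// row perm[i] of a0, otherwise row perm[i] - size0 of b1.
void WarpTwoSources(const std::vector<float>& a0, const std::vector<float>& b1, int size0,
                    const std::vector<int>& perm, std::vector<float>& c1, std::size_t row_len)
{
    assert(c1.size() >= perm.size() * row_len);  // room for every output row
    for (std::size_t i = 0; i < perm.size(); ++i) {
        const bool  old = perm[i] < size0;
        const auto& src = old ? a0 : b1;  // choose the source tensor per row
        const auto  row = static_cast<std::size_t>(old ? perm[i] : perm[i] - size0);
        for (std::size_t k = 0; k < row_len; ++k) {
            c1[i * row_len + k] = src[row * row_len + k];
        }
    }
}
```

For example, with old rows {1, 1} and {2, 2} in a0, one new row {9, 9} in b1, size0 = 2, and perm = {2, 1, 0}, c1 becomes {9, 9, 2, 2, 1, 1}.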
Outputs
| Name | Type | Description |
|---|---|---|
| front() | Tensor& | The current front buffer |
| back() | Tensor& | The current back buffer |
| (side effect) | Tensor | Destination tensor populated by Warp/Append |
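Append's parameter list is easier to read against a simplified model. The sketch below is hypothetical: AppendRows is an illustrative name, the b0 argument is omitted because its role is not described on this page, and it assumes that every row owns stride slots with a size array counting the valid ones, that output row i starts from old row perm[i] when perm[i] < size0 (empty otherwise), and that its new tokens are c1[c1_offset[i]] through c1[c1_offset[i + 1] - 1].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of Append (b0 omitted). Each row owns
// `stride` slots; a0_size / d1_size count the valid slots per row.
void AppendRows(const std::vector<int>& a0, const std::vector<int>& a0_size,
                const std::vector<int>& c1, const std::vector<int>& c1_offset,
                const std::vector<int>& perm, int size0, std::size_t stride,
                std::vector<int>& d1, std::vector<int>& d1_size)
{
    d1_size.assign(perm.size(), 0);
    for (std::size_t i = 0; i < perm.size(); ++i) {
        const bool old   = perm[i] < size0;
        const int  old_n = old ? a0_size[perm[i]] : 0;             // carried-over entries
        const int  new_n = c1_offset[i + 1] - c1_offset[i];        // appended entries
        assert(static_cast<std::size_t>(old_n + new_n) <= stride);  // row must fit

        int*        out = &d1[i * stride];
        std::size_t n   = 0;
        if (old) {  // copy the valid prefix of the old row
            const std::size_t base = static_cast<std::size_t>(perm[i]) * stride;
            for (int k = 0; k < old_n; ++k) {
                out[n++] = a0[base + static_cast<std::size_t>(k)];
            }
        }
        for (int k = c1_offset[i]; k < c1_offset[i + 1]; ++k) {  // then the new tokens
            out[n++] = c1[static_cast<std::size_t>(k)];
        }
        d1_size[i] = old_n + new_n;
    }
}
```

With stride 4, old rows {1, 2} and {3}, perm = {1, 2}, and size0 = 2, output row 0 continues old row 1 as {3, 5}, while output row 1 is a new sequence holding only its tokens {6, 7}.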
Usage Examples
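```cpp
#include "src/turbomind/core/state.h"
using namespace turbomind;

// Create a double-buffered state: 32 sequences x 128 hidden dim, fp16, on device
State kv_state(Layout({32, 128}), kFloat16, core::Device(kDEVICE));

// Access the current (front) and next (back) buffers
Tensor& current = kv_state.front();
Tensor& next    = kv_state.back();

// `old_size` (int, rows in the previous batch) and `perm` (Buffer_<int>,
// output row -> source row) are assumed to be prepared by the caller for
// the new batch; they are not defined in state.h.
BatchCopy copy;
Warp(current, old_size, perm, next, copy);
copy.Run();  // run the copies that Warp handed to the copy functor

// The rearranged state becomes the front buffer for the next iteration
kv_state.Swap();
```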