Implementation:InternLM Lmdeploy Core ThreadMap

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Memory_Layout
Last Updated	2026-02-07 15:00 GMT

Overview

Compile-time thread-to-data mapping policies that determine how GPU threads are assigned to 2D tile elements for cooperative loading and storing.

Description

This header defines two thread mapping strategies for assigning threads within a CTA (Cooperative Thread Array) to elements of a 2D data tile. ThreadMapQ<C, S, AccessC, WarpCount> maps threads in a simple pattern where each warp covers C columns and iterates over S rows, suitable for query/key tiles. RakedThreadMap<DimC, DimS, AccessC, WarpCount, WarpThreadC> is a more flexible raked mapping that distributes warps along the S dimension and supports partial tiles when there is only one iteration in C. Both provide compile-time constants for iteration counts, footprints, deltas, and a get_offset(warp_id, lane_id) device function that returns the (column, row) starting offset for each thread.

Usage

Use these mappings to configure how threads cooperatively load tiles from global memory into shared memory or registers in GEMM and attention kernels.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/core/thread_map.h

Signature

template<int C, int S, int AccessC, int WarpCount>
struct ThreadMapQ {
    static constexpr int kIterC, kIterS, kDeltaC, kDeltaS;
    __device__ static int2 get_offset(int warp_id, int lane_id);
};

template<int DimC, int DimS, int AccessC, int WarpCount,
         int WarpThreadC = lowbit(DimC) / AccessC>
struct RakedThreadMap {
    static constexpr int kIterC, kIterS, kDeltaC, kDeltaS;
    static constexpr bool kPartialC;
    __device__ static int2 get_offset(int warp_id, int lane_id);
};

Import

#include "src/turbomind/kernels/core/thread_map.h"

I/O Contract

Inputs

Name	Type	Required	Description
C / DimC	int	Yes	Column dimension of the tile
S / DimS	int	Yes	Row (sequence) dimension of the tile
AccessC	int	Yes	Number of contiguous columns accessed per thread
WarpCount	int	Yes	Number of warps in the CTA
warp_id	int	Yes	Warp index within the CTA
lane_id	int	Yes	Lane index within the warp (0-31)

Outputs

Name	Type	Description
get_offset return	int2	(column_offset, row_offset) starting position for this thread
kIterC, kIterS	int (constexpr)	Number of iterations in each dimension

Usage Examples

using namespace turbomind;

// Map 4 warps to a 128x64 tile, accessing 8 columns per thread
using TMap = RakedThreadMap<128, 64, 8, 4>;

int warp_id = threadIdx.x / 32;
int lane_id = threadIdx.x % 32;
int2 offset = TMap::get_offset(warp_id, lane_id);

// Load data at (offset.x, offset.y) with kIterC x kIterS iterations
for (int si = 0; si < TMap::kIterS; ++si) {
    for (int ci = 0; ci < TMap::kIterC; ++ci) {
        // load from (offset.x + ci * TMap::kDeltaC, offset.y + si * TMap::kDeltaS)
    }
}

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment