Implementation:InternLM Lmdeploy Core ThreadMap
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Memory_Layout |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Compile-time thread-to-data mapping policies that determine how GPU threads are assigned to 2D tile elements for cooperative loading and storing.
Description
This header defines two thread mapping strategies for assigning threads within a CTA (Cooperative Thread Array) to elements of a 2D data tile. ThreadMapQ<C, S, AccessC, WarpCount> maps threads in a simple pattern where each warp covers C columns and iterates over S rows, suitable for query/key tiles. RakedThreadMap<DimC, DimS, AccessC, WarpCount, WarpThreadC> is a more flexible raked mapping that distributes warps along the S dimension and supports partial tiles when there is only one iteration in C. Both provide compile-time constants for iteration counts, footprints, deltas, and a get_offset(warp_id, lane_id) device function that returns the (column, row) starting offset for each thread.
Usage
Use these mappings to configure how threads cooperatively load tiles from global memory into shared memory or registers in GEMM and attention kernels.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/core/thread_map.h
Signature
template<int C, int S, int AccessC, int WarpCount>
struct ThreadMapQ {
static constexpr int kIterC, kIterS, kDeltaC, kDeltaS;
__device__ static int2 get_offset(int warp_id, int lane_id);
};
template<int DimC, int DimS, int AccessC, int WarpCount,
int WarpThreadC = lowbit(DimC) / AccessC>
struct RakedThreadMap {
static constexpr int kIterC, kIterS, kDeltaC, kDeltaS;
static constexpr bool kPartialC;
__device__ static int2 get_offset(int warp_id, int lane_id);
};
Import
#include "src/turbomind/kernels/core/thread_map.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| C / DimC | int | Yes | Column dimension of the tile |
| S / DimS | int | Yes | Row (sequence) dimension of the tile |
| AccessC | int | Yes | Number of contiguous columns accessed per thread |
| WarpCount | int | Yes | Number of warps in the CTA |
| warp_id | int | Yes | Warp index within the CTA |
| lane_id | int | Yes | Lane index within the warp (0-31) |
Outputs
| Name | Type | Description |
|---|---|---|
| get_offset return | int2 | (column_offset, row_offset) starting position for this thread |
| kIterC, kIterS | int (constexpr) | Number of iterations in each dimension |
Usage Examples
using namespace turbomind;
// Map 4 warps to a 128x64 tile, accessing 8 columns per thread
using TMap = RakedThreadMap<128, 64, 8, 4>;
int warp_id = threadIdx.x / 32;
int lane_id = threadIdx.x % 32;
int2 offset = TMap::get_offset(warp_id, lane_id);
// Load data at (offset.x, offset.y) with kIterC x kIterS iterations
for (int si = 0; si < TMap::kIterS; ++si) {
for (int ci = 0; ci < TMap::kIterC; ++ci) {
// load from (offset.x + ci * TMap::kDeltaC, offset.y + si * TMap::kDeltaS)
}
}