Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy Core ThreadMap

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Memory_Layout
Last Updated 2026-02-07 15:00 GMT

Overview

Compile-time thread-to-data mapping policies that determine how GPU threads are assigned to 2D tile elements for cooperative loading and storing.

Description

This header defines two thread mapping strategies for assigning threads within a CTA (Cooperative Thread Array) to elements of a 2D data tile. ThreadMapQ<C, S, AccessC, WarpCount> maps threads in a simple pattern where each warp covers C columns and iterates over S rows, suitable for query/key tiles. RakedThreadMap<DimC, DimS, AccessC, WarpCount, WarpThreadC> is a more flexible raked mapping that distributes warps along the S dimension and supports partial tiles when there is only one iteration in C. Both provide compile-time constants for iteration counts, footprints, deltas, and a get_offset(warp_id, lane_id) device function that returns the (column, row) starting offset for each thread.

Usage

Use these mappings to configure how threads cooperatively load tiles from global memory into shared memory or registers in GEMM and attention kernels.

Code Reference

Source Location

Signature

template<int C, int S, int AccessC, int WarpCount>
struct ThreadMapQ {
    static constexpr int kIterC, kIterS, kDeltaC, kDeltaS;
    __device__ static int2 get_offset(int warp_id, int lane_id);
};

template<int DimC, int DimS, int AccessC, int WarpCount,
         int WarpThreadC = lowbit(DimC) / AccessC>
struct RakedThreadMap {
    static constexpr int kIterC, kIterS, kDeltaC, kDeltaS;
    static constexpr bool kPartialC;
    __device__ static int2 get_offset(int warp_id, int lane_id);
};

Import

#include "src/turbomind/kernels/core/thread_map.h"

I/O Contract

Inputs

Name Type Required Description
C / DimC int Yes Column dimension of the tile
S / DimS int Yes Row (sequence) dimension of the tile
AccessC int Yes Number of contiguous columns accessed per thread
WarpCount int Yes Number of warps in the CTA
warp_id int Yes Warp index within the CTA
lane_id int Yes Lane index within the warp (0-31)

Outputs

Name Type Description
get_offset return int2 (column_offset, row_offset) starting position for this thread
kIterC, kIterS int (constexpr) Number of iterations in each dimension

Usage Examples

using namespace turbomind;

// Map 4 warps to a 128x64 tile, accessing 8 columns per thread
using TMap = RakedThreadMap<128, 64, 8, 4>;

int warp_id = threadIdx.x / 32;
int lane_id = threadIdx.x % 32;
int2 offset = TMap::get_offset(warp_id, lane_id);

// Load data at (offset.x, offset.y) with kIterC x kIterS iterations
for (int si = 0; si < TMap::kIterS; ++si) {
    for (int ci = 0; ci < TMap::kIterC; ++ci) {
        // load from (offset.x + ci * TMap::kDeltaC, offset.y + si * TMap::kDeltaS)
    }
}

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment