Implementation:InternLM Lmdeploy Gemm IteratorSm70
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, GEMM |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Global memory iterator for Volta (SM70) architecture that loads matrix tiles from global memory to registers using predicated loads with __ldcs streaming hints.
Description
GmemIteratorSm70 manages the traversal of a matrix operand in global memory for SM70 (Volta) GPUs. It uses explicit register-based loads (__ldcs for cache-streaming) rather than the cp.async instruction available on later architectures.
The iterator is templated on data type, thread map, shared memory layout, packing scheme, matrix order, alignment, striding mode, and cache policy. It maintains source data pointers, offsets for contiguous and strided dimensions, step sizes for K-dimension advancement, and a predicate mask for boundary checking. Data is fetched into register fragments and then stored to shared memory in a separate step, enabling the software-pipelined mainloop of MainloopSm70.
Supports flat, blocked, and indexed striding modes for batched/grouped GEMM.
Usage
Used by MainloopSm70 to load tiles of operands A, B, U, V from global memory into shared memory via register intermediaries.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/gemm/iterator_sm70.h
Signature
template<class T, class Map, class SmemLayout, Pack kPack, Order kOrder,
bool AlignedC, bool AlignedS, Striding mode, class Policy_>
struct GmemIteratorSm70 {
using ThreadMap = Map;
using AccessType = Array<T, Map::kAccessC>;
static constexpr Striding kMode = mode;
// Fetch, Store, Advance, ClearSmem methods
};
Import
#include "src/turbomind/kernels/gemm/iterator_sm70.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mat | MatrixData | Yes | Resolved matrix pointer and stride |
| offset | int2 | Yes | (M, K) starting offset into the matrix |
| extent | int2 | Yes | (M, K) valid extent for boundary checking |
Outputs
| Name | Type | Description |
|---|---|---|
| Fragments | register arrays | Tile data loaded into thread-local register fragments |
| shared memory | smem | Data stored to shared memory after register fetch |
Usage Examples
GmemIteratorSm70<half, ThreadMap, SmemLayout, pack, kColMajor, true, true, Striding::kFlat, Policy>
gmem_A{mat_A, {offset_m, offset_k}, {extent_m, CTA_K}};
gmem_A.Fetch(fragments, mask);
gmem_A.Store(fragments);
gmem_A.Advance();