Implementation:InternLM Lmdeploy Gemm IteratorSm70

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, GEMM
Last Updated	2026-02-07 15:00 GMT

Overview

Global memory iterator for Volta (SM70) architecture that loads matrix tiles from global memory to registers using predicated loads with __ldcs streaming hints.

Description

GmemIteratorSm70 manages the traversal of a matrix operand in global memory for SM70 (Volta) GPUs. It uses explicit register-based loads (__ldcs for cache-streaming) rather than the cp.async instruction available on later architectures.

The iterator is templated on data type, thread map, shared memory layout, packing scheme, matrix order, alignment, striding mode, and cache policy. It maintains source data pointers, offsets for contiguous and strided dimensions, step sizes for K-dimension advancement, and a predicate mask for boundary checking. Data is fetched into register fragments and then stored to shared memory in a separate step, enabling the software-pipelined mainloop of MainloopSm70.

Supports flat, blocked, and indexed striding modes for batched/grouped GEMM.

Usage

Used by MainloopSm70 to load tiles of operands A, B, U, V from global memory into shared memory via register intermediaries.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/gemm/iterator_sm70.h

Signature

template<class T, class Map, class SmemLayout, Pack kPack, Order kOrder,
         bool AlignedC, bool AlignedS, Striding mode, class Policy_>
struct GmemIteratorSm70 {
    using ThreadMap = Map;
    using AccessType = Array<T, Map::kAccessC>;
    static constexpr Striding kMode = mode;
    // Fetch, Store, Advance, ClearSmem methods
};

Import

#include "src/turbomind/kernels/gemm/iterator_sm70.h"

I/O Contract

Inputs

Name	Type	Required	Description
mat	MatrixData	Yes	Resolved matrix pointer and stride
offset	int2	Yes	(M, K) starting offset into the matrix
extent	int2	Yes	(M, K) valid extent for boundary checking

Outputs

Name	Type	Description
Fragments	register arrays	Tile data loaded into thread-local register fragments
shared memory	smem	Data stored to shared memory after register fetch

Usage Examples

GmemIteratorSm70<half, ThreadMap, SmemLayout, pack, kColMajor, true, true, Striding::kFlat, Policy>
    gmem_A{mat_A, {offset_m, offset_k}, {extent_m, CTA_K}};
gmem_A.Fetch(fragments, mask);
gmem_A.Store(fragments);
gmem_A.Advance();

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment