Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy Gemm IteratorSm70

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, GEMM
Last Updated 2026-02-07 15:00 GMT

Overview

Global memory iterator for Volta (SM70) architecture that loads matrix tiles from global memory to registers using predicated loads with __ldcs streaming hints.

Description

GmemIteratorSm70 manages the traversal of a matrix operand in global memory for SM70 (Volta) GPUs. It uses explicit register-based loads (__ldcs for cache-streaming) rather than the cp.async instruction available on later architectures.

The iterator is templated on data type, thread map, shared memory layout, packing scheme, matrix order, alignment, striding mode, and cache policy. It maintains source data pointers, offsets for contiguous and strided dimensions, step sizes for K-dimension advancement, and a predicate mask for boundary checking. Data is fetched into register fragments and then stored to shared memory in a separate step, enabling the software-pipelined mainloop of MainloopSm70.

Supports flat, blocked, and indexed striding modes for batched/grouped GEMM.

Usage

Used by MainloopSm70 to load tiles of operands A, B, U, V from global memory into shared memory via register intermediaries.

Code Reference

Source Location

Signature

template<class T, class Map, class SmemLayout, Pack kPack, Order kOrder,
         bool AlignedC, bool AlignedS, Striding mode, class Policy_>
struct GmemIteratorSm70 {
    using ThreadMap = Map;
    using AccessType = Array<T, Map::kAccessC>;
    static constexpr Striding kMode = mode;
    // Fetch, Store, Advance, ClearSmem methods
};

Import

#include "src/turbomind/kernels/gemm/iterator_sm70.h"

I/O Contract

Inputs

Name Type Required Description
mat MatrixData Yes Resolved matrix pointer and stride
offset int2 Yes (M, K) starting offset into the matrix
extent int2 Yes (M, K) valid extent for boundary checking

Outputs

Name Type Description
Fragments register arrays Tile data loaded into thread-local register fragments
shared memory smem Data stored to shared memory after register fetch

Usage Examples

GmemIteratorSm70<half, ThreadMap, SmemLayout, pack, kColMajor, true, true, Striding::kFlat, Policy>
    gmem_A{mat_A, {offset_m, offset_k}, {extent_m, CTA_K}};
gmem_A.Fetch(fragments, mask);
gmem_A.Store(fragments);
gmem_A.Advance();

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment