Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy Core Smem

From Leeroopedia
Revision as of 15:14, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/InternLM_Lmdeploy_Core_Smem.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains GPU_Kernels, Memory_Layout
Last Updated 2026-02-07 15:00 GMT

Overview

Shared memory utility functions for ldmatrix PTX instructions and shared memory pointer conversion.

Description

This header provides device-inlined wrappers for the ldmatrix.sync.aligned PTX instructions, which load 8x8 b16 matrix fragments directly from shared memory into registers for Tensor Core consumption. It includes: cast_smem_ptr_to_uint() to convert a generic pointer to a shared memory address suitable for PTX; ldmatrix_m8n8_x4_b16, ldmatrix_m8n8_x2_b16, ldmatrix_m8n8_x1_b16 for loading 4, 2, or 1 matrix fragments; and transposed variants ldsm_x4_trans, ldsm_x2_trans, ldsm_x1_trans. Convenience overloads accept Array<uint32_t, N> for ergonomic fragment handling. All require SM75 (Turing) or later.

Usage

Use these primitives in GEMM or attention kernels to load matrix tiles from shared memory into register fragments before issuing MMA instructions.

Code Reference

Source Location

Signature

__device__ uint32_t cast_smem_ptr_to_uint(void const* const ptr);

__device__ void ldmatrix_m8n8_x4_b16(uint& d0, uint& d1, uint& d2, uint& d3, uint32_t smem_int_ptr);
__device__ void ldmatrix_m8n8_x2_b16(uint& d0, uint& d1, uint32_t smem_int_ptr);
__device__ void ldmatrix_m8n8_x1_b16(uint& d0, uint32_t smem_int_ptr);

__device__ void ldsm_x4(Array<uint32_t, 4>& d, uint32_t smem_int_ptr);
__device__ void ldsm_x2(Array<uint32_t, 2>& d, uint32_t smem_int_ptr);

__device__ void ldsm_x4_trans(Array<uint32_t, 4>& d, uint32_t smem_int_ptr);
__device__ void ldsm_x2_trans(Array<uint32_t, 2>& d, uint32_t smem_int_ptr);

Import

#include "src/turbomind/kernels/core/smem.h"

I/O Contract

Inputs

Name Type Required Description
smem_int_ptr uint32_t Yes Shared memory address as unsigned integer (from cast_smem_ptr_to_uint)
ptr void const* Yes Generic shared memory pointer to convert

Outputs

Name Type Description
d0..d3 uint& Register values loaded from shared memory matrix fragments
d Array<uint32_t, N>& Loaded matrix fragment data packed into array

Usage Examples

using namespace turbomind;

__shared__ half smem_tile[64 * 64];
uint32_t addr = cast_smem_ptr_to_uint(&smem_tile[offset]);

// Load 4 x m8n8 fragments
Array<uint32_t, 4> frag;
ldsm_x4(frag, addr);

// Load transposed fragments
Array<uint32_t, 2> frag_t;
ldsm_x2_trans(frag_t, addr);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment