Implementation:InternLM Lmdeploy Impl 884
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Attention implementation using m8n8k4 tensor core MMA instructions for Volta (SM70) GPUs, with custom swizzle patterns for V shared memory and shared-memory-based S-to-P conversion.
Description
This is the MMA_884 specialization of Impl, targeting SM70 (Volta) GPUs which use the older m8n8k4 WMMA format. Fragment types follow the Volta thread-data mapping with 4-element register tiles and 8-element score/output fragments. The implementation includes a custom SwizzleV functor to avoid bank conflicts during LDS.128 loads of V. Unlike the Turing/Ampere implementations, S-to-P conversion writes probabilities to shared memory (SmemLayoutP) and reloads them as P fragments, because the Volta fragment layout makes direct register reinterpretation impractical. The shared memory layout uses padding (HeadDim+4, CTA_S+4) instead of swizzling for Q/K/P.
Usage
Selected by AttentionConfig for SM70 (Volta) prefill attention. Paired with the Sm70 mainloop.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/impl_884.h
- Lines: 1-458
Signature
namespace turbomind::attention {
template<class T_, int CTA_H_, int CTA_Q_, int CTA_S_,
int WARP_H_, int WARP_Q, int WARP_S, int HeadDim>
struct Impl<MMA_884, T_, T_, CTA_H_, CTA_Q_, CTA_S_,
WARP_H_, WARP_Q, WARP_S, HeadDim> {
using T = T_;
using Tkv = T_;
static constexpr int OP_M = 16;
static constexpr int OP_N = 16;
static constexpr int OP_K = 4;
using FragQ = Array<half, 4>[K_K][K_M];
using FragK = Array<half, 4>[K_K][K_N];
using FragS = Array<float, 8>[K_M][K_N];
using FragP = Array<half, 4>[V_K][V_M];
using FragV = Array<half, 4>[V_K][V_N];
using FragO = Array<float, 8>[V_M][V_N];
struct SwizzleV { static __device__ int apply(int offset); };
struct SharedStorage {
union {
T Q[SmemLayoutQ::kSize];
struct { T K[...]; T V[...]; T P[...]; };
};
};
struct StateQK { ... };
struct StatePV { ... };
static void Softmax<is_residue>(FragS&, FragM&, FragL&, FragO&, float);
static void ConvertStoP(FragS&, FragP&, SharedStorage&);
};
} // namespace turbomind::attention
Import
#include "src/turbomind/kernels/attention/impl_884.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| T_ | typename | Yes | Data type (half) |
| CTA_Q_ | int | Yes | CTA query tile size |
| CTA_S_ | int | Yes | CTA key/sequence tile size |
| HeadDim | int | Yes | Head dimension |
Outputs
| Name | Type | Description |
|---|---|---|
| FragO | Array<float,8>[V_M][V_N] | Accumulated output fragment |
| FragM | Array<float,2>[V_M] | Row-wise maximum |
| FragL | Array<float,2>[V_M] | Row-wise sum |
Usage Examples
// Volta prefill config
using Attention = Impl<MMA_884, half, half, 1, 64, 64, 1, 16, 64, 128, 2>;
using Kernel = AttentionUniversal<arch::Sm70,
Mainloop<arch::Sm70, Attention>,
GetCacheIterFactory<CacheType::kBlock, half, 64, 128>,
AttentionCtaMap>;