Implementation:InternLM Lmdeploy Impl 884

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Attention implementation using m8n8k4 tensor core MMA instructions for Volta (SM70) GPUs, with custom swizzle patterns for V shared memory and shared-memory-based S-to-P conversion.

Description

This is the MMA_884 specialization of Impl, targeting SM70 (Volta) GPUs which use the older m8n8k4 WMMA format. Fragment types follow the Volta thread-data mapping with 4-element register tiles and 8-element score/output fragments. The implementation includes a custom SwizzleV functor to avoid bank conflicts during LDS.128 loads of V. Unlike the Turing/Ampere implementations, S-to-P conversion writes probabilities to shared memory (SmemLayoutP) and reloads them as P fragments, because the Volta fragment layout makes direct register reinterpretation impractical. The shared memory layout uses padding (HeadDim+4, CTA_S+4) instead of swizzling for Q/K/P.

Usage

Selected by AttentionConfig for SM70 (Volta) prefill attention. Paired with the Sm70 mainloop.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/impl_884.h
Lines: 1-458

Signature

namespace turbomind::attention {

template<class T_, int CTA_H_, int CTA_Q_, int CTA_S_,
         int WARP_H_, int WARP_Q, int WARP_S, int HeadDim>
struct Impl<MMA_884, T_, T_, CTA_H_, CTA_Q_, CTA_S_,
            WARP_H_, WARP_Q, WARP_S, HeadDim> {

    using T = T_;
    using Tkv = T_;
    static constexpr int OP_M = 16;
    static constexpr int OP_N = 16;
    static constexpr int OP_K = 4;

    using FragQ = Array<half, 4>[K_K][K_M];
    using FragK = Array<half, 4>[K_K][K_N];
    using FragS = Array<float, 8>[K_M][K_N];
    using FragP = Array<half, 4>[V_K][V_M];
    using FragV = Array<half, 4>[V_K][V_N];
    using FragO = Array<float, 8>[V_M][V_N];

    struct SwizzleV { static __device__ int apply(int offset); };

    struct SharedStorage {
        union {
            T Q[SmemLayoutQ::kSize];
            struct { T K[...]; T V[...]; T P[...]; };
        };
    };

    struct StateQK { ... };
    struct StatePV { ... };

    static void Softmax<is_residue>(FragS&, FragM&, FragL&, FragO&, float);
    static void ConvertStoP(FragS&, FragP&, SharedStorage&);
};

} // namespace turbomind::attention

Import

#include "src/turbomind/kernels/attention/impl_884.h"

I/O Contract

Inputs

Name	Type	Required	Description
T_	typename	Yes	Data type (half)
CTA_Q_	int	Yes	CTA query tile size
CTA_S_	int	Yes	CTA key/sequence tile size
HeadDim	int	Yes	Head dimension

Outputs

Name	Type	Description
FragO	Array<float,8>[V_M][V_N]	Accumulated output fragment
FragM	Array<float,2>[V_M]	Row-wise maximum
FragL	Array<float,2>[V_M]	Row-wise sum

Usage Examples

// Volta prefill config
using Attention = Impl<MMA_884, half, half, 1, 64, 64, 1, 16, 64, 128, 2>;
using Kernel = AttentionUniversal<arch::Sm70,
    Mainloop<arch::Sm70, Attention>,
    GetCacheIterFactory<CacheType::kBlock, half, 64, 128>,
    AttentionCtaMap>;

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment