
Implementation:Sgl project Sglang CPU Decode Attention

From Leeroopedia


Knowledge Sources
Domains Attention, CPU Compute
Last Updated 2026-02-10 00:00 GMT

Overview

Implements the CPU decode attention kernel for the autoregressive token generation phase, where each query consists of a single token attending to the full key-value cache.

Description

decode.cpp provides the CPU-optimized decode attention kernel, the performance-critical path during the generation phase of LLM inference. The kernel uses VNNI (Vector Neural Network Instructions) format conversion for efficient KV cache access, following a strategy similar to FlashMLA.

The key data layout transformations are:

  • Key packing: from [N, K/2, 2] to [K/2, N, 2] for row-major access during Q*K computation
  • Value packing: from [N/2, 2, Kv] to [N/2, Kv, 2] for efficient score*V accumulation

The pack_vnni_Nx32 function handles 16x512-bit blocks at a time using _mm512_loadu_si512 intrinsics and transpose_2x32_16bit / transpose_16x16_32bit operations. The outer pack_vnni function iterates over blocks of 16 tokens and 32-element key/value chunks.

The kernel parallelizes across batches and attention heads, using blocked computation along the KV sequence dimension with online softmax accumulation. Key implementation notes from the code include:

  • BLOCK_N tuning needed for different configurations
  • Planning for {batches, num_heads, num_kv_splits} parallelism
  • Logit softcapping via fast .tanh() approximation
  • AMX kernel support for index_gemm_kernel_nn when M=16
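The blocked online-softmax accumulation with optional logit softcapping can be sketched in scalar form. The function below is a minimal illustration under assumed inputs (precomputed q*k logits and a dense value buffer); the actual kernel vectorizes this, uses a fast tanh approximation for the softcap, and fuses the GEMMs.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Scalar sketch of blocked online softmax along the KV sequence,
// with the optional logit softcap noted above. BLOCK_N mirrors the
// tunable block size; softcap <= 0 disables capping.
std::vector<float> decode_attention_ref(const std::vector<float>& logits,  // [N] raw q*k
                                        const std::vector<float>& v,       // [N, Kv]
                                        int N, int Kv, float scale, float softcap) {
  const int BLOCK_N = 4;
  float m = -INFINITY;                // running max of seen logits
  float s = 0.f;                      // running softmax normalizer
  std::vector<float> acc(Kv, 0.f);    // running weighted sum of V rows
  for (int nb = 0; nb < N; nb += BLOCK_N) {
    int ne = std::min(nb + BLOCK_N, N);
    // 1) scale (and optionally softcap) this block's logits, track new max
    float m_new = m;
    std::vector<float> p(ne - nb);
    for (int n = nb; n < ne; ++n) {
      float x = logits[n] * scale;
      if (softcap > 0.f) x = softcap * std::tanh(x / softcap);
      p[n - nb] = x;
      m_new = std::max(m_new, x);
    }
    // 2) rescale previous partial results to the new max
    float r = std::exp(m - m_new);
    s *= r;
    for (float& a : acc) a *= r;
    // 3) accumulate exp(x - m_new) * V for this block
    for (int n = nb; n < ne; ++n) {
      float e = std::exp(p[n - nb] - m_new);
      s += e;
      for (int kv = 0; kv < Kv; ++kv) acc[kv] += e * v[n * Kv + kv];
    }
    m = m_new;
  }
  for (float& a : acc) a /= s;        // final normalization
  return acc;
}
```

With uniform logits this reduces to a plain average of the value rows, which makes the online rescaling easy to check.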

Usage

This kernel is invoked automatically during the decode phase of LLM serving on CPU backends. It is called when processing single-token queries that attend to the full KV cache stored in paged memory.

Code Reference

Source Location

Signature

// VNNI packing for 16x512-bit blocks (AVX-512)
template <typename scalar_t, typename index_t>
inline void pack_vnni_Nx32(
    scalar_t* __restrict__ dst0,
    scalar_t* __restrict__ dst1,
    const scalar_t* __restrict__ src,
    const index_t* __restrict__ ind,
    int N, int ld_src, int ld_dst0, int ld_dst1,
    bool convert_v);

// Full VNNI format conversion for key and value
template <typename scalar_t, typename index_t>
void pack_vnni(
    scalar_t* __restrict__ dst0,
    scalar_t* __restrict__ dst1,
    const scalar_t* __restrict__ src,
    const index_t* __restrict__ ind,
    int N, int K, int Kv,
    int ld_src, int ld_dst0, int ld_dst1);

Import

#include "common.h"
#include "gemm.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
q scalar_t* Yes Query tensor for current decode step, shape [batches, num_heads, head_size]
kv_cache scalar_t* Yes Paged KV cache buffer containing all cached keys and values
ind index_t* Yes Index array mapping logical sequence positions to physical cache locations
N int Yes Number of KV cache entries (sequence length)
K int Yes Key head dimension size
Kv int Yes Value head dimension size (may differ from K, e.g., MLA)
ld_src int Yes Leading dimension stride for the source KV cache

Outputs

Name Type Description
dst0 scalar_t* Packed key buffer in VNNI format [K/2, N, 2]
dst1 scalar_t* Packed value buffer in VNNI format [N/2, Kv, 2]
output scalar_t* Attention output tensor, shape [batches, num_heads, head_size_v]

Usage Examples

VNNI Key-Value Packing

// Pack KV cache entries for VNNI-optimized GEMM
pack_vnni<at::BFloat16, int32_t>(
    key_packed,       // dst0: [K/2, N, 2]
    val_packed,       // dst1: [N/2, Kv, 2]
    kv_cache_ptr,     // src: [N, K+Kv]
    token_indices,    // ind: physical cache indices
    seq_len,          // N: number of cached tokens
    head_size,        // K: key dimension
    head_size_v,      // Kv: value dimension
    kv_stride,        // ld_src
    packed_k_stride,  // ld_dst0
    packed_v_stride   // ld_dst1
);
