
Implementation:Sgl project Sglang CPU Decode Attention

From Leeroopedia


Knowledge Sources
Domains Attention, CPU Compute
Last Updated 2026-02-10 00:00 GMT

Overview

Implements the CPU decode attention kernel for the autoregressive token generation phase, where each query consists of a single token attending to the full key-value cache.

Description

decode.cpp provides the CPU-optimized decode attention kernel, the performance-critical path during the generation phase of LLM inference. The kernel uses VNNI (Vector Neural Network Instructions) format conversion for efficient KV cache access, following a strategy similar to FlashMLA.

The key data layout transformations are:

  • Key packing: from [N, K/2, 2] to [K/2, N, 2] for row-major access during Q*K computation
  • Value packing: from [N/2, 2, Kv] to [N/2, Kv, 2] for efficient score*V accumulation

The pack_vnni_Nx32 function handles 16x512-bit blocks at a time using _mm512_loadu_si512 intrinsics and transpose_2x32_16bit / transpose_16x16_32bit operations. The outer pack_vnni function iterates over blocks of 16 tokens and 32-element key/value chunks.

The kernel parallelizes across batches and attention heads, using blocked computation along the KV sequence dimension with online softmax accumulation. Key implementation notes from the code include:

  • BLOCK_N tuning needed for different configurations
  • Planning for {batches, num_heads, num_kv_splits} parallelism
  • Logit softcapping via fast .tanh() approximation
  • AMX kernel support for index_gemm_kernel_nn when M=16
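The blocked online-softmax accumulation with optional logit softcapping can be sketched in scalar form. The function below is a minimal illustration under assumed inputs (precomputed q*k logits and a dense value buffer); the actual kernel vectorizes this, uses a fast tanh approximation for the softcap, and fuses the GEMMs.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Scalar sketch of blocked online softmax along the KV sequence,
// with the optional logit softcap noted above. BLOCK_N mirrors the
// tunable block size; softcap <= 0 disables capping.
std::vector<float> decode_attention_ref(const std::vector<float>& logits,  // [N] raw q*k
                                        const std::vector<float>& v,       // [N, Kv]
                                        int N, int Kv, float scale, float softcap) {
  const int BLOCK_N = 4;
  float m = -INFINITY;                // running max of seen logits
  float s = 0.f;                      // running softmax normalizer
  std::vector<float> acc(Kv, 0.f);    // running weighted sum of V rows
  for (int nb = 0; nb < N; nb += BLOCK_N) {
    int ne = std::min(nb + BLOCK_N, N);
    // 1) scale (and optionally softcap) this block's logits, track new max
    float m_new = m;
    std::vector<float> p(ne - nb);
    for (int n = nb; n < ne; ++n) {
      float x = logits[n] * scale;
      if (softcap > 0.f) x = softcap * std::tanh(x / softcap);
      p[n - nb] = x;
      m_new = std::max(m_new, x);
    }
    // 2) rescale previous partial results to the new max
    float r = std::exp(m - m_new);
    s *= r;
    for (float& a : acc) a *= r;
    // 3) accumulate exp(x - m_new) * V for this block
    for (int n = nb; n < ne; ++n) {
      float e = std::exp(p[n - nb] - m_new);
      s += e;
      for (int kv = 0; kv < Kv; ++kv) acc[kv] += e * v[n * Kv + kv];
    }
    m = m_new;
  }
  for (float& a : acc) a /= s;        // final normalization
  return acc;
}
```

With uniform logits this reduces to a plain average of the value rows, which makes the online rescaling easy to check.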

Usage

This kernel is invoked automatically during the decode phase of LLM serving on CPU backends. It is called when processing single-token queries that attend to the full KV cache stored in paged memory.

Code Reference

Source Location

Signature

// VNNI packing for 16x512-bit blocks (AVX-512)
template <typename scalar_t, typename index_t>
inline void pack_vnni_Nx32(
    scalar_t* __restrict__ dst0,
    scalar_t* __restrict__ dst1,
    const scalar_t* __restrict__ src,
    const index_t* __restrict__ ind,
    int N, int ld_src, int ld_dst0, int ld_dst1,
    bool convert_v);

// Full VNNI format conversion for key and value
template <typename scalar_t, typename index_t>
void pack_vnni(
    scalar_t* __restrict__ dst0,
    scalar_t* __restrict__ dst1,
    const scalar_t* __restrict__ src,
    const index_t* __restrict__ ind,
    int N, int K, int Kv,
    int ld_src, int ld_dst0, int ld_dst1);

Import

#include "common.h"
#include "gemm.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
q scalar_t* Yes Query tensor for current decode step, shape [batches, num_heads, head_size]
kv_cache scalar_t* Yes Paged KV cache buffer containing all cached keys and values
ind index_t* Yes Index array mapping logical sequence positions to physical cache locations
N int Yes Number of KV cache entries (sequence length)
K int Yes Key head dimension size
Kv int Yes Value head dimension size (may differ from K, e.g., MLA)
ld_src int Yes Leading dimension stride for the source KV cache

Outputs

Name Type Description
dst0 scalar_t* Packed key buffer in VNNI format [K/2, N, 2]
dst1 scalar_t* Packed value buffer in VNNI format [N/2, Kv, 2]
output scalar_t* Attention output tensor, shape [batches, num_heads, head_size_v]

Usage Examples

VNNI Key-Value Packing

// Pack KV cache entries for VNNI-optimized GEMM
pack_vnni<at::BFloat16, int32_t>(
    key_packed,       // dst0: [K/2, N, 2]
    val_packed,       // dst1: [N/2, Kv, 2]
    kv_cache_ptr,     // src: [N, K+Kv]
    token_indices,    // ind: physical cache indices
    seq_len,          // N: number of cached tokens
    head_size,        // K: key dimension
    head_size_v,      // Kv: value dimension
    kv_stride,        // ld_src
    packed_k_stride,  // ld_dst0
    packed_v_stride   // ld_dst1
);
