Implementation: Sgl project Sglang CPU Decode Attention
| Knowledge Sources | |
|---|---|
| Domains | Attention, CPU Compute |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements the CPU decode attention kernel for the autoregressive token generation phase, where each query has a single token attending to the full key-value cache.
Description
decode.cpp provides the CPU-optimized decode attention kernel, the performance-critical path during the generation phase of LLM inference. The kernel converts the KV cache into VNNI (Vector Neural Network Instructions) format for efficient access, following a strategy similar to FlashMLA.
The key data layout transformations are:
- Key packing: from [N, K/2, 2] to [K/2, N, 2] for row-major access during Q*K computation
- Value packing: from [N/2, 2, Kv] to [N/2, Kv, 2] for efficient score*V accumulation
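The two layout transforms above can be expressed as a scalar reference. This is a sketch for illustration only, not the vectorized kernel: the name `pack_vnni_reference` and the flat `[key | value]` row layout per cache entry are assumptions, and even N/K are assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Scalar sketch of the VNNI layouts described above. Keys end up in
// [K/2, N, 2] so a bf16 dot-product instruction can consume pairs of
// adjacent K elements while streaming tokens contiguously; values end up
// in [N/2, Kv, 2], pairing adjacent tokens. `ind` maps logical token n to
// a physical row of the paged cache (assumed here to hold key then value).
template <typename scalar_t, typename index_t>
void pack_vnni_reference(
    scalar_t* dst_k,      // packed keys,   [K/2, N, 2]
    scalar_t* dst_v,      // packed values, [N/2, Kv, 2]
    const scalar_t* src,  // paged cache rows (key | value), stride ld_src
    const index_t* ind,   // logical-to-physical token mapping
    int N, int K, int Kv, int ld_src) {
  for (int n = 0; n < N; ++n) {
    const scalar_t* row = src + ind[n] * ld_src;
    for (int k = 0; k < K; ++k)
      dst_k[(k / 2) * N * 2 + n * 2 + (k % 2)] = row[k];
    for (int kv = 0; kv < Kv; ++kv)
      dst_v[(n / 2) * Kv * 2 + kv * 2 + (n % 2)] = row[K + kv];
  }
}
```

The real kernel performs the same permutation in 16x512-bit blocks via the AVX-512 transpose helpers described below.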
The pack_vnni_Nx32 function handles 16x512-bit blocks at a time using _mm512_loadu_si512 intrinsics and transpose_2x32_16bit / transpose_16x16_32bit operations. The outer pack_vnni function iterates over blocks of 16 tokens and 32-element key/value chunks.
The kernel parallelizes across batches and attention heads, using blocked computation along the KV sequence dimension with online softmax accumulation. Key implementation notes from the code include:
- BLOCK_N requires tuning for different configurations
- Parallelism over {batches, num_heads, num_kv_splits} is planned
- Logit softcapping via fast .tanh() approximation
- AMX kernel support for index_gemm_kernel_nn when M=16
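The blocked online-softmax accumulation (including the optional tanh softcapping noted above) can be sketched in scalar C++. This is a minimal reference, not the kernel's API: `decode_attention_reference` and its parameters are illustrative names, and `std::tanh` stands in for the kernel's fast approximation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Scalar sketch of blocked decode attention with online softmax, the
// accumulation scheme applied per (batch, head). The running max `m` and
// sum `s` let each KV block be folded in without a second pass; `cap > 0`
// enables logit softcapping via tanh.
std::vector<float> decode_attention_reference(
    const std::vector<float>& q,                  // [K]
    const std::vector<std::vector<float>>& keys,  // N rows of [K]
    const std::vector<std::vector<float>>& vals,  // N rows of [Kv]
    float scale, int BLOCK_N = 32, float cap = 0.0f) {
  const int N = (int)keys.size();
  const int Kv = (int)vals[0].size();
  float m = -INFINITY, s = 0.0f;     // running max and softmax denominator
  std::vector<float> acc(Kv, 0.0f);  // unnormalized output accumulator
  for (int n0 = 0; n0 < N; n0 += BLOCK_N) {
    int n1 = std::min(n0 + BLOCK_N, N);
    for (int n = n0; n < n1; ++n) {
      float logit = 0.0f;
      for (size_t k = 0; k < q.size(); ++k) logit += q[k] * keys[n][k];
      logit *= scale;
      if (cap > 0.0f) logit = cap * std::tanh(logit / cap);  // softcapping
      float m_new = std::max(m, logit);
      float corr = std::exp(m - m_new);  // rescale earlier partial sums
      float p = std::exp(logit - m_new);
      s = s * corr + p;
      for (int kv = 0; kv < Kv; ++kv)
        acc[kv] = acc[kv] * corr + p * vals[n][kv];
      m = m_new;
    }
  }
  for (int kv = 0; kv < Kv; ++kv) acc[kv] /= s;  // final normalization
  return acc;
}
```

Because only `m`, `s`, and `acc` carry state between blocks, this scheme extends naturally to the planned num_kv_splits parallelism: partial results from disjoint KV ranges can be merged with the same rescaling step.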
Usage
This kernel is invoked automatically during the decode phase of LLM serving on CPU backends. It is called when processing single-token queries that attend to the full KV cache stored in paged memory.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/decode.cpp
- Lines: 1-1744
Signature
// VNNI packing for 16x512-bit blocks (AVX-512)
template <typename scalar_t, typename index_t>
inline void pack_vnni_Nx32(
    scalar_t* __restrict__ dst0,
    scalar_t* __restrict__ dst1,
    const scalar_t* __restrict__ src,
    const index_t* __restrict__ ind,
    int N, int ld_src, int ld_dst0, int ld_dst1,
    bool convert_v);

// Full VNNI format conversion for key and value
template <typename scalar_t, typename index_t>
void pack_vnni(
    scalar_t* __restrict__ dst0,
    scalar_t* __restrict__ dst1,
    const scalar_t* __restrict__ src,
    const index_t* __restrict__ ind,
    int N, int K, int Kv,
    int ld_src, int ld_dst0, int ld_dst1);
Import
#include "common.h"
#include "gemm.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| q | scalar_t* | Yes | Query tensor for current decode step, shape [batches, num_heads, head_size] |
| kv_cache | scalar_t* | Yes | Paged KV cache buffer containing all cached keys and values |
| ind | index_t* | Yes | Index array mapping logical sequence positions to physical cache locations |
| N | int | Yes | Number of KV cache entries (sequence length) |
| K | int | Yes | Key head dimension size |
| Kv | int | Yes | Value head dimension size (may differ from K, e.g., MLA) |
| ld_src | int | Yes | Leading dimension stride for the source KV cache |
| ld_dst0 | int | Yes | Leading dimension stride for the packed key buffer dst0 |
| ld_dst1 | int | Yes | Leading dimension stride for the packed value buffer dst1 |
Outputs
| Name | Type | Description |
|---|---|---|
| dst0 | scalar_t* | Packed key buffer in VNNI format [K/2, N, 2] |
| dst1 | scalar_t* | Packed value buffer in VNNI format [N/2, Kv, 2] |
| output | scalar_t* | Attention output tensor, shape [batches, num_heads, head_size_v] |
Usage Examples
VNNI Key-Value Packing
// Pack KV cache entries for VNNI-optimized GEMM
pack_vnni<at::BFloat16, int32_t>(
key_packed, // dst0: [K/2, N, 2]
val_packed, // dst1: [N/2, Kv, 2]
kv_cache_ptr, // src: [N, K+Kv]
token_indices, // ind: physical cache indices
seq_len, // N: number of cached tokens
head_size, // K: key dimension
head_size_v, // Kv: value dimension
kv_stride, // ld_src
packed_k_stride, // ld_dst0
packed_v_stride // ld_dst1
);
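For intuition on why the [K/2, N, 2] key layout suits the Q*K step, here is a scalar sketch of how a packed key buffer is consumed. `qk_over_packed` is a hypothetical helper, not part of decode.cpp; the point is the access pattern, where the inner loop streams all N tokens contiguously, matching the row-major order a VNNI-style dot-product instruction wants.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar sketch: compute logits q . k_n for every cached token n, reading
// keys directly from the VNNI-packed [K/2, N, 2] buffer. For each pair of
// K elements, the walk over n is contiguous in memory.
std::vector<float> qk_over_packed(
    const std::vector<float>& q,         // [K]
    const std::vector<float>& packed_k,  // [K/2, N, 2]
    int N, int K) {
  std::vector<float> logits(N, 0.0f);
  for (int k2 = 0; k2 < K / 2; ++k2)
    for (int n = 0; n < N; ++n)          // contiguous run over tokens
      for (int t = 0; t < 2; ++t)
        logits[n] += q[2 * k2 + t] * packed_k[(k2 * N + n) * 2 + t];
  return logits;
}
```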