Implementation: vLLM CPU Attn Dispatcher
| Knowledge Sources | |
|---|---|
| Domains | Attention, CPU_Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Main CPU attention dispatcher that routes attention computations to ISA-specific kernel implementations (AMX, VEC, VEC16, NEON) based on hardware capabilities.
Description
This file implements three core functions for CPU-based attention:
- get_scheduler_metadata computes scheduling metadata for attention work partitioning.
- cpu_attn_reshape_and_cache reshapes and stores key/value tensors into the paged KV cache.
- cpu_attention_with_kv_cache executes the full attention computation using cached KV pairs.
All three functions dispatch to specialized implementations via the CPU_ATTN_DISPATCH macro, keyed on head dimension and ISA type.
Usage
These functions are compiled into the vLLM CPU extension and called from the Python attention backend. The ISA hint string ("amx", "vec", "vec16", "neon") is selected at runtime based on detected CPU features, enabling automatic hardware-optimized attention execution.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/cpu_attn.cpp
- Lines: 1-185
Signature
torch::Tensor get_scheduler_metadata(
    const int64_t num_req, const int64_t num_heads_q,
    const int64_t num_heads_kv, const int64_t head_dim,
    const torch::Tensor& seq_lens, at::ScalarType dtype,
    const torch::Tensor& query_start_loc, const bool casual,
    const int64_t window_size, const std::string& isa_hint,
    const bool enable_kv_split);
void cpu_attn_reshape_and_cache(
    const torch::Tensor& key, const torch::Tensor& value,
    torch::Tensor& key_cache, torch::Tensor& value_cache,
    const torch::Tensor& slot_mapping, const std::string& isa);
void cpu_attention_with_kv_cache(
    const torch::Tensor& query, const torch::Tensor& key_cache,
    const torch::Tensor& value_cache, torch::Tensor& output,
    const torch::Tensor& query_start_loc, const torch::Tensor& seq_lens,
    const double scale, const bool causal,
    const std::optional<torch::Tensor>& alibi_slopes,
    const int64_t sliding_window_left, const int64_t sliding_window_right,
    const torch::Tensor& block_table, const double softcap,
    const torch::Tensor& scheduler_metadata,
    const std::optional<torch::Tensor>& s_aux);
Import
#include "cpu_attn_dispatch_generated.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| query | torch::Tensor | Yes | Query tensor [num_tokens, num_heads, head_size] |
| key_cache | torch::Tensor | Yes | Paged key cache [num_blocks, num_kv_heads, block_size, head_size] |
| value_cache | torch::Tensor | Yes | Paged value cache [num_blocks, num_kv_heads, block_size, head_size] |
| output | torch::Tensor | Yes | Pre-allocated output tensor [num_tokens, num_heads, head_size] |
| seq_lens | torch::Tensor | Yes | Per-request sequence lengths [num_req] |
| query_start_loc | torch::Tensor | Yes | Start index of each request's queries [num_req + 1] |
| block_table | torch::Tensor | Yes | Block table mapping requests to KV cache blocks [num_req, max_block_num] |
| scale | double | Yes | Softmax scaling factor (typically 1/sqrt(head_dim)) |
| causal | bool | Yes | Whether to apply causal attention masking |
| isa_hint | std::string | Yes | ISA selection hint: "amx", "vec", "vec16", or "neon" |
| alibi_slopes | torch::Tensor | No | ALiBi attention slopes [num_heads] |
| sliding_window_left | int64_t | No | Left sliding window size (-1 for no window) |
| sliding_window_right | int64_t | No | Right sliding window size (-1 for no window) |
| softcap | double | No | Logits soft-capping value (0 for disabled) |
| scheduler_metadata | torch::Tensor | Yes | Opaque scheduling metadata from get_scheduler_metadata |
| s_aux | torch::Tensor | No | Optional auxiliary tensor forwarded to the attention kernels |
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch::Tensor | Attention result written in-place [num_tokens, num_heads, head_size] |
| scheduler_metadata | torch::Tensor | Returned by get_scheduler_metadata for use in attention execution |
Usage Examples
// Step 1: Get scheduler metadata
auto metadata = get_scheduler_metadata(
    num_req, num_heads_q, num_heads_kv, head_dim,
    seq_lens, torch::kBFloat16, query_start_loc,
    /*casual=*/true, /*window_size=*/-1,
    /*isa_hint=*/"amx", /*enable_kv_split=*/true);
// Step 2: Reshape and cache KV
cpu_attn_reshape_and_cache(key, value, key_cache, value_cache,
                           slot_mapping, "amx");
// Step 3: Execute attention
cpu_attention_with_kv_cache(
    query, key_cache, value_cache, output,
    query_start_loc, seq_lens, scale, /*causal=*/true,
    /*alibi_slopes=*/std::nullopt, /*sliding_window_left=*/-1,
    /*sliding_window_right=*/-1, block_table, /*softcap=*/0.0,
    metadata, /*s_aux=*/std::nullopt);