Implementation: vLLM CPU Attn Dispatcher
| Knowledge Sources | |
|---|---|
| Domains | Attention, CPU_Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Main CPU attention dispatcher that routes attention computations to ISA-specific kernel implementations (AMX, VEC, VEC16, NEON) based on hardware capabilities.
Description
This file implements three core functions for CPU-based attention:
- get_scheduler_metadata computes scheduling metadata for attention work partitioning.
- cpu_attn_reshape_and_cache reshapes and stores key/value tensors into the paged KV cache.
- cpu_attention_with_kv_cache executes the full attention computation using cached KV pairs.
All three functions dispatch to specialized implementations via the CPU_ATTN_DISPATCH macro, keyed on head dimension and ISA type.
Usage
These functions are compiled into the vLLM CPU extension and called from the Python attention backend. The ISA hint string ("amx", "vec", "vec16", "neon") is selected at runtime based on detected CPU features, enabling automatic hardware-optimized attention execution.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/cpu_attn.cpp
- Lines: 1-185
Signature
torch::Tensor get_scheduler_metadata(
    const int64_t num_req, const int64_t num_heads_q,
    const int64_t num_heads_kv, const int64_t head_dim,
    const torch::Tensor& seq_lens, at::ScalarType dtype,
    const torch::Tensor& query_start_loc, const bool casual,
    const int64_t window_size, const std::string& isa_hint,
    const bool enable_kv_split);
void cpu_attn_reshape_and_cache(
    const torch::Tensor& key, const torch::Tensor& value,
    torch::Tensor& key_cache, torch::Tensor& value_cache,
    const torch::Tensor& slot_mapping, const std::string& isa);
void cpu_attention_with_kv_cache(
    const torch::Tensor& query, const torch::Tensor& key_cache,
    const torch::Tensor& value_cache, torch::Tensor& output,
    const torch::Tensor& query_start_loc, const torch::Tensor& seq_lens,
    const double scale, const bool causal,
    const std::optional<torch::Tensor>& alibi_slopes,
    const int64_t sliding_window_left, const int64_t sliding_window_right,
    const torch::Tensor& block_table, const double softcap,
    const torch::Tensor& scheduler_metadata,
    const std::optional<torch::Tensor>& s_aux);
Import
#include "cpu_attn_dispatch_generated.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| query | torch::Tensor | Yes | Query tensor [num_tokens, num_heads, head_size] |
| key_cache | torch::Tensor | Yes | Paged key cache [num_blocks, num_kv_heads, block_size, head_size] |
| value_cache | torch::Tensor | Yes | Paged value cache [num_blocks, num_kv_heads, block_size, head_size] |
| output | torch::Tensor | Yes | Pre-allocated output tensor [num_tokens, num_heads, head_size] |
| seq_lens | torch::Tensor | Yes | Per-request sequence lengths [num_req] |
| query_start_loc | torch::Tensor | Yes | Start index of each request's queries [num_req + 1] |
| block_table | torch::Tensor | Yes | Block table mapping requests to KV cache blocks [num_req, max_block_num] |
| scale | double | Yes | Softmax scaling factor (typically 1/sqrt(head_dim)) |
| causal | bool | Yes | Whether to apply causal attention masking |
| isa_hint | std::string | Yes | ISA selection hint: "amx", "vec", "vec16", or "neon" |
| alibi_slopes | torch::Tensor | No | ALiBi attention slopes [num_heads] |
| sliding_window_left | int64_t | No | Left sliding window size (-1 for no window) |
| sliding_window_right | int64_t | No | Right sliding window size (-1 for no window) |
| softcap | double | No | Logits soft-capping value (0 for disabled) |
| scheduler_metadata | torch::Tensor | Yes | Opaque scheduling metadata from get_scheduler_metadata |
| s_aux | torch::Tensor | No | Optional auxiliary tensor forwarded to the attention kernels |
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch::Tensor | Attention result written in-place [num_tokens, num_heads, head_size] |
| scheduler_metadata | torch::Tensor | Returned by get_scheduler_metadata for use in attention execution |
Usage Examples
// Step 1: Get scheduler metadata
auto metadata = get_scheduler_metadata(
    num_req, num_heads_q, num_heads_kv, head_dim,
    seq_lens, torch::kBFloat16, query_start_loc,
    /*casual=*/true, /*window_size=*/-1,
    /*isa_hint=*/"amx", /*enable_kv_split=*/true);
// Step 2: Reshape and cache KV
cpu_attn_reshape_and_cache(key, value, key_cache, value_cache,
                           slot_mapping, "amx");
// Step 3: Execute attention
cpu_attention_with_kv_cache(
    query, key_cache, value_cache, output,
    query_start_loc, seq_lens, scale, /*causal=*/true,
    /*alibi_slopes=*/std::nullopt, /*sliding_window_left=*/-1,
    /*sliding_window_right=*/-1, block_table, /*softcap=*/0.0,
    metadata, /*s_aux=*/std::nullopt);