
Implementation:Vllm project Vllm CPU Attn Dispatcher

From Leeroopedia


Knowledge Sources
Domains Attention, CPU_Inference
Last Updated 2026-02-08 00:00 GMT

Overview

Main CPU attention dispatcher that routes attention computations to ISA-specific kernel implementations (AMX, VEC, VEC16, NEON) based on hardware capabilities.

Description

This file implements three core functions for CPU-based attention: get_scheduler_metadata computes scheduling metadata for attention work partitioning, cpu_attn_reshape_and_cache reshapes and stores key/value tensors into the paged KV cache, and cpu_attention_with_kv_cache executes the full attention computation using cached KV pairs. All functions dispatch to specialized implementations via the CPU_ATTN_DISPATCH macro based on head dimension and ISA type.
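The dispatch pattern can be sketched as follows. `CPU_ATTN_DISPATCH` is the real macro's name, but the enum values, kernel names, and `dispatch_kernel` helper below are hypothetical illustrations of the routing idea, not vLLM's actual generated code (which instantiates templated kernels rather than returning strings).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical ISA enumeration; vLLM's generated header defines its own types.
enum class CpuIsa { AMX, VEC, VEC16, NEON };

// Illustrative stand-in for the CPU_ATTN_DISPATCH macro: pick a kernel
// identifier from the ISA hint and head dimension. The supported head
// dimensions here (64, 128) are assumptions for the sketch.
std::string dispatch_kernel(const std::string& isa_hint, int head_dim) {
  CpuIsa isa;
  if (isa_hint == "amx")        isa = CpuIsa::AMX;
  else if (isa_hint == "vec")   isa = CpuIsa::VEC;
  else if (isa_hint == "vec16") isa = CpuIsa::VEC16;
  else if (isa_hint == "neon")  isa = CpuIsa::NEON;
  else throw std::invalid_argument("unknown ISA hint: " + isa_hint);
  (void)isa;  // real code would select a template specialization per ISA

  // Dispatch on head_dim the way the macro specializes per head size.
  switch (head_dim) {
    case 64:  return isa_hint + "_attn_hd64";
    case 128: return isa_hint + "_attn_hd128";
    default:  throw std::invalid_argument("unsupported head_dim");
  }
}
```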

Usage

These functions are compiled into the vLLM CPU extension and called from the Python attention backend. The ISA hint string ("amx", "vec", "vec16", "neon") is selected at runtime based on detected CPU features, enabling automatic hardware-optimized attention execution.
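A minimal sketch of how runtime ISA selection might work, under stated assumptions: the `CpuFeatures` struct, `pick_isa_hint` helper, and the priority order are all hypothetical, and real detection would use CPUID or platform APIs rather than pre-filled flags.

```cpp
#include <cassert>
#include <string>

// Hypothetical feature flags; real detection queries CPUID / OS APIs.
struct CpuFeatures {
  bool amx_bf16;  // Intel AMX bf16 tile support
  bool avx512f;   // AVX-512 foundation
  bool neon;      // Arm NEON
};

// Map detected features to one of the documented ISA hints, preferring
// the most specialized kernel available. This priority order is an
// assumption for illustration, not vLLM's actual selection policy.
std::string pick_isa_hint(const CpuFeatures& f) {
  if (f.amx_bf16) return "amx";
  if (f.avx512f)  return "vec16";
  if (f.neon)     return "neon";
  return "vec";  // generic vectorized fallback
}
```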

Code Reference

Source Location

Signature

torch::Tensor get_scheduler_metadata(
    const int64_t num_req, const int64_t num_heads_q,
    const int64_t num_heads_kv, const int64_t head_dim,
    const torch::Tensor& seq_lens, at::ScalarType dtype,
    const torch::Tensor& query_start_loc, const bool causal,
    const int64_t window_size, const std::string& isa_hint,
    const bool enable_kv_split);

void cpu_attn_reshape_and_cache(
    const torch::Tensor& key, const torch::Tensor& value,
    torch::Tensor& key_cache, torch::Tensor& value_cache,
    const torch::Tensor& slot_mapping, const std::string& isa);

void cpu_attention_with_kv_cache(
    const torch::Tensor& query, const torch::Tensor& key_cache,
    const torch::Tensor& value_cache, torch::Tensor& output,
    const torch::Tensor& query_start_loc, const torch::Tensor& seq_lens,
    const double scale, const bool causal,
    const std::optional<torch::Tensor>& alibi_slopes,
    const int64_t sliding_window_left, const int64_t sliding_window_right,
    const torch::Tensor& block_table, const double softcap,
    const torch::Tensor& scheduler_metadata,
    const std::optional<torch::Tensor>& s_aux);

Import

#include "cpu_attn_dispatch_generated.h"

I/O Contract

Inputs

Name Type Required Description
query torch::Tensor Yes Query tensor [num_tokens, num_heads, head_size]
key_cache torch::Tensor Yes Paged key cache [num_blocks, num_kv_heads, block_size, head_size]
value_cache torch::Tensor Yes Paged value cache [num_blocks, num_kv_heads, block_size, head_size]
output torch::Tensor Yes Pre-allocated output tensor [num_tokens, num_heads, head_size]
seq_lens torch::Tensor Yes Per-request sequence lengths [num_req]
query_start_loc torch::Tensor Yes Start index of each request's queries [num_req + 1]
block_table torch::Tensor Yes Block table mapping requests to KV cache blocks [num_req, max_block_num]
scale double Yes Softmax scaling factor (typically 1/sqrt(head_dim))
causal bool Yes Whether to apply causal attention masking
isa_hint std::string Yes ISA selection hint: "amx", "vec", "vec16", or "neon"
alibi_slopes torch::Tensor No ALiBi attention slopes [num_heads]
sliding_window_left int64_t No Left sliding window size (-1 for no window)
sliding_window_right int64_t No Right sliding window size (-1 for no window)
softcap double No Logits soft-capping value (0 for disabled)
scheduler_metadata torch::Tensor Yes Opaque scheduling metadata from get_scheduler_metadata
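As a concrete illustration of the `query_start_loc` and `scale` conventions in the table above, a small sketch (the helper names are hypothetical; only the layout convention comes from the contract):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// query_start_loc is an exclusive prefix sum over per-request query
// counts: entry i is where request i's tokens start in the flattened
// [num_tokens, ...] layout, and entry num_req is the total token count.
std::vector<int64_t> build_query_start_loc(
    const std::vector<int64_t>& tokens_per_req) {
  std::vector<int64_t> loc(tokens_per_req.size() + 1, 0);
  for (size_t i = 0; i < tokens_per_req.size(); ++i)
    loc[i + 1] = loc[i] + tokens_per_req[i];
  return loc;
}

// Typical softmax scale, as noted in the contract: 1 / sqrt(head_dim).
double default_scale(int64_t head_dim) {
  return 1.0 / std::sqrt(static_cast<double>(head_dim));
}
```

For example, two requests with 3 and 5 query tokens yield `query_start_loc = [0, 3, 8]`.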

Outputs

Name Type Description
output torch::Tensor Attention result written in-place [num_tokens, num_heads, head_size]
scheduler_metadata torch::Tensor Returned by get_scheduler_metadata for use in attention execution

Usage Examples

// Step 1: Get scheduler metadata
auto metadata = get_scheduler_metadata(
    num_req, num_heads_q, num_heads_kv, head_dim,
    seq_lens, torch::kBFloat16, query_start_loc,
    /*causal=*/true, /*window_size=*/-1,
    /*isa_hint=*/"amx", /*enable_kv_split=*/true);

// Step 2: Reshape and cache KV
cpu_attn_reshape_and_cache(key, value, key_cache, value_cache,
                           slot_mapping, "amx");

// Step 3: Execute attention
cpu_attention_with_kv_cache(
    query, key_cache, value_cache, output,
    query_start_loc, seq_lens, scale, /*causal=*/true,
    /*alibi_slopes=*/std::nullopt, /*sliding_window_left=*/-1,
    /*sliding_window_right=*/-1, block_table, /*softcap=*/0.0,
    metadata, /*s_aux=*/std::nullopt);
