Implementation:Vllm project Vllm IPEX Ops

Knowledge Sources	vllm
Domains	Intel_XPU, Attention, Quantization
Last Updated	2026-02-08 00:00 GMT

Overview

Provides Intel Extension for PyTorch (IPEX) operation wrappers for flash attention and quantized GEMM on Intel XPU hardware.

Description

This module defines the ipex_ops class, whose static methods wrap Intel XPU-specific kernels for vLLM. The primary method, flash_attn_varlen_func, delegates to the flash attention implementation in vllm_xpu_kernels and supports variable-length sequences, block tables for paged attention, and sliding window attention. The module also registers fake (abstract) implementations of the fp8_gemm_w8a16 and int4_gemm_w4a16 custom operations via torch.library.register_fake. A fake implementation computes only output metadata such as shape and dtype, which lets torch.compile trace these operations on XPU hardware without running the real kernel.

Usage

Use this module when running vLLM on Intel XPU (GPU) hardware with the IPEX backend. vLLM's platform detection loads it automatically, and it supplies the hardware-specific attention and quantized linear operations that stand in for their CUDA equivalents.

Code Reference

Source Location

Repository: vllm
File: vllm/_ipex_ops.py
Lines: 1-158

Signature

class ipex_ops:
    @staticmethod
    def flash_attn_varlen_func(
        q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
        cu_seqlens_q: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int,
        softmax_scale: float | None = None, causal: bool = False,
        out: torch.Tensor | None = None,
        block_table: torch.Tensor | None = None,
        alibi_slopes: torch.Tensor | None = None,
        window_size: list[int] | None = None,
        softcap: float | None = 0.0,
        seqused_k: torch.Tensor | None = None,
        cu_seqlens_k: torch.Tensor | None = None,
        ...
    ) -> torch.Tensor: ...

    @staticmethod
    def get_scheduler_metadata(...) -> None: ...

# Fake op registrations (for torch.compile tracing)
def _fp8_gemm_w8a16_fake(input, q_weight, weight_scale, bias=None) -> torch.Tensor: ...
def _int4_gemm_w4a16_fake(input, q_weight, bias, weight_scale, qzeros, group_size, group_idx=None) -> torch.Tensor: ...

Import

from vllm._ipex_ops import ipex_ops

I/O Contract

Inputs

Name	Type	Required	Description
q	torch.Tensor	Yes	Query tensor for flash attention
k	torch.Tensor	Yes	Key tensor for flash attention
v	torch.Tensor	Yes	Value tensor for flash attention
cu_seqlens_q	torch.Tensor	Yes	Cumulative sequence lengths for queries
max_seqlen_q	int	Yes	Maximum query sequence length
max_seqlen_k	int	Yes	Maximum key sequence length
softmax_scale	float or None	No	Softmax scaling factor (default computed from head dim)
causal	bool	No	Whether to apply causal masking (default False)
block_table	torch.Tensor or None	No	Block table for paged KV cache attention
cu_seqlens_k	torch.Tensor or None	Conditional	Cumulative key lengths (required when block_table is None)
seqused_k	torch.Tensor or None	Conditional	Used key lengths (required when block_table is provided)
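The conditional rule for cu_seqlens_k and seqused_k in the table above can be sketched as a small check. The helper name select_kv_length_arg is hypothetical and is not part of vllm/_ipex_ops.py; it only mirrors the rule as stated in the table.

```python
from typing import Any, Optional


def select_kv_length_arg(
    block_table: Optional[Any],
    cu_seqlens_k: Optional[Any],
    seqused_k: Optional[Any],
) -> str:
    """Return the name of the key-length argument that applies.

    Hypothetical helper: raises ValueError when the argument required
    by the chosen mode is missing.
    """
    if block_table is None:
        # Contiguous variable-length mode: cumulative key lengths are needed.
        if cu_seqlens_k is None:
            raise ValueError("cu_seqlens_k is required when block_table is None")
        return "cu_seqlens_k"
    # Paged KV cache mode: per-sequence used key lengths are needed.
    if seqused_k is None:
        raise ValueError("seqused_k is required when block_table is provided")
    return "seqused_k"


print(select_kv_length_arg(None, [0, 4, 9], None))
print(select_kv_length_arg([[0, 1], [2, 3]], None, [4, 5]))
```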

Outputs

Name	Type	Description
out	torch.Tensor	Attention output tensor with the same shape as q

Usage Examples

from vllm._ipex_ops import ipex_ops

# Variable-length flash attention on Intel XPU
output = ipex_ops.flash_attn_varlen_func(
    q=query_tensor,        # [total_q, num_heads, head_dim]
    k=key_tensor,          # [total_k, num_kv_heads, head_dim]
    v=value_tensor,        # [total_k, num_kv_heads, head_dim]
    cu_seqlens_q=cu_seq_q, # [batch_size + 1]
    max_seqlen_q=512,
    max_seqlen_k=512,
    cu_seqlens_k=cu_seq_k, # [batch_size + 1]
    causal=True,
)
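The cu_seq_q and cu_seq_k arguments in the call above are cumulative offsets into the packed token dimension. A minimal sketch of deriving them from per-sequence lengths follows; build_cu_seqlens is a hypothetical helper, not a vLLM function.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def build_cu_seqlens(seq_lens: Sequence[int]) -> Tuple[List[int], int]:
    """Return (cumulative offsets of length batch_size + 1, max length)."""
    if not seq_lens:
        raise ValueError("seq_lens must contain at least one sequence")
    # Offsets start at 0; entry i + 1 is where sequence i ends.
    cu_seqlens = [0, *accumulate(seq_lens)]
    return cu_seqlens, max(seq_lens)


cu_seq_q_list, max_seqlen_q = build_cu_seqlens([3, 5, 2])
print(cu_seq_q_list)  # [0, 3, 8, 10]
print(max_seqlen_q)   # 5
```

The last offset equals total_q, the size of the first dimension of q. Before the call, the list would typically be converted to an integer tensor on the XPU device; the exact dtype expected by the kernel is not stated here, and int32 is an assumption carried over from other flash attention implementations.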

Related Pages

Environment:Vllm_project_Vllm_Intel_XPU
