Implementation:Vllm project Vllm IPEX Ops
| Knowledge Sources | |
|---|---|
| Domains | Intel_XPU, Attention, Quantization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Provides Intel Extension for PyTorch (IPEX) operation wrappers for flash attention and quantized GEMM on Intel XPU hardware.
Description
This module defines the ipex_ops class, whose static methods wrap Intel XPU-specific kernels for vLLM. The primary method, flash_attn_varlen_func, delegates to the vllm_xpu_kernels flash attention implementation and supports variable-length sequences, block tables for paged attention, and sliding window attention. The module also registers fake (abstract) implementations for the fp8_gemm_w8a16 and int4_gemm_w4a16 custom operations via torch.library.register_fake. A fake implementation returns an output with the correct shape and dtype without running the kernel, which lets torch.compile trace these operations on XPU hardware.
Usage
Use this module when running vLLM on Intel XPU/GPU hardware with the IPEX backend. vLLM's platform detection loads it automatically, and it provides hardware-specific attention and quantized linear operations as alternatives to the CUDA equivalents.
Code Reference
Source Location
- Repository: vllm
- File: vllm/_ipex_ops.py
- Lines: 1-158
Signature
class ipex_ops:
    @staticmethod
    def flash_attn_varlen_func(
        q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
        cu_seqlens_q: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int,
        softmax_scale: float | None = None, causal: bool = False,
        out: torch.Tensor | None = None,
        block_table: torch.Tensor | None = None,
        alibi_slopes: torch.Tensor | None = None,
        window_size: list[int] | None = None,
        softcap: float | None = 0.0,
        seqused_k: torch.Tensor | None = None,
        cu_seqlens_k: torch.Tensor | None = None,
        ...
    ) -> torch.Tensor: ...

    @staticmethod
    def get_scheduler_metadata(...) -> None: ...

# Fake op registrations (for torch.compile tracing)
def _fp8_gemm_w8a16_fake(input, q_weight, weight_scale, bias=None) -> torch.Tensor: ...
def _int4_gemm_w4a16_fake(input, q_weight, bias, weight_scale, qzeros, group_size, group_idx=None) -> torch.Tensor: ...
Import
from vllm._ipex_ops import ipex_ops
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| q | torch.Tensor | Yes | Query tensor for flash attention |
| k | torch.Tensor | Yes | Key tensor for flash attention |
| v | torch.Tensor | Yes | Value tensor for flash attention |
| cu_seqlens_q | torch.Tensor | Yes | Cumulative sequence lengths for queries |
| max_seqlen_q | int | Yes | Maximum query sequence length |
| max_seqlen_k | int | Yes | Maximum key sequence length |
| softmax_scale | float or None | No | Softmax scaling factor (default computed from head dim) |
| causal | bool | No | Whether to apply causal masking (default False) |
| block_table | torch.Tensor or None | No | Block table for paged KV cache attention |
| cu_seqlens_k | torch.Tensor or None | Conditional | Cumulative key lengths (required when block_table is None) |
| seqused_k | torch.Tensor or None | Conditional | Used key lengths (required when block_table is provided) |
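The block_table and seqused_k rows refer to paged KV cache addressing. The following pure-Python sketch shows the usual lookup idea under stated assumptions: a fixed block size, and one block-table row per sequence listing physical block ids in logical order. The helper names, the block size of 16, and the example block ids are hypothetical and are not taken from vllm/_ipex_ops.py.

```python
def locate_token(block_table_row, position, block_size):
    """Map a logical key position to (physical_block_id, offset_in_block)."""
    logical_block, offset = divmod(position, block_size)
    return block_table_row[logical_block], offset


def blocks_in_use(seqused_k, block_size):
    """Count the block-table entries that hold valid keys for one sequence."""
    return -(-seqused_k // block_size)  # ceiling division


block_size = 16                 # assumed block size
block_table_row = [7, 2, 9, 0]  # hypothetical physical block ids

# Position 37 lies in logical block 2 (37 = 2 * 16 + 5), stored in physical block 9.
location = locate_token(block_table_row, 37, block_size)

# A sequence with 38 valid keys occupies the first 3 entries of its row.
used = blocks_in_use(38, block_size)
```

In this picture, the used-length value tells the kernel how much of each row is meaningful, which may be why cumulative key lengths are not needed in the paged case.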
Outputs
| Name | Type | Description |
|---|---|---|
| out | torch.Tensor | Attention output tensor with same shape as q |
Usage Examples
from vllm._ipex_ops import ipex_ops

# Variable-length flash attention on Intel XPU
output = ipex_ops.flash_attn_varlen_func(
    q=query_tensor,         # [total_q, num_heads, head_dim]
    k=key_tensor,           # [total_k, num_kv_heads, head_dim]
    v=value_tensor,         # [total_k, num_kv_heads, head_dim]
    cu_seqlens_q=cu_seq_q,  # [batch_size + 1]
    max_seqlen_q=512,
    max_seqlen_k=512,
    cu_seqlens_k=cu_seq_k,  # [batch_size + 1]
    causal=True,
)
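The cumulative-length arguments in the call above can be derived from per-sequence lengths. This is a small pure-Python sketch of that bookkeeping; the helper name `build_cu_seqlens` and the example lengths are hypothetical.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return cumulative offsets with a leading 0, one entry per sequence boundary."""
    return [0, *accumulate(seq_lens)]


q_lens = [3, 5, 2]                   # hypothetical query lengths for a batch of 3
cu_seq_q = build_cu_seqlens(q_lens)  # length is batch_size + 1
max_seqlen_q = max(q_lens)           # longest sequence in the batch
total_q = cu_seq_q[-1]               # number of rows in the packed q tensor

# Sequence i occupies rows cu_seq_q[i] : cu_seq_q[i + 1] of the packed tensor.
second_seq_rows = (cu_seq_q[1], cu_seq_q[2])
```

Before the call, the list would be converted to a tensor on the XPU device, for example with `torch.tensor(cu_seq_q, dtype=torch.int32, device="xpu")`. The int32 dtype follows the common flash attention convention and is an assumption here.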