
Implementation:Vllm project Vllm IPEX Ops

From Leeroopedia


Knowledge Sources
Domains: Intel_XPU, Attention, Quantization
Last Updated: 2026-02-08 00:00 GMT

Overview

Provides Intel Extension for PyTorch (IPEX) operation wrappers for flash attention and quantized GEMM on Intel XPU hardware.

Description

This module defines the ipex_ops class with static methods that wrap Intel XPU-specific kernels for vLLM. The primary method, flash_attn_varlen_func, delegates to the vllm_xpu_kernels flash attention implementation with support for variable-length sequences, block tables for paged attention, and sliding window attention. The module also registers fake (abstract) implementations for fp8_gemm_w8a16 and int4_gemm_w4a16 custom operations via torch.library.register_fake, enabling torch.compile tracing on XPU hardware.
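The block-table idea behind paged attention can be illustrated without any XPU hardware. The sketch below is a plain-Python illustration under stated assumptions, not the vllm_xpu_kernels implementation: the helper name locate_token, the block size of 16, and the example table contents are all hypothetical. It shows how a logical token position within a sequence is mapped to a physical KV-cache block and an offset inside that block.

```python
def locate_token(block_table, seq_idx, pos, block_size):
    """Map a logical token position to (physical_block, offset_in_block).

    block_table[seq_idx][i] holds the physical block id that stores the
    i-th logical block of sequence seq_idx (hypothetical layout).
    """
    logical_block = pos // block_size   # which logical block the token is in
    offset = pos % block_size           # position inside that block
    return block_table[seq_idx][logical_block], offset


# Hypothetical table for two sequences; -1 marks an unused slot.
block_table = [
    [7, 2, 9],
    [4, 0, -1],
]

# Token 37 of sequence 0 lives in logical block 2 (37 // 16), offset 5.
print(locate_token(block_table, 0, 37, 16))  # (9, 5)
```

Because the indirection goes through the table, the physical blocks of one sequence need not be contiguous in the cache.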

Usage

Use this module when running vLLM on Intel XPU (GPU) hardware with the IPEX backend. vLLM's platform detection loads it automatically, so that hardware-specific attention and quantized linear operations are used in place of their CUDA equivalents.

Code Reference

Source Location

Signature

class ipex_ops:
    @staticmethod
    def flash_attn_varlen_func(
        q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
        cu_seqlens_q: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int,
        softmax_scale: float | None = None, causal: bool = False,
        out: torch.Tensor | None = None,
        block_table: torch.Tensor | None = None,
        alibi_slopes: torch.Tensor | None = None,
        window_size: list[int] | None = None,
        softcap: float | None = 0.0,
        seqused_k: torch.Tensor | None = None,
        cu_seqlens_k: torch.Tensor | None = None,
        ...
    ) -> torch.Tensor: ...

    @staticmethod
    def get_scheduler_metadata(...) -> None: ...

# Fake op registrations (for torch.compile tracing)
def _fp8_gemm_w8a16_fake(input, q_weight, weight_scale, bias=None) -> torch.Tensor: ...
def _int4_gemm_w4a16_fake(input, q_weight, bias, weight_scale, qzeros, group_size, group_idx=None) -> torch.Tensor: ...

Import

from vllm._ipex_ops import ipex_ops

I/O Contract

Inputs

Name | Type | Required | Description
q | torch.Tensor | Yes | Query tensor for flash attention
k | torch.Tensor | Yes | Key tensor for flash attention
v | torch.Tensor | Yes | Value tensor for flash attention
cu_seqlens_q | torch.Tensor | Yes | Cumulative sequence lengths for queries
max_seqlen_q | int | Yes | Maximum query sequence length
max_seqlen_k | int | Yes | Maximum key sequence length
softmax_scale | float or None | No | Softmax scaling factor (default computed from head dim)
causal | bool | No | Whether to apply causal masking (default False)
block_table | torch.Tensor or None | No | Block table for paged KV cache attention
cu_seqlens_k | torch.Tensor or None | Conditional | Cumulative key lengths (required when block_table is None)
seqused_k | torch.Tensor or None | Conditional | Used key lengths (required when block_table is provided)
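The two conditional rows in the table above can be expressed as a small pre-flight check. This is a minimal sketch in plain Python; the helper names check_kv_length_args and default_softmax_scale are hypothetical and not part of vLLM. The scale formula 1/sqrt(head_dim) is the conventional scaled dot-product attention default and is an assumption here, since the table only says the default is computed from the head dimension.

```python
import math


def check_kv_length_args(block_table, cu_seqlens_k, seqused_k):
    """Raise if the key-length argument required by the contract is missing."""
    if block_table is None:
        # Non-paged path: keys are packed, so cumulative lengths are needed.
        if cu_seqlens_k is None:
            raise ValueError("cu_seqlens_k is required when block_table is None")
    elif seqused_k is None:
        # Paged path: per-sequence used key lengths are needed.
        raise ValueError("seqused_k is required when block_table is provided")


def default_softmax_scale(head_dim):
    """Conventional default scale (assumption): 1 / sqrt(head_dim)."""
    return 1.0 / math.sqrt(head_dim)


check_kv_length_args(block_table=None, cu_seqlens_k=[0, 3, 8], seqused_k=None)
print(default_softmax_scale(64))  # 0.125
```

Any non-None object stands in for a tensor in this check; only presence is tested, not shape or dtype.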

Outputs

Name | Type | Description
out | torch.Tensor | Attention output tensor with same shape as q

Usage Examples

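from vllm._ipex_ops import ipex_ops

# Variable-length flash attention on Intel XPU.
# query_tensor, key_tensor, value_tensor, cu_seq_q and cu_seq_k are
# placeholders for tensors prepared by the caller.
# block_table is not passed here, so cu_seqlens_k must be supplied.
output = ipex_ops.flash_attn_varlen_func(
    q=query_tensor,        # [total_q, num_heads, head_dim]
    k=key_tensor,          # [total_k, num_kv_heads, head_dim]
    v=value_tensor,        # [total_k, num_kv_heads, head_dim]
    cu_seqlens_q=cu_seq_q, # [batch_size + 1]
    max_seqlen_q=512,      # longest query sequence in the batch
    max_seqlen_k=512,      # longest key sequence in the batch
    cu_seqlens_k=cu_seq_k, # [batch_size + 1]
    causal=True,
)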
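The cumulative-length arguments in the call above have shape [batch_size + 1] because they store the start offset of every sequence plus the total token count. The sketch below shows how such offsets relate to per-sequence lengths, using plain Python lists for illustration; the helper names build_cu_seqlens and sequence_slice are hypothetical, and real inputs would be torch.Tensor values as the signature requires.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return cumulative offsets [0, l0, l0+l1, ...] for packed sequences."""
    return [0, *accumulate(seq_lens)]


def sequence_slice(cu_seqlens, i):
    """Slice of the packed token dimension that belongs to sequence i."""
    return slice(cu_seqlens[i], cu_seqlens[i + 1])


seq_lens = [3, 5, 2]                  # three sequences packed back to back
cu_seqlens = build_cu_seqlens(seq_lens)
max_seqlen = max(seq_lens)            # value to pass as max_seqlen_q / max_seqlen_k

print(cu_seqlens)                     # [0, 3, 8, 10]
print(max_seqlen)                     # 5
print(sequence_slice(cu_seqlens, 1))  # slice(3, 8, None)
```

The last entry of the offsets equals the total number of packed tokens, which is the total_q or total_k dimension noted in the shape comments of the example.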
