Principle:Bitsandbytes foundation Bitsandbytes XPU SYCL Dequantization

From Leeroopedia


Knowledge Sources
Domains XPU_Backend, SYCL, Dequantization
Last Updated 2026-02-07 13:31 GMT

Overview

Native SYCL kernel implementations for blockwise dequantization and 4-bit GEMV inference on Intel XPU hardware, using vectorized memory access and sub-group parallelism.

Description

Intel XPU devices (Data Center GPU Max, Arc) are programmed through SYCL rather than CUDA. This principle defines how the bitsandbytes dequantization and GEMV operations are implemented as SYCL kernels. Key techniques include: (1) vectorized loads using sycl::vec<uint8_t, N> for coalesced memory access; (2) bit-shift block indexing, which replaces an expensive integer division with a right shift by log2(blocksize), a shift amount derived from countl_zero and valid for power-of-two block sizes; (3) binary decision tree dequantization, which decodes NF4/FP4 codes by branching on their bits instead of indexing a lookup table; (4) sub-group reduction via sycl::reduce_over_group to accumulate the GEMV dot product; and (5) shared local memory to hold the 16-element quantization map in GEMV.
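
Technique (2) can be checked on the host with a few lines of Python. This is a minimal sketch, not the kernel code: the helper names countl_zero32, block_shift, and block_index are hypothetical, and countl_zero32 only imitates a 32-bit leading-zero count such as the one a SYCL kernel would call.

```python
def countl_zero32(x: int) -> int:
    """Leading zero bits of x viewed as a 32-bit unsigned integer (hypothetical helper)."""
    assert 0 <= x < 2**32
    return 32 - x.bit_length()


def block_shift(blocksize: int) -> int:
    """log2(blocksize) for a power-of-two blocksize, derived from the leading-zero count."""
    assert blocksize > 0 and blocksize & (blocksize - 1) == 0, "power of two required"
    return 31 - countl_zero32(blocksize)


def block_index(idx: int, blocksize: int) -> int:
    """Index of the quantization block containing position idx, without a division."""
    return idx >> block_shift(blocksize)


indices = [0, 63, 64, 127, 128, 4095]
shifted = [block_index(i, 64) for i in indices]   # shift-based indexing
divided = [i // 64 for i in indices]              # division-based reference
print(block_shift(64), shifted)                   # prints: 6 [0, 0, 1, 1, 2, 63]
```

The shift and the division agree only because the block size is a power of two; for any other block size the shift amount would not exist as an integer, which is why the sketch asserts that precondition.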

Usage

Apply this principle when porting GPU quantization kernels from CUDA to Intel XPU via SYCL. The same algorithmic patterns (tiled dequantization, sub-group GEMV) map naturally to SYCL with appropriate use of nd_item, sub_group, and local memory APIs.

Theoretical Basis

The dequantization kernel maps each workgroup to a tile of the quantized tensor:

# Pseudo-code for SYCL dequantize kernel
base_idx = group_id * TILE_SIZE
local_idx = local_id * NUM_PER_TH
block_idx = (base_idx + local_idx) >> log2(blocksize)  # bit-shift instead of division
absmax_val = absmax[block_idx]

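for i, q in enumerate(quantized bytes of this thread's local batch):  # i = 0 .. NUM_PER_TH-1
    if DATA_TYPE == NF4:
        vals[2*i]     = NF4_tree(q >> 4) * absmax_val   # high nibble decodes first
        vals[2*i + 1] = NF4_tree(q & 0xF) * absmax_val  # low nibble decodes second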
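
The pseudo-code above can be emulated on the host to see the index arithmetic and the decision tree working together. The following is a sketch under stated assumptions, not the SYCL kernel: the tiny TILE_SIZE and NUM_PER_TH values are hypothetical, the block size is assumed to be counted in packed bytes, and NUM_PER_TH is assumed to divide it so that one thread's batch never straddles two blocks. The leaf constants are the standard NF4 levels.

```python
def nf4_tree(code: int) -> float:
    """Decode a 4-bit NF4 code by branching on its bits, most significant bit first."""
    if code & 0b1000:
        if code & 0b0100:
            if code & 0b0010:
                return 1.0 if code & 0b0001 else 0.7229568362236023
            return 0.5626170039176941 if code & 0b0001 else 0.44070982933044434
        if code & 0b0010:
            return 0.33791524171829224 if code & 0b0001 else 0.24611230194568634
        return 0.16093020141124725 if code & 0b0001 else 0.07958029955625534
    if code & 0b0100:
        if code & 0b0010:
            return 0.0 if code & 0b0001 else -0.09105003625154495
        return -0.18477343022823334 if code & 0b0001 else -0.28444138169288635
    if code & 0b0010:
        return -0.39491748809814453 if code & 0b0001 else -0.5250730514526367
    return -0.6961928009986877 if code & 0b0001 else -1.0


TILE_SIZE = 8    # packed bytes per work-group (hypothetical, kept small for readability)
NUM_PER_TH = 2   # packed bytes per work-item (hypothetical)


def dequantize_nf4(packed: bytes, absmax: list, blocksize_bytes: int) -> list:
    """Sequentially emulate the tiled kernel: outer loop = work-groups, inner = work-items."""
    shift = blocksize_bytes.bit_length() - 1          # log2 of a power-of-two block size
    out = [0.0] * (2 * len(packed))                   # two 4-bit values per packed byte
    num_groups = (len(packed) + TILE_SIZE - 1) // TILE_SIZE
    for group_id in range(num_groups):
        base_idx = group_id * TILE_SIZE
        for local_id in range(TILE_SIZE // NUM_PER_TH):
            first = base_idx + local_id * NUM_PER_TH
            if first >= len(packed):
                continue                              # work-item past the end of the tensor
            absmax_val = absmax[first >> shift]       # one scale per work-item batch
            for i in range(NUM_PER_TH):
                if first + i >= len(packed):
                    break
                q = packed[first + i]
                out[2 * (first + i)] = nf4_tree(q >> 4) * absmax_val       # high nibble
                out[2 * (first + i) + 1] = nf4_tree(q & 0xF) * absmax_val  # low nibble
    return out


# Four packed bytes, two blocks of two bytes each, scales 2.0 and 0.5.
out = dequantize_nf4(bytes([0xF0, 0x7F, 0x0F, 0x77]), [2.0, 0.5], blocksize_bytes=2)
print(out)
```

Each leaf of nf4_tree holds its constant directly, so decoding needs no memory access for a table; the tree is equivalent to indexing the 16-entry NF4 map with the code.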

The GEMV kernel assigns one sub-group per output row, with each lane processing a chunk of the K dimension, using sub-group reduction for the final summation.
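
The GEMV work split can likewise be emulated sequentially. This sketch makes several assumptions that the text does not fix: a hypothetical sub-group size of 4, lanes that stride over the packed bytes of a row (one possible way to give each lane a chunk of K), a block size counted in elements and even, and a plain Python sum standing in for sycl::reduce_over_group. The 16-element NF4 map plays the role of the table the kernel keeps in shared local memory.

```python
# 16-element quantization map; a kernel would stage this in shared local memory.
NF4_MAP = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634, 0.33791524171829224,
    0.44070982933044434, 0.5626170039176941, 0.7229568362236023, 1.0,
]

SUB_GROUP_SIZE = 4  # hypothetical lane count, kept small for readability


def gemv_4bit(x: list, packed_rows: list, absmax_rows: list, blocksize: int) -> list:
    """y[row] = sum_k x[k] * dequant(W[row, k]), one emulated sub-group per output row."""
    num_bytes = len(x) // 2                        # two 4-bit weights per packed byte
    y = []
    for row, packed in enumerate(packed_rows):     # one sub-group per output row
        partial = [0.0] * SUB_GROUP_SIZE           # one private accumulator per lane
        for lane in range(SUB_GROUP_SIZE):
            # Assumed split: lane L handles bytes L, L + SUB_GROUP_SIZE, ...
            for byte_idx in range(lane, num_bytes, SUB_GROUP_SIZE):
                q = packed[byte_idx]
                k = 2 * byte_idx                   # element index of the high nibble
                scale = absmax_rows[row][k // blocksize]
                partial[lane] += x[k] * NF4_MAP[q >> 4] * scale
                partial[lane] += x[k + 1] * NF4_MAP[q & 0xF] * scale
        y.append(sum(partial))                     # stands in for the sub-group reduction
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
packed_rows = [bytes([0xF0, 0x7F, 0x0F, 0xFF]), bytes([0x77, 0x77, 0x77, 0x77])]
absmax_rows = [[2.0, 0.5], [1.0, 1.0]]
y = gemv_4bit(x, packed_rows, absmax_rows, blocksize=4)
print(y)  # prints: [14.0, 0.0]
```

Only the final sum crosses lanes; every multiply-accumulate is private to a lane, which is what lets the real kernel defer communication to a single sub-group reduction per row.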
