Implementation: Bitsandbytes XPU SYCL Kernels
| Knowledge Sources | Details |
|---|---|
| Domains | XPU_Backend, SYCL, Dequantization, GEMV |
| Last Updated | 2026-02-07 13:31 GMT |
Overview
Intel XPU SYCL kernel implementations for blockwise dequantization and 4-bit quantized matrix-vector multiplication, providing native GPU acceleration on Intel data center and consumer GPUs.
Description
These files implement the core SYCL (oneAPI) kernels for the bitsandbytes XPU backend. kDequantizeBlockwise is a templated SYCL kernel that dequantizes tensors from 8-bit (General8bit) or packed 4-bit (NF4, FP4) formats back to half, bfloat16, or float precision. It uses vectorized loads via sycl::vec, binary decision tree lookups for NF4/FP4, and bit-shift-based block index computation (avoiding expensive division). kgemv_4bit_inference performs 4-bit quantized matrix-vector multiplication using sub-group parallelism, shared memory for the quantization lookup map, and sycl::reduce_over_group for the final reduction. The xpu_ops.cpp file provides C++ launch wrappers that configure workgroup sizes and submit the kernels to SYCL queues.
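Two of the ideas above, the packed-nibble layout and the bit-shift block index, can be shown in a minimal host-side C++ sketch. This is not the SYCL kernel itself: the real kernel evaluates the lookup as a binary decision tree on the nibble bits and uses vectorized `sycl::vec` loads, while this sketch uses a plain table lookup. The NF4 codebook values are the standard 16-entry table from the QLoRA/bitsandbytes NF4 format.

```cpp
#include <cassert>
#include <cstdint>

// Standard NF4 codebook (16 entries, index = 4-bit code).
static const float kNF4[16] = {
    -1.0f, -0.6961928009986877f, -0.5250730514526367f, -0.39491748809814453f,
    -0.28444138169288635f, -0.18477343022823334f, -0.09105003625154495f, 0.0f,
    0.07958029955625534f, 0.16093020141124725f, 0.24611230194568634f,
    0.33791524171829224f, 0.44070982933044434f, 0.5626170039176941f,
    0.7229568362236023f, 1.0f};

// Dequantize element i of a packed NF4 tensor: two 4-bit codes per byte
// (high nibble first), scaled by the per-block absmax. Because blocksize
// is a power of two, the block index is a shift, not a division.
inline float dequant_nf4(const uint8_t* A, const float* absmax,
                         int i, int log2_blocksize) {
    uint8_t byte = A[i >> 1];                        // two values per byte
    uint8_t nib  = (i & 1) ? (byte & 0x0F) : (byte >> 4);
    int block    = i >> log2_blocksize;              // bit-shift block index
    return kNF4[nib] * absmax[block];
}
```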
Usage
These kernels are called by the Python XPU backend (bitsandbytes/backends/xpu/ops.py) through the bitsandbytes native library interface. They are dispatched automatically when running on Intel XPU devices with the SYCL native library available.
Code Reference
Source Location
- Repository: bitsandbytes
- File: csrc/xpu_kernels.cpp
- Lines: 1-282
- File: csrc/xpu_ops.cpp
- Lines: 1-103
Signature
```cpp
// SYCL Kernels (xpu_kernels.cpp)
template <typename T, int TILE_SIZE, int NUM_PER_TH, int DATA_TYPE>
struct kDequantizeBlockwise {
    void operator()(sycl::nd_item<1> item) const;
    // Members: float* code; unsigned char* A; float* absmax; T* out;
    //          int blocksize, n;
};

template <typename T, size_t GROUP_SIZE, size_t NUM_PER_THREAD, size_t SUBG_SIZE, int BITS>
struct kgemv_4bit_inference {
    void operator()(sycl::nd_item<1> item) const;
    // Members: int m, n, k; T* A; unsigned char* B; float *absmax, *datatype;
    //          T* out; int lda, ldb, ldc, blocksize;
};

// Launch Wrappers (xpu_ops.cpp)
template <typename T, int DATA_TYPE>
void dequantizeBlockwise(
    float* code, unsigned char* A, float* absmax, T* out,
    int blocksize, const int n, sycl::queue* stream);

template <typename T, int BITS>
void gemv_4bit_inference(
    int m, int n, int k, T* A, unsigned char* B, float* absmax,
    float* datatype, T* out, int lda, int ldb, int ldc,
    int blocksize, sycl::queue* stream);
```
Import
```cpp
#include "xpu_kernels.h"
#include "xpu_ops.h"
```
I/O Contract
Inputs
The table below covers the dequantizeBlockwise parameters; gemv_4bit_inference additionally takes the matrix dimensions m, n, and k, the packed 4-bit weight matrix B, the quantization lookup table datatype, and the leading dimensions lda, ldb, and ldc (see Signature above).
| Name | Type | Required | Description |
|---|---|---|---|
| code | float* | Yes (8-bit) | Quantization codebook (256 entries) |
| A | unsigned char* | Yes | Quantized input tensor |
| absmax | float* | Yes | Per-block absolute maximum scaling factors |
| blocksize | int | Yes | Elements per quantization block |
| n | int | Yes | Total number of elements |
| stream | sycl::queue* | Yes | SYCL command queue for kernel submission |
Outputs
| Name | Type | Description |
|---|---|---|
| out | T* | Dequantized output (half, bfloat16, or float) |
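The launch wrappers turn `n` into a 1-D ND-range before submitting the kernel. A host-side sketch of that sizing math is below; the constants match the TILE_SIZE=512, NUM_PER_TH=4 instantiations shown later, but the helper name and struct are illustrative, not the actual wrapper code.

```cpp
#include <cassert>
#include <cstddef>

constexpr int kTileSize = 512;    // elements processed per work-group
constexpr int kNumPerThread = 4;  // elements handled per work-item

// Illustrative sketch (assumed names): the real wrapper computes an
// equivalent nd_range and calls queue->submit(...).
struct LaunchConfig {
    size_t num_groups;   // work-groups in the 1-D ND-range
    size_t local_size;   // work-items per group
    size_t global_size;  // num_groups * local_size
};

inline LaunchConfig dequant_launch_config(int n) {
    LaunchConfig cfg;
    // Ceiling division: enough groups to cover all n elements.
    cfg.num_groups  = (static_cast<size_t>(n) + kTileSize - 1) / kTileSize;
    cfg.local_size  = kTileSize / kNumPerThread;  // 512 / 4 = 128 work-items
    cfg.global_size = cfg.num_groups * cfg.local_size;
    return cfg;
}
```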
Usage Examples
Template Instantiations
```cpp
// Available instantiations:
// Dequantize for all dtype x quant_type combinations
template class kDequantizeBlockwise<sycl::half, 512, 4, NF4>;
template class kDequantizeBlockwise<sycl::half, 512, 4, FP4>;
template class kDequantizeBlockwise<sycl::half, 512, 4, General8bit>;
template class kDequantizeBlockwise<float, 512, 4, NF4>;
// ... (also bfloat16)

// GEMV for 16-bit element types (half, bfloat16) and for float (32-bit elements)
template class kgemv_4bit_inference<sycl::half, 128, 4, 32, 16>;
template class kgemv_4bit_inference<sycl::ext::oneapi::bfloat16, 128, 4, 32, 16>;
template class kgemv_4bit_inference<float, 128, 4, 32, 32>;
```
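To make the gemv_4bit_inference data flow concrete, here is a scalar reference for one output row. Layout assumptions are illustrative and not verified against the kernel: B packs two 4-bit codes per byte along k (high nibble first), `datatype` is the 16-entry lookup table held in shared memory by the real kernel, and `absmax` holds one scale per `blocksize` consecutive elements of the row. The actual kernel distributes this loop across a sub-group and combines partial sums with `sycl::reduce_over_group`.

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference: out[row] = dot(A, dequant(B[row, :])).
// Hypothetical helper, for illustration only.
float gemv_row_ref(const float* A, const uint8_t* B_row,
                   const float* datatype, const float* absmax_row,
                   int k, int blocksize) {
    float acc = 0.0f;
    for (int i = 0; i < k; ++i) {
        uint8_t byte = B_row[i >> 1];                      // two codes per byte
        uint8_t nib  = (i & 1) ? (byte & 0x0F) : (byte >> 4);
        // Dequantize on the fly: codebook value times per-block scale.
        acc += A[i] * datatype[nib] * absmax_row[i / blocksize];
    }
    return acc;
}
```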