Implementation: Bitsandbytes XPU SYCL Kernels
| Knowledge Sources | Details |
|---|---|
| Domains | XPU_Backend, SYCL, Dequantization, GEMV |
| Last Updated | 2026-02-07 13:31 GMT |
Overview
Intel XPU SYCL kernel implementations for blockwise dequantization and 4-bit quantized matrix-vector multiplication, providing native GPU acceleration on Intel data center and consumer GPUs.
Description
These files implement the core SYCL (oneAPI) kernels for the bitsandbytes XPU backend. kDequantizeBlockwise is a templated SYCL kernel that dequantizes tensors from 8-bit (General8bit) or packed 4-bit (NF4, FP4) formats back to half, bfloat16, or float precision. It uses vectorized loads via sycl::vec, binary decision tree lookups for NF4/FP4, and bit-shift-based block index computation (avoiding expensive division). kgemv_4bit_inference performs 4-bit quantized matrix-vector multiplication using sub-group parallelism, shared memory for the quantization lookup map, and sycl::reduce_over_group for the final reduction. The xpu_ops.cpp file provides C++ launch wrappers that configure workgroup sizes and submit the kernels to SYCL queues.
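Two of the ideas above, the packed-nibble layout and the bit-shift block index, can be shown in a minimal host-side C++ sketch. This is not the SYCL kernel itself: the real kernel evaluates the lookup as a binary decision tree on the nibble bits and uses vectorized `sycl::vec` loads, while this sketch uses a plain table lookup. The NF4 codebook values are the standard 16-entry table from the QLoRA/bitsandbytes NF4 format.

```cpp
#include <cassert>
#include <cstdint>

// Standard NF4 codebook (16 entries, index = 4-bit code).
static const float kNF4[16] = {
    -1.0f, -0.6961928009986877f, -0.5250730514526367f, -0.39491748809814453f,
    -0.28444138169288635f, -0.18477343022823334f, -0.09105003625154495f, 0.0f,
    0.07958029955625534f, 0.16093020141124725f, 0.24611230194568634f,
    0.33791524171829224f, 0.44070982933044434f, 0.5626170039176941f,
    0.7229568362236023f, 1.0f};

// Dequantize element i of a packed NF4 tensor: two 4-bit codes per byte
// (high nibble first), scaled by the per-block absmax. Because blocksize
// is a power of two, the block index is a shift, not a division.
inline float dequant_nf4(const uint8_t* A, const float* absmax,
                         int i, int log2_blocksize) {
    uint8_t byte = A[i >> 1];                        // two values per byte
    uint8_t nib  = (i & 1) ? (byte & 0x0F) : (byte >> 4);
    int block    = i >> log2_blocksize;              // bit-shift block index
    return kNF4[nib] * absmax[block];
}
```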
Usage
These kernels are called by the Python XPU backend (bitsandbytes/backends/xpu/ops.py) through the bitsandbytes native library interface. They are dispatched automatically when running on Intel XPU devices with the SYCL native library available.
Code Reference
Source Location
- Repository: bitsandbytes
- File: csrc/xpu_kernels.cpp
- Lines: 1-282
- File: csrc/xpu_ops.cpp
- Lines: 1-103
Signature
```cpp
// SYCL Kernels (xpu_kernels.cpp)
template <typename T, int TILE_SIZE, int NUM_PER_TH, int DATA_TYPE>
struct kDequantizeBlockwise {
    void operator()(sycl::nd_item<1> item) const;
    // Members: float* code; unsigned char* A; float* absmax; T* out;
    //          int blocksize, n;
};

template <typename T, size_t GROUP_SIZE, size_t NUM_PER_THREAD, size_t SUBG_SIZE, int BITS>
struct kgemv_4bit_inference {
    void operator()(sycl::nd_item<1> item) const;
    // Members: int m, n, k; T* A; unsigned char* B; float *absmax, *datatype;
    //          T* out; int lda, ldb, ldc, blocksize;
};

// Launch Wrappers (xpu_ops.cpp)
template <typename T, int DATA_TYPE>
void dequantizeBlockwise(
    float* code, unsigned char* A, float* absmax, T* out,
    int blocksize, const int n, sycl::queue* stream);

template <typename T, int BITS>
void gemv_4bit_inference(
    int m, int n, int k, T* A, unsigned char* B, float* absmax,
    float* datatype, T* out, int lda, int ldb, int ldc,
    int blocksize, sycl::queue* stream);
```
Import
```cpp
#include "xpu_kernels.h"
#include "xpu_ops.h"
```
I/O Contract
Inputs
The table below covers the dequantizeBlockwise parameters; gemv_4bit_inference additionally takes the matrix dimensions m, n, and k, the packed 4-bit weight matrix B, the quantization lookup table datatype, and the leading dimensions lda, ldb, and ldc (see Signature above).
| Name | Type | Required | Description |
|---|---|---|---|
| code | float* | Yes (8-bit) | Quantization codebook (256 entries) |
| A | unsigned char* | Yes | Quantized input tensor |
| absmax | float* | Yes | Per-block absolute maximum scaling factors |
| blocksize | int | Yes | Elements per quantization block |
| n | int | Yes | Total number of elements |
| stream | sycl::queue* | Yes | SYCL command queue for kernel submission |
Outputs
| Name | Type | Description |
|---|---|---|
| out | T* | Dequantized output (half, bfloat16, or float) |
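The launch wrappers turn `n` into a 1-D ND-range before submitting the kernel. A host-side sketch of that sizing math is below; the constants match the TILE_SIZE=512, NUM_PER_TH=4 instantiations shown later, but the helper name and struct are illustrative, not the actual wrapper code.

```cpp
#include <cassert>
#include <cstddef>

constexpr int kTileSize = 512;    // elements processed per work-group
constexpr int kNumPerThread = 4;  // elements handled per work-item

// Illustrative sketch (assumed names): the real wrapper computes an
// equivalent nd_range and calls queue->submit(...).
struct LaunchConfig {
    size_t num_groups;   // work-groups in the 1-D ND-range
    size_t local_size;   // work-items per group
    size_t global_size;  // num_groups * local_size
};

inline LaunchConfig dequant_launch_config(int n) {
    LaunchConfig cfg;
    // Ceiling division: enough groups to cover all n elements.
    cfg.num_groups  = (static_cast<size_t>(n) + kTileSize - 1) / kTileSize;
    cfg.local_size  = kTileSize / kNumPerThread;  // 512 / 4 = 128 work-items
    cfg.global_size = cfg.num_groups * cfg.local_size;
    return cfg;
}
```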
Usage Examples
Template Instantiations
```cpp
// Available instantiations:
// Dequantize for all dtype x quant_type combinations
template class kDequantizeBlockwise<sycl::half, 512, 4, NF4>;
template class kDequantizeBlockwise<sycl::half, 512, 4, FP4>;
template class kDequantizeBlockwise<sycl::half, 512, 4, General8bit>;
template class kDequantizeBlockwise<float, 512, 4, NF4>;
// ... (also bfloat16)

// GEMV for 16-bit element types (half, bfloat16) and for float (32-bit elements)
template class kgemv_4bit_inference<sycl::half, 128, 4, 32, 16>;
template class kgemv_4bit_inference<sycl::ext::oneapi::bfloat16, 128, 4, 32, 16>;
template class kgemv_4bit_inference<float, 128, 4, 32, 32>;
```
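To make the gemv_4bit_inference data flow concrete, here is a scalar reference for one output row. Layout assumptions are illustrative and not verified against the kernel: B packs two 4-bit codes per byte along k (high nibble first), `datatype` is the 16-entry lookup table held in shared memory by the real kernel, and `absmax` holds one scale per `blocksize` consecutive elements of the row. The actual kernel distributes this loop across a sub-group and combines partial sums with `sycl::reduce_over_group`.

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference: out[row] = dot(A, dequant(B[row, :])).
// Hypothetical helper, for illustration only.
float gemv_row_ref(const float* A, const uint8_t* B_row,
                   const float* datatype, const float* absmax_row,
                   int k, int blocksize) {
    float acc = 0.0f;
    for (int i = 0; i < k; ++i) {
        uint8_t byte = B_row[i >> 1];                      // two codes per byte
        uint8_t nib  = (i & 1) ? (byte & 0x0F) : (byte >> 4);
        // Dequantize on the fly: codebook value times per-block scale.
        acc += A[i] * datatype[nib] * absmax_row[i / blocksize];
    }
    return acc;
}
```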