Implementation:Bitsandbytes foundation Bitsandbytes CPU Ops Header
| Knowledge Sources | |
|---|---|
| Domains | CPU_Backend, Dequantization, SIMD |
| Last Updated | 2026-02-07 13:31 GMT |
Overview
C++ header providing CPU-specific SIMD utilities, data type conversions, NF4/FP4 dequantization lookup trees, and parallelized dequantization function declarations for the CPU backend.
Description
The cpu_ops.h header is the foundation of the bitsandbytes CPU backend. It provides:
- compile-time AVX512/AVX512-BF16 feature detection for runtime dispatch
- software implementations of FP16 and BF16 conversions with correct rounding
- binary-tree-based NF4 and FP4 dequantization functions that map 4-bit codes to float values
- an OpenMP-based 2D parallel tiling utility for efficient multi-threaded dequantization
- template declarations for blockwise 8-bit and 4-bit dequantization kernels and for 4-bit GEMV inference with AVX512-BF16
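The software BF16 conversion mentioned above can be pictured as plain bit manipulation. The following is a minimal sketch, assuming that "correct rounding" means round-to-nearest-even. The names `sketch_float_to_bf16` and `sketch_bf16_to_float` are hypothetical and are not the header's own `float_to_bf16` / `bf16_to_float` definitions.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; not copied from cpu_ops.h.

// Convert float -> BF16 bit pattern (upper 16 bits of the IEEE-754 float),
// rounding to nearest and breaking ties toward an even result.
inline uint16_t sketch_float_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));  // type-pun without undefined behavior
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        // NaN: keep the sign and make sure the result stays a (quiet) NaN.
        return static_cast<uint16_t>((bits >> 16) | 0x0040u);
    }
    uint32_t lsb = (bits >> 16) & 1u;  // lowest bit that survives truncation
    bits += 0x7FFFu + lsb;             // adds 0x8000 on odd, 0x7FFF on even
    return static_cast<uint16_t>(bits >> 16);
}

// Convert BF16 bit pattern -> float by restoring the dropped low 16 bits as zeros.
inline float sketch_bf16_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```

Widening BF16 back to float is exact, so any value that survives the narrowing step unchanged round-trips bit-for-bit.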
Usage
This header is included by csrc/cpu_ops.cpp, which implements the declared templates. It provides the low-level building blocks used when bitsandbytes runs on CPU-only systems, or when an operation is not dispatched to a GPU backend and falls back to the CPU.
Code Reference
Source Location
- Repository: bitsandbytes
- File: csrc/cpu_ops.h
- Lines: 1-334
Signature
// Key functions declared/defined:
inline float dDequantizeFP4(unsigned char val);
inline float dDequantizeNF4(unsigned char val);

void quantize_cpu(float* code, float* A, float* absmax,
                  unsigned char* out, long long blocksize, long long n);

template <typename T>
void dequantizeBlockwise8bitCpu(float* code, unsigned char* A,
                                const float* absmax, T* out, long long blocksize, long long n);

template <typename T, int DATA_TYPE>
void dequantizeBlockwise4bitCpu(unsigned char* A, const float* absmax,
                                T* out, long long blocksize, long long m, long long n);

template <typename T, int DATA_TYPE>
void gemv_4bit_inference(int64_t M, int64_t N, int64_t K,
                         const T* x, const unsigned char* w, const T* absmax,
                         T* out, int64_t blocksize, int64_t x_stride, int64_t out_stride);

// Utility functions:
static inline bf16_t float_to_bf16(float x);
static float bf16_to_float(uint16_t bf16);
static inline fp16_t float_to_fp16(float x);
static inline bool has_avx512f();
template <typename func_t> inline void parallel_2d(int m, int n, const func_t& f);
Import
#include "cpu_ops.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| code | float* | Yes (8-bit) | Quantization codebook (256 entries for 8-bit) |
| A | unsigned char* | Yes | Quantized input tensor |
| absmax | const float* (const T* in gemv_4bit_inference; float* in quantize_cpu) | Yes | Per-block absolute maximum scaling factors |
| blocksize | long long (int64_t in gemv_4bit_inference) | Yes | Number of elements per quantization block |
| n | long long | Yes | Total number of elements |
Outputs
| Name | Type | Description |
|---|---|---|
| out | T* | Dequantized output tensor (float, fp16, or bf16) |
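To make the contract above concrete, blockwise 8-bit dequantization can be read as a codebook lookup followed by a per-block rescale. The scalar reference below is a sketch under that assumption: `sketch_dequantize_blockwise_8bit` is a hypothetical name, the output type is fixed to `float`, and the threading and SIMD of the real `dequantizeBlockwise8bitCpu` kernel are omitted.

```cpp
#include <cassert>

// Hypothetical scalar reference, not the library kernel.
// code:      256-entry codebook indexed by the quantized byte
// A:         n quantized bytes
// absmax:    one scale per block of `blocksize` elements
// out:       n dequantized floats
inline void sketch_dequantize_blockwise_8bit(const float* code,
                                             const unsigned char* A,
                                             const float* absmax,
                                             float* out,
                                             long long blocksize,
                                             long long n) {
    assert(blocksize > 0);
    for (long long i = 0; i < n; ++i) {
        // Element i belongs to block i / blocksize and uses that block's scale.
        out[i] = code[A[i]] * absmax[i / blocksize];
    }
}
```

With `blocksize = 2`, elements 0 and 1 are scaled by `absmax[0]` and elements 2 and 3 by `absmax[1]`, so `absmax` needs `ceil(n / blocksize)` entries.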
Usage Examples
Binary Tree NF4 Dequantization
// The NF4 dequantization uses a binary tree lookup:
// 4-bit value 0b1111 -> 1.0f (largest positive value)
// 4-bit value 0b0000 -> -1.0f (most negative value)
// 4-bit value 0b0111 -> 0.0f (zero)
float val = dDequantizeNF4(0b1111); // returns 1.0f
float val2 = dDequantizeNF4(0b0111); // returns 0.0f
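For readers who want to see what a binary-tree lookup of this kind looks like, the sketch below tests one bit per level, from the most significant bit down, and returns the standard NF4 code value at each leaf. It is an illustration rather than the header's source: `sketch_nf4_tree` and `kSketchNF4` are hypothetical names, and the table is included only to cross-check the tree.

```cpp
#include <cassert>

// Standard NF4 code values, indexed by the 4-bit code (hypothetical table name).
static const float kSketchNF4[16] = {
    -1.0f, -0.6961928009986877f, -0.5250730514526367f, -0.39491748809814453f,
    -0.28444138169288635f, -0.18477343022823334f, -0.09105003625154495f, 0.0f,
    0.07958029955625534f, 0.16093020141124725f, 0.24611230194568634f,
    0.33791524171829224f, 0.44070982933044434f, 0.5626170039176941f,
    0.7229568362236023f, 1.0f};

// Hypothetical tree walk: each `if` consumes one bit, so every code
// reaches its leaf after exactly four bit tests.
inline float sketch_nf4_tree(unsigned char val) {
    assert(val < 16);
    if (val & 0b1000) {
        if (val & 0b0100) {
            if (val & 0b0010) return (val & 0b0001) ? 1.0f : 0.7229568362236023f;               // 1111, 1110
            return (val & 0b0001) ? 0.5626170039176941f : 0.44070982933044434f;                  // 1101, 1100
        }
        if (val & 0b0010) return (val & 0b0001) ? 0.33791524171829224f : 0.24611230194568634f;  // 1011, 1010
        return (val & 0b0001) ? 0.16093020141124725f : 0.07958029955625534f;                     // 1001, 1000
    }
    if (val & 0b0100) {
        if (val & 0b0010) return (val & 0b0001) ? 0.0f : -0.09105003625154495f;                  // 0111, 0110
        return (val & 0b0001) ? -0.18477343022823334f : -0.28444138169288635f;                   // 0101, 0100
    }
    if (val & 0b0010) return (val & 0b0001) ? -0.39491748809814453f : -0.5250730514526367f;     // 0011, 0010
    return (val & 0b0001) ? -0.6961928009986877f : -1.0f;                                        // 0001, 0000
}
```

A branch tree like this keeps the constants in the instruction stream instead of loading them from memory; the table form is the more common choice when a memory lookup is acceptable.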
Parallel 2D Tiling
// Distribute M x N work across OpenMP threads; each thread receives one
// roughly square tile [begin_m, end_m) x [begin_n, end_n)
parallel_2d(M, N, [&](int begin_m, int end_m, int begin_n, int end_n) {
    for (int i = begin_m; i < end_m; i++)
        for (int j = begin_n; j < end_n; j++)
            process(i, j);
});
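The tile bounds passed to the callback can be derived from the problem shape and the thread count. The sketch below shows one plausible way to do that. It assumes the goal is a thread grid of `nth_m x nth_n == nth` with tiles as close to square as possible. `sketch_tiles_2d` is a hypothetical, serial stand-in that visits the tiles one after another instead of launching OpenMP threads, and it is not the header's `parallel_2d`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical serial stand-in for a 2D tiling helper (illustration only).
// Calls f(begin_m, end_m, begin_n, end_n) once per non-empty tile.
template <typename func_t>
inline void sketch_tiles_2d(int m, int n, int nth, const func_t& f) {
    assert(m > 0 && n > 0 && nth > 0);
    // Pick nth_m so that (m / nth_m) is close to (n / nth_n), i.e. tiles are near square.
    float ratio = static_cast<float>(m) / static_cast<float>(n);
    int nth_m = static_cast<int>(std::ceil(std::sqrt(ratio * nth)));
    nth_m = std::max(1, std::min(nth, nth_m));
    int nth_n = 1;
    // Walk down until nth_m divides nth exactly; nth_m == 1 always succeeds.
    for (; nth_m > 0; --nth_m) {
        nth_n = nth / nth_m;
        if (nth_m * nth_n == nth) break;
    }
    int block_m = (m + nth_m - 1) / nth_m;  // ceil(m / nth_m)
    int block_n = (n + nth_n - 1) / nth_n;  // ceil(n / nth_n)
    for (int t = 0; t < nth; ++t) {         // one iteration per would-be thread
        int begin_m = (t / nth_n) * block_m;
        int begin_n = (t % nth_n) * block_n;
        int end_m = std::min(m, begin_m + block_m);
        int end_n = std::min(n, begin_n + block_n);
        if (begin_m < end_m && begin_n < end_n) f(begin_m, end_m, begin_n, end_n);
    }
}
```

In a threaded version the loop over `t` would be replaced by a parallel region in which each thread computes its own tile from its thread index; because the tiles are disjoint, the callback needs no synchronization when it writes only inside its tile.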