Implementation:Bitsandbytes foundation Bitsandbytes CPU Ops Header

From Leeroopedia
Revision as of 14:34, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Bitsandbytes_foundation_Bitsandbytes_CPU_Ops_Header.md)


Knowledge Sources
Domains CPU_Backend, Dequantization, SIMD
Last Updated 2026-02-07 13:31 GMT

Overview

C++ header providing CPU-specific SIMD utilities, data type conversions, NF4/FP4 dequantization lookup trees, and parallelized dequantization function declarations for the CPU backend.

Description

The cpu_ops.h header is the foundation of the bitsandbytes CPU backend. It provides: (1) AVX512 and AVX512-BF16 feature detection used to choose between SIMD and fallback code paths (for example, the has_avx512f() query), (2) software implementations of FP16 and BF16 conversions with correct rounding, (3) binary-tree NF4 and FP4 dequantization functions that map 4-bit codes to float values, (4) an OpenMP-based 2D parallel tiling utility for multi-threaded dequantization, and (5) template declarations for blockwise 8-bit and 4-bit dequantization kernels and for 4-bit GEMV inference with AVX512-BF16.
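To make the "correct rounding" point concrete, here is a minimal sketch of a software float-to-BF16 conversion using round-to-nearest-even. It is not the header's code: the names float_to_bf16_sketch and bf16_to_float_sketch are hypothetical, BF16 is represented as a raw uint16_t rather than the header's bf16_t, and the NaN handling is an assumption.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch, not the header's float_to_bf16.
// BF16 keeps the top 16 bits of an IEEE-754 float32.
inline uint16_t float_to_bf16_sketch(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));  // bit copy without aliasing issues
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        // NaN: keep the sign and set a mantissa bit so the result stays NaN.
        return static_cast<uint16_t>((bits >> 16) | 0x0040u);
    }
    // Round to nearest even: add 0x7FFF plus the lowest kept bit, then truncate.
    const uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    return static_cast<uint16_t>(bits >> 16);
}

// Widening back is exact: place the 16 bits in the upper half of a float32.
inline float bf16_to_float_sketch(uint16_t h) {
    const uint32_t bits = static_cast<uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```

Values whose mantissa fits in 7 bits, such as 1.0f or 3.140625f, survive the round trip unchanged; other values move to the nearest representable BF16 value.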

Usage

This header is included by csrc/cpu_ops.cpp, which implements the declared templates. It provides the low-level building blocks used when bitsandbytes runs on CPU-only systems, or when an operation is not dispatched to a GPU backend and falls back to the CPU.

Code Reference

Source Location

Signature

// Key functions declared/defined:
inline float dDequantizeFP4(unsigned char val);
inline float dDequantizeNF4(unsigned char val);

void quantize_cpu(float* code, float* A, float* absmax,
                  unsigned char* out, long long blocksize, long long n);

template <typename T>
void dequantizeBlockwise8bitCpu(float* code, unsigned char* A,
    const float* absmax, T* out, long long blocksize, long long n);

template <typename T, int DATA_TYPE>
void dequantizeBlockwise4bitCpu(unsigned char* A, const float* absmax,
    T* out, long long blocksize, long long m, long long n);

template <typename T, int DATA_TYPE>
void gemv_4bit_inference(int64_t M, int64_t N, int64_t K,
    const T* x, const unsigned char* w, const T* absmax,
    T* out, int64_t blocksize, int64_t x_stride, int64_t out_stride);

// Utility functions:
static inline bf16_t float_to_bf16(float x);
static float bf16_to_float(uint16_t bf16);
static inline fp16_t float_to_fp16(float x);
static inline bool has_avx512f();
template <typename func_t> inline void parallel_2d(int m, int n, const func_t& f);

Import

#include "cpu_ops.h"

I/O Contract

Inputs

Name | Type | Required | Description
code | float* | Yes (8-bit) | Quantization codebook (256 entries for 8-bit)
A | unsigned char* | Yes | Quantized input tensor
absmax | const float* (const T* in gemv_4bit_inference) | Yes | Per-block absolute maximum scaling factors
blocksize | long long / int64_t | Yes | Number of elements per quantization block
n | long long | Yes | Total number of elements

Outputs

Name | Type | Description
out | T* | Dequantized output tensor (float, fp16, or bf16)
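The contract above can be illustrated with a scalar reference loop for the 8-bit case. This is a sketch under assumptions, not the implementation in csrc/cpu_ops.cpp: the function name is hypothetical, the output type is fixed to float, and it assumes each output element is the codebook entry selected by the quantized byte, scaled by the absmax of the block that element belongs to.

```cpp
#include <cassert>

// Hypothetical scalar reference; the real kernel is templated and parallelized.
inline void dequantize_blockwise_8bit_ref(const float* code, const unsigned char* A,
                                          const float* absmax, float* out,
                                          long long blocksize, long long n) {
    assert(blocksize > 0);
    for (long long i = 0; i < n; ++i) {
        // A[i] indexes the 256-entry codebook; i / blocksize selects the block scale.
        out[i] = code[A[i]] * absmax[i / blocksize];
    }
}
```

With blocksize = 2 and n = 4, elements 0 and 1 use absmax[0] and elements 2 and 3 use absmax[1], so absmax needs ceil(n / blocksize) entries.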

Usage Examples

Binary Tree NF4 Dequantization

// The NF4 dequantization uses a binary tree lookup:
// 4-bit value 0b1111 -> 1.0f (maximum positive)
// 4-bit value 0b0000 -> -1.0f (maximum negative)
// 4-bit value 0b0111 -> 0.0f (zero)
float val = dDequantizeNF4(0b1111);  // returns 1.0f
float val2 = dDequantizeNF4(0b0111); // returns 0.0f
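For readers who want to see what a binary-tree lookup looks like, the following sketch branches on the four bits from most to least significant. The function name nf4_tree_lookup_sketch is hypothetical and the code is not copied from cpu_ops.h. The 16 constants are the standard NF4 code values, and the three anchor points match the mapping stated above. A flat table is included only to show that both forms give the same result.

```cpp
#include <cassert>

// Standard NF4 code values, indexed by the 4-bit code.
static const float kNF4TableSketch[16] = {
    -1.0f, -0.6961928009986877f, -0.5250730514526367f, -0.39491748809814453f,
    -0.28444138169288635f, -0.18477343022823334f, -0.09105003625154495f, 0.0f,
    0.07958029955625534f, 0.16093020141124725f, 0.24611230194568634f,
    0.33791524171829224f, 0.44070982933044434f, 0.5626170039176941f,
    0.7229568362236023f, 1.0f};

// Hypothetical stand-in for a tree-style lookup: one branch per bit, MSB first.
inline float nf4_tree_lookup_sketch(unsigned char val) {
    assert(val < 16);  // only the low nibble is meaningful
    if (val & 0x8) {
        if (val & 0x4) {
            if (val & 0x2) return (val & 0x1) ? 1.0f : 0.7229568362236023f;
            return (val & 0x1) ? 0.5626170039176941f : 0.44070982933044434f;
        }
        if (val & 0x2) return (val & 0x1) ? 0.33791524171829224f : 0.24611230194568634f;
        return (val & 0x1) ? 0.16093020141124725f : 0.07958029955625534f;
    }
    if (val & 0x4) {
        if (val & 0x2) return (val & 0x1) ? 0.0f : -0.09105003625154495f;
        return (val & 0x1) ? -0.18477343022823334f : -0.28444138169288635f;
    }
    if (val & 0x2) return (val & 0x1) ? -0.39491748809814453f : -0.5250730514526367f;
    return (val & 0x1) ? -0.6961928009986877f : -1.0f;
}
```

The tree form uses four comparisons and no memory load, which is one plausible reason to prefer it over an indexed table in a hot loop. Whether it is faster in practice depends on the compiler and target.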

Parallel 2D Tiling

// Distribute M x N work across OpenMP threads with square tiling
parallel_2d(M, N, [&](int begin_m, int end_m, int begin_n, int end_n) {
    for (int i = begin_m; i < end_m; i++)
        for (int j = begin_n; j < end_n; j++)
            process(i, j);
});
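The callback contract (half-open row and column ranges per tile) can be reproduced with a much simpler helper. The sketch below is not the header's parallel_2d: the name tiled_2d_sketch is hypothetical, the caller passes the tile grid explicitly (an assumption made for clarity, instead of deriving it from the thread count), and the OpenMP pragma is ignored when the file is compiled without OpenMP support.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: split an m x n index space into a tiles_m x tiles_n grid
// and call f(begin_m, end_m, begin_n, end_n) once per non-empty tile.
template <typename func_t>
inline void tiled_2d_sketch(int m, int n, int tiles_m, int tiles_n, const func_t& f) {
    assert(tiles_m > 0 && tiles_n > 0);
    const int block_m = (m + tiles_m - 1) / tiles_m;  // ceil(m / tiles_m)
    const int block_n = (n + tiles_n - 1) / tiles_n;  // ceil(n / tiles_n)
    // Each tile is independent, so the two tile loops can run in parallel.
#pragma omp parallel for collapse(2)
    for (int tm = 0; tm < tiles_m; ++tm) {
        for (int tn = 0; tn < tiles_n; ++tn) {
            const int begin_m = std::min(tm * block_m, m);
            const int end_m = std::min(begin_m + block_m, m);
            const int begin_n = std::min(tn * block_n, n);
            const int end_n = std::min(begin_n + block_n, n);
            if (begin_m < end_m && begin_n < end_n) {
                f(begin_m, end_m, begin_n, end_n);
            }
        }
    }
}
```

Because the tiles do not overlap, a callback that writes only to cells inside its own ranges needs no locking.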
