Principle:Bitsandbytes foundation Bitsandbytes Global INT8 Quantization

Knowledge Sources	Bitsandbytes
Domains	Quantization, INT8, GPU_Optimization
Last Updated	2026-02-07 13:31 GMT

Overview

A quantization technique that maps an entire tensor to INT8 using a single global absolute maximum scaling factor, providing the simplest and fastest quantization at the cost of precision for tensors with non-uniform value distributions.

Description

Global INT8 quantization computes a single absolute maximum (absmax) across the entire tensor and uses it to scale every element into the [-127, 127] range. This is the coarsest quantization granularity: one scaling factor for the whole tensor. It is cheaper than per-row or per-block quantization because it needs only one full-tensor reduction followed by one elementwise kernel, and it stores a single scalar rather than a vector of scales. The trade-off is precision when magnitudes vary across rows or columns: an outlier anywhere in the tensor sets the scale for every region, so small-magnitude regions are mapped onto only a few integer levels. It is therefore best suited to weight matrices with relatively uniform value distributions.
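The outlier effect can be illustrated with a small round-trip experiment. The following is a minimal sketch in plain Python, not the library's kernel; the helper names `quantize_dequantize` and `max_abs_error` and the sample values are hypothetical and chosen only for illustration.

```python
def quantize_dequantize(values, absmax):
    """Round-trip a list of floats through INT8 codes using the given scale."""
    return [round(127 * v / absmax) * absmax / 127 for v in values]


def max_abs_error(original, restored):
    """Largest elementwise reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Two rows of one tensor with very different magnitudes (illustrative values).
small_row = [0.01, -0.02, 0.015, 0.005]
large_row = [50.0, -100.0, 25.0, 75.0]

# Global scaling: the large row dictates the scale for the small row too.
global_absmax = max(abs(v) for v in small_row + large_row)
# Rowwise scaling: the small row gets its own scale.
row_absmax = max(abs(v) for v in small_row)

err_global = max_abs_error(small_row, quantize_dequantize(small_row, global_absmax))
err_rowwise = max_abs_error(small_row, quantize_dequantize(small_row, row_absmax))

print(f"global absmax error on small row:  {err_global:.6f}")
print(f"rowwise absmax error on small row: {err_rowwise:.6f}")
```

With the global scale, every entry of the small row rounds to the integer code 0, so the row is lost entirely; with its own scale, the same row is reconstructed to within half a quantization step.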

Usage

Apply global INT8 quantization for weight matrices in SwitchBack linear layers where per-row granularity is not needed. For activations (which tend to have more varied per-row magnitudes), rowwise quantization is preferred.
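The pairing described above (a global scale for weights, rowwise scales for activations) can be sketched as an INT8 linear layer with integer accumulation followed by a single rescale. This is an illustrative plain-Python sketch under the assumption that `x` has shape (batch, in_features) and `w` has shape (out_features, in_features); the function names are hypothetical and this is not the SwitchBack implementation itself.

```python
def quantize_rowwise(x):
    """One absmax per row; returns INT8 codes and the list of row scales."""
    scales = [max(abs(v) for v in row) for row in x]
    codes = [
        [round(127 * v / s) if s else 0 for v in row]
        for row, s in zip(x, scales)
    ]
    return codes, scales


def quantize_global(w):
    """One absmax for the whole matrix; returns INT8 codes and the scalar scale."""
    absmax = max(abs(v) for row in w for v in row)
    codes = [[round(127 * v / absmax) if absmax else 0 for v in row] for row in w]
    return codes, absmax


def int8_linear(x, w):
    """Approximate x @ w^T: integer dot products, then rescale to floats."""
    qx, row_scales = quantize_rowwise(x)
    qw, w_scale = quantize_global(w)
    out = []
    for q_row, s in zip(qx, row_scales):
        out_row = []
        for w_row in qw:
            acc = sum(a * b for a, b in zip(q_row, w_row))  # integer accumulation
            # Undo both scalings: (s / 127) for the activation row, (w_scale / 127) for weights.
            out_row.append(acc * s * w_scale / (127 * 127))
        out.append(out_row)
    return out


# Activations with very different row magnitudes; weights with similar magnitudes.
x = [[1.0, -2.0], [0.01, 0.02]]
w = [[0.5, 0.25], [-1.0, 0.75]]
y = int8_linear(x, w)
```

Because the weight scale is a single scalar, the rescale after the integer matmul needs only the per-row activation scale and one constant, which is what keeps the global scheme cheap on the weight side.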

Theoretical Basis

$\text{absmax} = \max_{i,j} |X_{ij}|$

$Q_{ij} = \operatorname{round}\left(127 \cdot \frac{X_{ij}}{\text{absmax}}\right)$

Dequantization: $\hat{X}_{ij} = Q_{ij} \cdot \frac{\text{absmax}}{127}$
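The three formulas above translate almost line for line into code. The following is a minimal reference sketch in plain Python operating on nested lists, not the library's GPU kernel; the function names are hypothetical, the handling of an all-zero tensor is an assumption made for this sketch, and Python's built-in `round` resolves exact ties to the nearest even integer.

```python
def quantize_global(x):
    """Quantize a 2-D list of floats to INT8 codes using one global scale."""
    # absmax = max over all i, j of |X_ij|
    absmax = max(abs(v) for row in x for v in row)
    if absmax == 0.0:
        # Assumption for this sketch: an all-zero tensor maps to all-zero codes.
        return [[0 for _ in row] for row in x], 0.0
    # Q_ij = round(127 * X_ij / absmax), which always lies in [-127, 127]
    q = [[round(127 * v / absmax) for v in row] for row in x]
    return q, absmax


def dequantize_global(q, absmax):
    """Recover approximate floats: X_hat_ij = Q_ij * absmax / 127."""
    return [[c * absmax / 127 for c in row] for row in q]


x = [[1.0, -0.5], [0.25, -4.0]]
q, absmax = quantize_global(x)          # q == [[32, -16], [8, -127]], absmax == 4.0
x_hat = dequantize_global(q, absmax)
```

The reconstruction error of any element is bounded by half a quantization step, absmax / 254, so the error grows in proportion to the largest magnitude in the tensor.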

Related Pages

Implementation:Bitsandbytes_foundation_Bitsandbytes_Quantize_Global
