Principle:Bitsandbytes foundation Bitsandbytes Global INT8 Quantization
| Knowledge Sources | |
|---|---|
| Domains | Quantization, INT8, GPU_Optimization |
| Last Updated | 2026-02-07 13:31 GMT |
Overview
A quantization technique that maps an entire tensor to INT8 using a single global absolute maximum scaling factor, providing the simplest and fastest quantization at the cost of precision for tensors with non-uniform value distributions.
Description
Global INT8 quantization computes a single absolute maximum value (absmax) across the entire tensor, divides every element by it, multiplies by 127, and rounds to the nearest integer, so all elements land in the [-127, 127] range. This is the coarsest quantization granularity: one scaling factor for the whole tensor. It is faster than per-row or per-block quantization because it requires only one full-tensor reduction and one elementwise kernel, and it stores a single scale rather than one per row or block. However, it loses precision when value magnitudes vary across rows or columns, because an outlier in one region stretches the scale for every region and leaves fewer INT8 levels for the smaller values. It is appropriate for weight matrices with relatively uniform value distributions.
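The scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the bitsandbytes implementation: the helper names `quantize_global` and `dequantize_global` are hypothetical, and the zero-tensor guard is an added convenience.

```python
import numpy as np

def quantize_global(x: np.ndarray):
    """Quantize a whole tensor to INT8 with one absmax scale (hypothetical helper)."""
    # One reduction over the entire tensor gives the single scaling factor.
    absmax = float(np.max(np.abs(x)))
    if absmax == 0.0:
        # Assumption: an all-zero tensor quantizes to all zeros.
        return np.zeros_like(x, dtype=np.int8), absmax
    # One elementwise step: scale into [-127, 127] and round to nearest integer.
    q = np.rint(127.0 * (x / absmax)).astype(np.int8)
    return q, absmax

def dequantize_global(q: np.ndarray, absmax: float) -> np.ndarray:
    """Map INT8 codes back to floats using the stored global scale."""
    return q.astype(np.float32) * (absmax / 127.0)

x = np.array([[0.5, -1.0, 0.25],
              [2.0, -0.1, 0.0]], dtype=np.float32)
q, absmax = quantize_global(x)
x_hat = dequantize_global(q, absmax)

# The round-trip error is bounded by half a quantization step, absmax / 254.
max_err = float(np.max(np.abs(x - x_hat)))
print(q, absmax, max_err)
```

Only the INT8 codes and one float need to be stored, which is what makes the global variant cheap compared with keeping a scale per row or per block.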
Usage
Apply global INT8 quantization to weight matrices in SwitchBack linear layers where per-row granularity is not needed. For activations, whose magnitudes tend to vary more from row to row, rowwise quantization is preferred.
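To illustrate why rowwise quantization suits activations better, the following sketch compares round-trip error for the two granularities on a synthetic tensor in which one row is much larger than the rest. It is plain NumPy under assumed names (`roundtrip_global`, `roundtrip_rowwise`) and synthetic data; it does not call any bitsandbytes function.

```python
import numpy as np

def roundtrip_global(x: np.ndarray) -> np.ndarray:
    """Quantize with one scale for the whole tensor, then dequantize."""
    absmax = np.max(np.abs(x))
    q = np.rint(127.0 * x / absmax).astype(np.int8)
    return q.astype(np.float32) * (absmax / 127.0)

def roundtrip_rowwise(x: np.ndarray) -> np.ndarray:
    """Quantize with one scale per row, then dequantize."""
    absmax = np.max(np.abs(x), axis=1, keepdims=True)
    q = np.rint(127.0 * x / absmax).astype(np.int8)
    return q.astype(np.float32) * (absmax / 127.0)

# Synthetic "activation-like" tensor: row 0 has a much larger magnitude.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
x[0] *= 100.0

# Mean absolute round-trip error per row for each granularity.
err_global = np.abs(x - roundtrip_global(x)).mean(axis=1)
err_rowwise = np.abs(x - roundtrip_rowwise(x)).mean(axis=1)
print("global :", err_global)
print("rowwise:", err_rowwise)
```

With a global scale, the large row dictates the step size for every row, so the small rows collapse onto a handful of INT8 levels. With per-row scales, each row keeps its own resolution.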
Theoretical Basis
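Quantization, as described above, can be written as follows. The symbols are notation chosen here for illustration: $X$ is the input tensor, $s$ its global absolute maximum, and $X^{\mathrm{int8}}$ the quantized result.

$$
s = \max_{i,j} \lvert X_{ij} \rvert, \qquad
X^{\mathrm{int8}}_{ij} = \operatorname{round}\!\left( \frac{127 \, X_{ij}}{s} \right) \in [-127, 127]
$$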
Dequantization: