Principle:Bitsandbytes foundation Bitsandbytes Global INT8 Quantization

From Leeroopedia


Knowledge Sources
Domains: Quantization, INT8, GPU_Optimization
Last Updated: 2026-02-07 13:31 GMT

Overview

A quantization technique that maps an entire tensor to INT8 using a single global absolute maximum scaling factor, providing the simplest and fastest quantization at the cost of precision for tensors with non-uniform value distributions.

Description

Global INT8 quantization computes a single absolute maximum (absmax) over the entire tensor and uses it to scale every element into the range [-127, 127]. This is the coarsest quantization granularity: one scaling factor for the whole tensor. It is cheaper than per-row or per-block quantization because it requires only one reduction and one elementwise kernel. The cost is precision: when magnitudes vary across rows or columns, a single outlier sets the scale for every element, so regions with small values are rounded onto only a few INT8 levels. It is therefore best suited to weight matrices with relatively uniform value distributions.
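The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the bitsandbytes implementation: the helper names `quantize_global_int8` and `dequantize_global_int8` are hypothetical, and the zero-tensor guard is an assumption added to keep the sketch safe to run.

```python
import numpy as np


def quantize_global_int8(x):
    """Quantize a whole tensor to INT8 with one absmax scale (illustrative helper)."""
    x = np.asarray(x, dtype=np.float32)
    absmax = float(np.max(np.abs(x)))  # single reduction over the entire tensor
    if absmax == 0.0:
        # Assumption: an all-zero tensor maps to all-zero codes (avoids 0/0).
        return np.zeros(x.shape, dtype=np.int8), absmax
    q = np.round(127.0 * x / absmax).astype(np.int8)  # one elementwise pass
    return q, absmax


def dequantize_global_int8(q, absmax):
    """Map INT8 codes back to floats using the stored global scale."""
    return q.astype(np.float32) * (absmax / 127.0)


x = np.array([[0.5, -1.5], [0.25, 2.0]], dtype=np.float32)
q, absmax = quantize_global_int8(x)
x_hat = dequantize_global_int8(q, absmax)
print(q)                          # INT8 codes
print(np.max(np.abs(x - x_hat)))  # worst-case round-trip error
```

Only the INT8 codes and one floating-point scalar need to be stored, which is what makes this the cheapest granularity.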

Usage

Apply global INT8 quantization for weight matrices in SwitchBack linear layers where per-row granularity is not needed. For activations (which tend to have more varied per-row magnitudes), rowwise quantization is preferred.
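To see why rowwise scaling is preferred when row magnitudes differ, the following sketch compares the round-trip error of the two granularities on a synthetic two-row matrix. It is an illustration under stated assumptions: the helper `quantize_dequantize` is hypothetical, the data is random rather than taken from a real layer, and nothing here calls the bitsandbytes API.

```python
import numpy as np


def quantize_dequantize(x, absmax):
    """Round-trip x through INT8; absmax may be a scalar or a per-row column."""
    q = np.round(127.0 * x / absmax).astype(np.int8)
    return q.astype(np.float32) * absmax / 127.0


rng = np.random.default_rng(0)
small = 0.01 * rng.standard_normal(64)  # row with small magnitudes
large = 10.0 * rng.standard_normal(64)  # row with large magnitudes
x = np.stack([small, large]).astype(np.float32)

global_absmax = np.max(np.abs(x))                      # one scale for the tensor
row_absmax = np.max(np.abs(x), axis=1, keepdims=True)  # one scale per row

# Mean absolute round-trip error, reported per row.
err_global = np.mean(np.abs(x - quantize_dequantize(x, global_absmax)), axis=1)
err_rowwise = np.mean(np.abs(x - quantize_dequantize(x, row_absmax)), axis=1)

print("global :", err_global)
print("rowwise:", err_rowwise)
```

With a global scale, the large row dictates the quantization step, so the small row collapses onto very few INT8 levels; with rowwise scales each row uses the full range. When all rows have similar magnitudes the two approaches give similar error, which is the case where the cheaper global scale is a reasonable choice.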

Theoretical Basis

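$$\text{absmax} = \max_{i,j} |X_{ij}|$$

$$Q_{ij} = \operatorname{round}\left(\frac{127 \, X_{ij}}{\text{absmax}}\right)$$

Dequantization:

$$\hat{X}_{ij} = \frac{Q_{ij} \cdot \text{absmax}}{127}$$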
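As a worked example of the formulas above, take an illustrative tensor with absmax = 2 and a single element X = 0.5 (values chosen for this example, not taken from the source). Because rounding changes the scaled value by at most 1/2, the round-trip error of any element is bounded by absmax / 254:

```latex
\begin{aligned}
Q &= \operatorname{round}\left(\frac{127 \cdot 0.5}{2}\right)
   = \operatorname{round}(31.75) = 32 \\
\hat{X} &= \frac{32 \cdot 2}{127} \approx 0.50394 \\
|X - \hat{X}| &\approx 0.00394
   \;\le\; \frac{\text{absmax}}{2 \cdot 127} = \frac{2}{254} \approx 0.00787
\end{aligned}
```

The bound scales linearly with absmax, which is why a single large outlier degrades precision for every element of the tensor.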
