Principle:Bitsandbytes foundation Bitsandbytes Int8 Vectorwise Quantization

From Leeroopedia


Metadata

Field Value
Sources Paper: LLM.int8(), Paper: 8-Bit Approximations for Parallelism in Deep Learning, Repo: bitsandbytes
Domains Quantization
Last updated 2026-02-07 14:00 GMT

Overview

A per-row quantization scheme that maps floating-point weight tensors to INT8 using independent scaling factors per vector (row), forming the core quantization primitive of the LLM.int8() algorithm.

Description

Vectorwise quantization (also called row-wise quantization) quantizes each row of a weight matrix independently to the INT8 range [-127, 127]. Each row has its own scaling factor computed from the maximum absolute value in that row.

Key characteristics:

  • Per-row independence: Each row of the weight matrix is scaled independently. This means rows with small values get fine-grained quantization, while rows with large values use a coarser but still valid mapping.
  • Distinction from blockwise quantization: Vectorwise quantization operates on full rows of the weight matrix. This is fundamentally different from the blockwise quantization used in 4-bit (NF4/FP4) schemes, where the weight tensor is divided into fixed-size blocks (e.g., 64 elements) regardless of row boundaries.
  • Precision preservation: Because each row has its own scale, vectorwise quantization preserves more precision for weight matrices with rows of varying magnitudes compared to a global scaling approach.
  • Outlier decomposition (optional): When a threshold parameter is provided (threshold > 0), the quantization function also performs sparse decomposition. It identifies columns where any element exceeds the threshold in absolute value. These outlier columns are extracted and returned separately for FP16 computation, while the remaining columns are quantized to INT8.
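
The per-row scheme described above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the bitsandbytes GPU kernel; the function names here are hypothetical:

```python
import numpy as np

def vectorwise_quant(W: np.ndarray):
    """Quantize each row of W to INT8 with an independent scale.

    Simplified sketch of row-wise quantization; the real bitsandbytes
    kernel runs on GPU and handles additional edge cases.
    """
    # Per-row absolute maximum, kept as a column vector for broadcasting.
    absmax = np.abs(W).max(axis=1, keepdims=True)
    # One scale per row; guard against all-zero rows.
    scale = np.where(absmax == 0, 1.0, absmax / 127.0)
    # Quantized values land in [-127, 127].
    W_int8 = np.round(W / scale).astype(np.int8)
    return W_int8, scale

def vectorwise_dequant(W_int8: np.ndarray, scale: np.ndarray):
    # Reconstruction: W_approx[i, j] = W_int8[i, j] * scale[i]
    return W_int8.astype(np.float32) * scale

# Rows with very different magnitudes each get a well-fitted scale.
W = np.array([[0.1, -0.05, 0.02],
              [10.0, -20.0, 5.0]], dtype=np.float32)
W_q, s = vectorwise_quant(W)
W_hat = vectorwise_dequant(W_q, s)
# Per-element error in row i is bounded by max(|W_i|) / 254.
```

Note how the small-valued first row keeps its own fine-grained scale instead of being flattened by the large second row, which is the point of per-row independence.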

The quantization is applied in two contexts:

  1. Weight quantization: Applied once during model loading / device transfer. The INT8 weights and scaling factors are stored and reused across forward passes.
  2. Activation quantization: Applied during each forward pass to quantize the input activations before INT8 matrix multiplication.
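
To see how the two contexts combine, here is a schematic INT8 linear layer in NumPy (an illustration of the data flow, not the actual CUDA path): weights are quantized once, activations on every call, and the integer accumulator is rescaled by the product of the two per-row scales.

```python
import numpy as np

def quant_rows(A: np.ndarray):
    # Row-wise INT8 quantization, one scale per row (illustrative sketch).
    absmax = np.abs(A).max(axis=1, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax / 127.0)
    return np.round(A / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)  # weight matrix
W_q, w_scale = quant_rows(W)                    # done once, at load time

def int8_linear(x: np.ndarray):
    # Done on every forward pass: quantize the activations row-wise.
    x_q, x_scale = quant_rows(x)
    # INT8 x INT8 matmul accumulated in INT32.
    acc = x_q.astype(np.int32) @ W_q.T.astype(np.int32)
    # Rescale: each output element picks up one activation-row scale
    # and one weight-row scale.
    return acc.astype(np.float32) * x_scale * w_scale.T

x = rng.normal(size=(2, 8)).astype(np.float32)
y_int8 = int8_linear(x)
y_fp32 = x @ W.T
# The two results agree up to quantization error.
```

Splitting the cost this way is why weight quantization happens at load time while activation quantization is paid on every forward pass.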

Usage

Vectorwise INT8 quantization is applied internally in the following scenarios:

  • During weight transfer to GPU: When an 8-bit model layer (Linear8bitLt) is moved to a CUDA device, its Int8Params._quantize() method calls int8_vectorwise_quant to quantize the weights.
  • During forward pass activation quantization: The MatMul8bitLt.forward() method calls int8_vectorwise_quant on the input activations to quantize them before INT8 matmul.
  • During training with FP16 master weights: When has_fp16_weights=True and the model is in training mode, the FP16 weights are kept as master copies and re-quantized to INT8 on each forward pass.

Theoretical Basis

Row-wise quantization formula:

For each row i of a matrix W:

scale_i = max(|W_i|) / 127
W_int8[i] = round(W[i] / scale_i)

Or equivalently:

scale_i = max(|W_i|) / 127
W_int8[i,j] = round(W[i,j] * 127 / max(|W_i|))

Dequantization:

W_approx[i,j] = W_int8[i,j] * scale_i

Outlier decomposition (when threshold > 0):

  1. Identify outlier columns: outlier_cols = {j : exists i such that |W[i,j]| > threshold}
  2. Extract outlier sub-matrix in FP16: W_outlier = W[:, outlier_cols] (kept in FP16)
  3. Suppress outliers in the main matrix: W[:, outlier_cols] = 0
  4. Quantize the remaining (outlier-suppressed) matrix to INT8
  5. Return the INT8 tensor, per-row scales, and the list of outlier column indices
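
The five steps above can be sketched in NumPy as follows (illustrative only; the library performs this on GPU and the helper name here is hypothetical):

```python
import numpy as np

def quant_with_outliers(W: np.ndarray, threshold: float = 6.0):
    """Row-wise INT8 quantization with column-wise outlier decomposition.

    Illustrative sketch of the decomposition steps; not the
    bitsandbytes implementation.
    """
    W = W.astype(np.float32).copy()
    # 1. Columns where any element exceeds the threshold in magnitude.
    outlier_cols = np.where(np.any(np.abs(W) > threshold, axis=0))[0]
    # 2. Extract the outlier sub-matrix, kept in higher precision.
    W_outlier = W[:, outlier_cols].astype(np.float16)
    # 3. Zero the outlier columns in the main matrix.
    W[:, outlier_cols] = 0.0
    # 4. Row-wise INT8 quantization of the remainder.
    absmax = np.abs(W).max(axis=1, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax / 127.0)
    W_int8 = np.round(W / scale).astype(np.int8)
    # 5. Return INT8 tensor, per-row scales, and outlier column indices.
    return W_int8, scale, outlier_cols, W_outlier

W = np.array([[ 0.5, 8.0, -0.3],
              [-0.2, 0.1,  0.4]], dtype=np.float32)
W_q, s, cols, W_out = quant_with_outliers(W, threshold=6.0)
# Column 1 (containing 8.0) is routed to FP16; the rest is INT8.
```

Because the outlier columns are zeroed before the per-row absmax is computed, a single extreme value no longer inflates the scale for its entire row.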

Quantization error bound:

For a single row with n elements and dynamic range R = max(|W_i|), the maximum quantization error per element is:

error_max = R / (2 * 127) = max(|W_i|) / 254

This bound is tighter than that of global quantization, where a single range max(|W|) over the entire matrix sets the step size for every row, so small-valued rows inherit the error of the largest row.
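
A quick numerical check of this claim (NumPy, illustrative): quantize a matrix whose rows have very different dynamic ranges, once with per-row scales and once with a single global scale, and compare the worst-case reconstruction error per row.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two rows with very different dynamic ranges.
W = np.vstack([rng.normal(scale=0.01, size=(1, 64)),
               rng.normal(scale=10.0, size=(1, 64))]).astype(np.float32)

# Per-row (vectorwise) quantization and reconstruction.
row_scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_row = np.round(W / row_scale) * row_scale

# Global quantization: one scale for the whole matrix.
g_scale = np.abs(W).max() / 127.0
W_glob = np.round(W / g_scale) * g_scale

row_err = np.abs(W - W_row).max(axis=1)
glob_err = np.abs(W - W_glob).max(axis=1)
# For the small-valued row 0, the vectorwise error is tiny, while the
# global scheme's step size is set by row 1 and rounds most of row 0
# to zero.
```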
