Principle: Bitsandbytes 8-bit Quantization Configuration
Metadata
| Field | Value |
|---|---|
| Sources | Paper: LLM.int8(), Repo: bitsandbytes |
| Domains | Quantization, NLP |
| Last updated | 2026-02-07 14:00 GMT |
Overview
Configuration of 8-bit integer quantization parameters for memory-efficient model inference using the LLM.int8() mixed-precision decomposition scheme.
Description
LLM.int8() uses a mixed-precision decomposition approach to quantize large language models for inference with minimal accuracy loss. The scheme works as follows:
- Row-wise (vectorwise) INT8 quantization: Most weights in the model are quantized to INT8 precision. Each row of a weight matrix is independently scaled to the INT8 range [-127, 127] using its maximum absolute value as the scaling factor.
- Outlier feature handling: Certain feature columns contain values with unusually large magnitudes (outliers). These columns are identified by comparing against a configurable threshold (default 6.0). Outlier columns are kept in FP16 precision to preserve model accuracy.
- Threshold parameter (`llm_int8_threshold`): Controls the sensitivity of outlier detection. A lower threshold classifies more features as outliers (more FP16, higher accuracy, more memory). A higher threshold classifies fewer features as outliers (more INT8, lower memory, potentially less accuracy).
- FP16 weight retention (`llm_int8_has_fp16_weight`): Controls whether FP16 copies of the weights are retained alongside the INT8 quantized weights. When set to `True`, the original FP16 weights are kept in memory, enabling fine-tuning of the quantized model. When `False` (the default for inference), only the INT8 weights and scaling factors are stored, minimizing memory usage.
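Both parameters above are exposed through Hugging Face's `BitsAndBytesConfig`. A minimal sketch, assuming `transformers` and `bitsandbytes` are installed; the model id is only an illustrative placeholder:

```python
# Sketch: enabling LLM.int8() quantization when loading a model.
# Assumes transformers + bitsandbytes are installed and a GPU is available;
# "facebook/opt-350m" is just an example model id.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,               # quantize linear layers with LLM.int8()
    llm_int8_threshold=6.0,          # outlier-detection threshold (default 6.0)
    llm_int8_has_fp16_weight=False,  # inference-only: don't keep FP16 copies
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quant_config,
    device_map="auto",
)
```

Setting `llm_int8_has_fp16_weight=True` instead would keep the FP16 weights alongside the INT8 ones, trading memory for the ability to fine-tune.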
Usage
8-bit quantization configuration is used when deploying large language models that barely fit in available GPU memory. It is typically applied for inference workloads where the goal is to reduce the model memory footprint by approximately 50% (FP16 to INT8) while preserving model quality.
Compared to 4-bit quantization, 8-bit LLM.int8() provides:
- Less aggressive compression (roughly 2x vs 4x reduction)
- Better accuracy preservation due to the mixed-precision outlier decomposition
- No need for calibration data
- Straightforward integration via configuration objects
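The roughly 2x figure follows from simple byte accounting. A back-of-the-envelope sketch for a hypothetical 7B-parameter model (ignoring the small overhead of per-row scaling factors and FP16 outlier columns):

```python
# Memory footprint comparison for a hypothetical 7B-parameter model.
params = 7_000_000_000

fp16_bytes = params * 2  # 2 bytes per FP16 weight
int8_bytes = params * 1  # 1 byte per INT8 weight (scales/outliers ignored)

print(fp16_bytes / 1e9)           # 14.0 (GB)
print(int8_bytes / 1e9)           # 7.0  (GB)
print(int8_bytes / fp16_bytes)    # 0.5 -> ~2x reduction
```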
Theoretical Basis
The theoretical foundation of 8-bit quantization configuration rests on three key concepts:
1. Vectorwise INT8 quantization:
For each row i of a weight matrix W:
- Compute the scaling factor: `scale_i = max(|W_i|) / 127`
- Quantize each element: `W_int8[i,j] = round(W[i,j] / scale_i)`
- Store the INT8 values and the per-row scaling factors separately
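The per-row quantize/dequantize round trip can be sketched in a few lines of NumPy (a toy illustration of the math, not the bitsandbytes kernel):

```python
import numpy as np

def quantize_rowwise(W: np.ndarray):
    """Per-row absmax INT8 quantization (step 1 above)."""
    # scale_i = max(|W_i|) / 127, one scale per row
    scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
    # W_int8[i,j] = round(W[i,j] / scale_i), values land in [-127, 127]
    W_int8 = np.round(W / scales).astype(np.int8)
    return W_int8, scales

def dequantize_rowwise(W_int8: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original FP matrix."""
    return W_int8.astype(np.float32) * scales

W = np.random.randn(4, 8).astype(np.float32)
W_int8, scales = quantize_rowwise(W)
W_hat = dequantize_rowwise(W_int8, scales)
print(np.abs(W - W_hat).max())  # small per-element reconstruction error
```

The worst-case per-element error is half a quantization step, i.e. `scale_i / 2` for row `i`.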
2. Sparse outlier decomposition:
Identify outlier columns j where any element exceeds the threshold:
`outlier_cols = {j : exists i such that |W[i,j]| > threshold}`
- Extract these columns into a separate FP16 sub-matrix
- The remaining columns are quantized to INT8
3. Combined computation:
The final output combines both precision paths:
output = INT8_matmul(activations_non_outlier, weights_non_outlier)
+ FP16_matmul(activations_outlier, weights_outlier)
In practice, only approximately 0.1% of features in typical large language models are outliers, so the FP16 overhead is minimal while the accuracy benefit is significant.
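The three steps above can be combined into a small NumPy sketch. This is a toy illustration under simplifying assumptions, not the fused CUDA kernel bitsandbytes actually runs: the outlier mask is computed from the activation matrix (matching the feature-outlier description above), and absmax scaling is per-tensor rather than per-row for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
threshold = 6.0

X = rng.normal(size=(2, 8)).astype(np.float32)  # activations
W = rng.normal(size=(8, 4)).astype(np.float32)  # weights
X[:, 3] = 12.0  # inject one outlier feature column for demonstration

# Step 2: feature columns whose magnitude exceeds the threshold stay in FP16.
outlier = np.abs(X).max(axis=0) > threshold  # boolean mask over feature dims

# Step 1: absmax INT8 quantization of the non-outlier slices.
def quantize(A):
    scale = np.abs(A).max() / 127.0
    return np.round(A / scale).astype(np.int8), scale

Xq, sx = quantize(X[:, ~outlier])
Wq, sw = quantize(W[~outlier, :])

# Step 3: INT8 matmul (accumulated in int32, then rescaled) + FP16 outlier path.
int8_part = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
fp16_part = X[:, outlier] @ W[outlier, :]
out = int8_part + fp16_part

print(np.abs(out - X @ W).max())  # close to the full-precision product
```

Because the large-magnitude column never passes through INT8, the quantization error stays bounded by the (small) non-outlier scales.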