Principle: Bitsandbytes 8-bit Quantization Configuration
Metadata
| Field | Value |
|---|---|
| Sources | Paper: LLM.int8(), Repo: bitsandbytes |
| Domains | Quantization, NLP |
| Last updated | 2026-02-07 14:00 GMT |
Overview
Configuration of 8-bit integer quantization parameters for memory-efficient model inference using the LLM.int8() mixed-precision decomposition scheme.
Description
LLM.int8() uses a mixed-precision decomposition approach to quantize large language models for inference with minimal accuracy loss. The scheme works as follows:
- Row-wise (vectorwise) INT8 quantization: Most weights in the model are quantized to INT8 precision. Each row of a weight matrix is independently scaled to the INT8 range [-127, 127] using its maximum absolute value as the scaling factor.
- Outlier feature handling: Certain feature columns contain values with unusually large magnitudes (outliers). These columns are identified by comparing against a configurable threshold (default 6.0). Outlier columns are kept in FP16 precision to preserve model accuracy.
- Threshold parameter (`llm_int8_threshold`): Controls the sensitivity of outlier detection. A lower threshold classifies more features as outliers (more FP16, higher accuracy, more memory). A higher threshold classifies fewer features as outliers (more INT8, lower memory, potentially less accuracy).
- FP16 weight retention (`llm_int8_has_fp16_weight`): Controls whether FP16 copies of the weights are retained alongside the INT8 quantized weights. When set to `True`, the original FP16 weights are kept in memory, enabling fine-tuning of the quantized model. When `False` (the default for inference), only the INT8 weights and scaling factors are stored, minimizing memory usage.
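Both parameters above are exposed through Hugging Face's `BitsAndBytesConfig`. A minimal sketch, assuming `transformers` and `bitsandbytes` are installed; the model id is only an illustrative placeholder:

```python
# Sketch: enabling LLM.int8() quantization when loading a model.
# Assumes transformers + bitsandbytes are installed and a GPU is available;
# "facebook/opt-350m" is just an example model id.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,               # quantize linear layers with LLM.int8()
    llm_int8_threshold=6.0,          # outlier-detection threshold (default 6.0)
    llm_int8_has_fp16_weight=False,  # inference-only: don't keep FP16 copies
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quant_config,
    device_map="auto",
)
```

Setting `llm_int8_has_fp16_weight=True` instead would keep the FP16 weights alongside the INT8 ones, trading memory for the ability to fine-tune.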
Usage
8-bit quantization configuration is used when deploying large language models that barely fit in available GPU memory. It is typically applied for inference workloads where the goal is to reduce the model memory footprint by approximately 50% (FP16 to INT8) while preserving model quality.
Compared to 4-bit quantization, 8-bit LLM.int8() provides:
- Less aggressive compression (roughly 2x vs 4x reduction)
- Better accuracy preservation due to the mixed-precision outlier decomposition
- No need for calibration data
- Straightforward integration via configuration objects
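The roughly 2x figure follows from simple byte accounting. A back-of-the-envelope sketch for a hypothetical 7B-parameter model (ignoring the small overhead of per-row scaling factors and FP16 outlier columns):

```python
# Memory footprint comparison for a hypothetical 7B-parameter model.
params = 7_000_000_000

fp16_bytes = params * 2  # 2 bytes per FP16 weight
int8_bytes = params * 1  # 1 byte per INT8 weight (scales/outliers ignored)

print(fp16_bytes / 1e9)           # 14.0 (GB)
print(int8_bytes / 1e9)           # 7.0  (GB)
print(int8_bytes / fp16_bytes)    # 0.5 -> ~2x reduction
```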
Theoretical Basis
The theoretical foundation of 8-bit quantization configuration rests on three key concepts:
1. Vectorwise INT8 quantization:
For each row i of a weight matrix W:
- Compute the scaling factor: `scale_i = max(|W_i|) / 127`
- Quantize each element: `W_int8[i,j] = round(W[i,j] / scale_i)`
- Store the INT8 values and the per-row scaling factors separately
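The per-row quantize/dequantize round trip can be sketched in a few lines of NumPy (a toy illustration of the math, not the bitsandbytes kernel):

```python
import numpy as np

def quantize_rowwise(W: np.ndarray):
    """Per-row absmax INT8 quantization (step 1 above)."""
    # scale_i = max(|W_i|) / 127, one scale per row
    scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
    # W_int8[i,j] = round(W[i,j] / scale_i), values land in [-127, 127]
    W_int8 = np.round(W / scales).astype(np.int8)
    return W_int8, scales

def dequantize_rowwise(W_int8: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original FP matrix."""
    return W_int8.astype(np.float32) * scales

W = np.random.randn(4, 8).astype(np.float32)
W_int8, scales = quantize_rowwise(W)
W_hat = dequantize_rowwise(W_int8, scales)
print(np.abs(W - W_hat).max())  # small per-element reconstruction error
```

The worst-case per-element error is half a quantization step, i.e. `scale_i / 2` for row `i`.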
2. Sparse outlier decomposition:
Identify outlier columns j where any element exceeds the threshold:
`outlier_cols = {j : exists i such that |W[i,j]| > threshold}`
- Extract these columns into a separate FP16 sub-matrix
- The remaining columns are quantized to INT8
3. Combined computation:
The final output combines both precision paths:
output = INT8_matmul(activations_non_outlier, weights_non_outlier)
+ FP16_matmul(activations_outlier, weights_outlier)
In practice, only approximately 0.1% of features in typical large language models are outliers, so the FP16 overhead is minimal while the accuracy benefit is significant.
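The three steps above can be combined into a small NumPy sketch. This is a toy illustration under simplifying assumptions, not the fused CUDA kernel bitsandbytes actually runs: the outlier mask is computed from the activation matrix (matching the feature-outlier description above), and absmax scaling is per-tensor rather than per-row for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
threshold = 6.0

X = rng.normal(size=(2, 8)).astype(np.float32)  # activations
W = rng.normal(size=(8, 4)).astype(np.float32)  # weights
X[:, 3] = 12.0  # inject one outlier feature column for demonstration

# Step 2: feature columns whose magnitude exceeds the threshold stay in FP16.
outlier = np.abs(X).max(axis=0) > threshold  # boolean mask over feature dims

# Step 1: absmax INT8 quantization of the non-outlier slices.
def quantize(A):
    scale = np.abs(A).max() / 127.0
    return np.round(A / scale).astype(np.int8), scale

Xq, sx = quantize(X[:, ~outlier])
Wq, sw = quantize(W[~outlier, :])

# Step 3: INT8 matmul (accumulated in int32, then rescaled) + FP16 outlier path.
int8_part = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
fp16_part = X[:, outlier] @ W[outlier, :]
out = int8_part + fp16_part

print(np.abs(out - X @ W).max())  # close to the full-precision product
```

Because the large-magnitude column never passes through INT8, the quantization error stays bounded by the (small) non-outlier scales.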