Principle: Bitsandbytes Mixed-Precision Matmul With Outlier Decomposition
Metadata
| Field | Value |
|---|---|
| Sources | Paper: LLM.int8(), Repo: bitsandbytes |
| Domains | Quantization, Linear_Algebra |
| Last updated | 2026-02-07 14:00 GMT |
Overview
A matrix multiplication strategy that decomposes computation into INT8 and FP16 paths based on feature outlier magnitude, forming the core algorithm of LLM.int8() for preserving model quality during quantized inference.
Description
Mixed-precision matmul with outlier decomposition is the core algorithm of LLM.int8(). It addresses the fundamental challenge of quantized inference: certain feature dimensions in transformer models contain values with unusually large magnitudes (outliers) that, if naively quantized to INT8, cause significant accuracy degradation.
The algorithm proceeds in five steps during the forward pass:
Step 1 -- Quantize activations:
The input activation tensor A is quantized row-wise to INT8 using int8_vectorwise_quant. As a side effect, outlier columns are identified and suppressed (zeroed) in the quantized output.
Step 2 -- Identify outlier columns:
If threshold > 0, columns where any activation value exceeds the threshold in absolute magnitude are flagged as outlier columns. The indices of these columns are returned by the quantization function.
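Steps 1 and 2 can be sketched in plain PyTorch. The function name `quantize_rowwise_with_outliers` is illustrative and not the actual `int8_vectorwise_quant` kernel; it only mirrors the behavior described above:

```python
import torch

def quantize_rowwise_with_outliers(A: torch.Tensor, threshold: float = 6.0):
    """Illustrative row-wise INT8 quantization with outlier-column detection.

    Outlier columns (any |value| > threshold) are zeroed before quantization
    and their indices are returned separately, mirroring the behavior
    described for int8_vectorwise_quant.
    """
    A = A.float()
    # Step 2: columns where any activation exceeds the threshold in magnitude.
    outlier_cols = (A.abs() > threshold).any(dim=0).nonzero(as_tuple=True)[0]
    # Suppress outlier columns so they do not distort the row scales.
    A_clean = A.clone()
    A_clean[:, outlier_cols] = 0.0
    # Step 1: per-row absmax scaling into the INT8 range [-127, 127].
    row_absmax = A_clean.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    A_int8 = torch.round(A_clean / row_absmax * 127).to(torch.int8)
    return A_int8, row_absmax.squeeze(1), outlier_cols

A = torch.tensor([[0.5, 8.0, -1.0],
                  [1.0, 0.2,  2.0]])
A_int8, scales, outliers = quantize_rowwise_with_outliers(A, threshold=6.0)
# Column 1 contains 8.0 > 6.0, so it is flagged and zeroed in A_int8.
```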
Step 3 -- FP16 path for outliers:
For the outlier dimensions, sub-matrices are extracted from both the activations and weights in FP16 precision. A standard FP16 matrix multiplication is performed on these sub-matrices, yielding an accurate partial result for the outlier features.
Step 4 -- INT8 path for non-outliers:
For the remaining (non-outlier) dimensions, an INT8 scaled matrix multiplication is performed. Both the quantized activations and quantized weights are in INT8, and the result is dequantized using the per-row scaling factors from both operands.
Step 5 -- Combine results:
The FP16 partial result (outliers) and the dequantized INT8 partial result (non-outliers) are summed to produce the final output tensor.
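The five steps above can be sketched end to end in plain PyTorch. This is a minimal sketch, not the library's fused GPU kernels, and the INT8 arithmetic is simulated in float for portability:

```python
import torch

def mixed_precision_matmul(A: torch.Tensor, W: torch.Tensor, threshold: float = 6.0):
    """Sketch of the decomposed forward pass.

    A: (batch, in_features) activations; W: (out_features, in_features)
    weights. Illustrates the decomposition only; INT8 math is simulated.
    """
    # Steps 1-2: flag outlier columns of the activations.
    mask = (A.abs() > threshold).any(dim=0)
    # Step 3: FP16 path -- exact matmul on the outlier sub-matrices.
    Y_outlier = A[:, mask] @ W[:, mask].T
    # Step 4: INT8 path on the non-outlier columns (row-wise absmax scales).
    A_n, W_n = A[:, ~mask], W[:, ~mask]
    s_a = A_n.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    s_w = W_n.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    A_i8 = torch.round(A_n / s_a * 127)
    W_i8 = torch.round(W_n / s_w * 127)
    Y_int8 = (A_i8 @ W_i8.T) * (s_a @ s_w.T) / (127 * 127)
    # Step 5: sum the two partial results.
    return Y_outlier + Y_int8

torch.manual_seed(0)
A = torch.randn(4, 8)
A[0, 3] = 12.0                      # plant an outlier feature in column 3
W = torch.randn(5, 8)
Y = mixed_precision_matmul(A, W, threshold=6.0)
```

Because the outlier column bypasses quantization entirely, the result stays close to the full-precision product even though that column's magnitude would wreck a naive per-row INT8 scale.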
Backward pass:
For gradient computation, the INT8 weights are dequantized back to the training dtype. The gradient with respect to the input is computed via standard matmul with the dequantized weights. If has_fp16_weights=True, the gradient with respect to the weights is also computed.
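A minimal sketch of the input-gradient computation described above. The helper name `int8_linear_backward` is illustrative; in the library this logic lives inside the autograd function backing the layer:

```python
import torch

def int8_linear_backward(grad_output: torch.Tensor,
                         W_int8: torch.Tensor,
                         scale_W: torch.Tensor,
                         compute_dtype=torch.float32):
    """Input gradient through an INT8 linear layer (illustrative helper).

    Dequantizes the row-wise quantized weights back to the training dtype,
    then applies a standard matmul: dL/dA = dL/dY @ W.
    """
    W = W_int8.to(compute_dtype) * scale_W[:, None].to(compute_dtype) / 127
    return grad_output.to(compute_dtype) @ W

# Quantize a weight matrix row-wise, then backpropagate a gradient through it.
torch.manual_seed(0)
W = torch.randn(5, 8)
scale_W = W.abs().amax(dim=1)
W_int8 = torch.round(W / scale_W[:, None] * 127).to(torch.int8)
grad_out = torch.randn(4, 5)
grad_A = int8_linear_backward(grad_out, W_int8, scale_W)
```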
Usage
Mixed-precision matmul is automatically applied during the forward pass of Linear8bitLt layers. The behavior is controlled by the threshold parameter:
- threshold = 0.0: No outlier decomposition. All computation is done in INT8 via int8_scaled_mm. Faster, but potentially less accurate for models with significant outlier features.
- threshold > 0.0 (e.g., 6.0): Outlier decomposition is enabled. Features exceeding the threshold are computed in FP16 via int8_mixed_scaled_mm. Slower due to the additional FP16 path, but significantly more accurate.
Threshold tuning:
- Higher threshold: Fewer features classified as outliers. More computation in INT8. Faster execution but potentially lower accuracy.
- Lower threshold: More features classified as outliers. More computation in FP16. Higher accuracy but slower and more memory usage.
- Default (6.0): Empirically determined to work well for most large language models, where ~0.1% of features are outliers.
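The tradeoff can be seen directly by counting how many feature columns a given threshold flags. This is a synthetic sketch; real outlier statistics depend on the model's activations:

```python
import torch

def outlier_fraction(A: torch.Tensor, threshold: float) -> float:
    """Fraction of feature columns flagged as outliers at a given threshold."""
    return (A.abs() > threshold).any(dim=0).float().mean().item()

torch.manual_seed(0)
A = torch.randn(64, 1024)
A[:, :2] *= 20                  # plant two heavy-tailed "outlier" features
lo = outlier_fraction(A, 3.0)   # lower threshold: more columns go to FP16
hi = outlier_fraction(A, 6.0)   # higher threshold: fewer columns go to FP16
```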
Theoretical Basis
Mathematical decomposition:
The output Y of a linear layer is decomposed as:
Y = A @ W^T
Let O denote the set of outlier column indices and N the set of non-outlier column indices. The computation is split:
Y = A[:, O] @ W[:, O]^T + A[:, N] @ W[:, N]^T
    |--- FP16 path ---|   |--- INT8 path ---|
Validity of decomposition:
This decomposition is mathematically exact because matrix multiplication distributes over column/row splitting. For any column partition {O, N} of the shared dimension:
A @ W^T = A_O @ W_O^T + A_N @ W_N^T
The approximation error comes solely from the INT8 quantization of the non-outlier path, not from the decomposition itself.
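A quick numerical check of this exactness, done in float64 to rule out quantization effects (the column indices chosen here are arbitrary; the identity holds for any partition):

```python
import torch

torch.manual_seed(0)
A = torch.randn(4, 8, dtype=torch.float64)
W = torch.randn(5, 8, dtype=torch.float64)
O = [1, 5]                                 # arbitrary "outlier" columns
N = [c for c in range(8) if c not in O]    # the complementary columns
Y_full = A @ W.T
# Splitting the shared dimension and summing the partial products
# reproduces the full product up to floating-point rounding only.
Y_split = A[:, O] @ W[:, O].T + A[:, N] @ W[:, N].T
```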
INT8 scaled matmul:
For the non-outlier path, the dequantized result is:
Y_N[i,j] = sum_k( A_int8[i,k] * W_int8[j,k] ) * (scale_A[i] * scale_W[j]) / (127 * 127)
This is computed efficiently using INT8 GEMM hardware (e.g., Tensor Cores on NVIDIA GPUs) followed by FP32 scaling.
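The same formula in simulated form. The quantized values are held in int64 tensors so PyTorch's CPU integer matmul can stand in for the INT8 GEMM with its INT32 accumulator; the real kernel dispatches to hardware instead:

```python
import torch

torch.manual_seed(0)
A = torch.randn(4, 8)
W = torch.randn(5, 8)
scale_A = A.abs().amax(dim=1)                       # per-row absmax of A
scale_W = W.abs().amax(dim=1)                       # per-row absmax of W
A_i8 = torch.round(A / scale_A[:, None] * 127).to(torch.int64)
W_i8 = torch.round(W / scale_W[:, None] * 127).to(torch.int64)
acc = A_i8 @ W_i8.T                                 # integer accumulation
# Y[i,j] = acc[i,j] * scale_A[i] * scale_W[j] / (127 * 127)
Y = acc.to(torch.float32) * (scale_A[:, None] * scale_W[None, :]) / (127 * 127)
```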
Outlier sparsity:
In practice, only approximately 0.1% of features in typical large language models are outliers (magnitude > 6.0). This means the FP16 overhead is minimal -- a small sub-matrix multiplication on ~0.1% of the feature dimensions -- while the accuracy benefit is significant because these outlier features carry disproportionate information.