# Heuristic: Bitsandbytes Outlier Threshold Detection
| Knowledge Sources | |
|---|---|
| Domains | Quantization, LLMs |
| Last Updated | 2026-02-07 13:00 GMT |
## Overview
LLM.int8() outlier detection uses threshold-based decomposition: values exceeding a configurable threshold (default 6.0) are kept in fp16 while the rest are quantized to int8, preserving model accuracy.
## Description
Large language models contain outlier features — activation values with magnitudes far larger than the rest. Standard INT8 quantization clips these values, causing significant accuracy loss. LLM.int8() addresses this by detecting outlier columns and handling them in full fp16 precision while quantizing the remaining values to int8. The detection uses two complementary approaches: a z-score statistical test (configurable, default zscore=3 for input layers, zscore=4 for weights) and a magnitude threshold (values > 6.0). Outliers are pooled across layers via `GlobalOutlierPooler` to improve detection consistency in smaller models.
## Usage
Apply this heuristic when using 8-bit LLM.int8() inference (Linear8bitLt with `threshold > 0.0`). If you observe accuracy degradation with INT8 quantization, adjusting the threshold can help. Setting `threshold=0.0` disables outlier decomposition entirely (pure INT8). Higher thresholds detect fewer outliers; lower thresholds detect more.
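The threshold's effect on how many columns stay in fp16 can be illustrated with a small simulation. This is a hedged sketch using synthetic numpy data, not the bitsandbytes detection code itself; the injected outlier columns and the "any value in the column exceeds the threshold" rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated fp16 activations: mostly N(0, 1) values, plus three injected
# outlier feature dimensions mimicking the heavy tails seen in large LLMs.
X = rng.normal(0.0, 1.0, size=(64, 512)).astype(np.float16)
X[:, [3, 97, 200]] *= 12.0

counts = {}
for threshold in (0.0, 6.0, 10.0):
    if threshold > 0.0:
        # Treat a column as an outlier if any activation in it exceeds the threshold
        outlier_cols = np.where((np.abs(X) > threshold).any(axis=0))[0]
    else:
        # threshold=0.0 disables decomposition: no columns are kept in fp16
        outlier_cols = np.array([], dtype=np.int64)
    counts[threshold] = len(outlier_cols)

print(counts)  # higher thresholds flag the same or fewer columns
```

With the default `threshold=6.0` all three injected columns are flagged; raising the threshold can only shrink that set, and `threshold=0.0` yields none, matching the pure-INT8 fast path.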
## The Insight (Rule of Thumb)
- Action: Set `threshold=6.0` (default) in `Linear8bitLt` for standard outlier-aware quantization.
- Value: threshold=6.0 is the recommended default. zscore=3 for input layers (aggressive), zscore=4 for weight layers (conservative).
- Trade-off: Enabling outlier decomposition (threshold > 0) adds computational overhead from mixed-precision matmul but recovers 0.5-1% accuracy. Setting threshold=0.0 disables it for maximum speed at cost of accuracy.
- Cross-layer pooling: `GlobalOutlierPooler` aggregates outlier dimensions across layers; particularly important for smaller models (below roughly 6.7B parameters, the scale at which the LLM.int8() paper observed outlier features becoming systematic).
- FFN skip: The pooler does NOT track outlier columns from the second FFN layer (different feature distribution).
## Reasoning
The mixed-precision decomposition separates the matmul into two parts: the main INT8 matmul on non-outlier columns, and a smaller FP16 matmul on outlier columns. The results are summed. This preserves the precision of extreme values while still benefiting from INT8 acceleration for the majority of computations.
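The accuracy benefit of this split can be demonstrated with a toy numpy sketch. This uses simplified per-tensor absmax quantization (the real LLM.int8() kernels use per-row/per-column scales and fused CUDA ops), so it only illustrates the effect, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(8, 64))
X[:, 5] *= 20.0                          # one outlier activation column
W = rng.normal(0.0, 1.0, size=(64, 32))

def quant_int8(M):
    # Simplified per-tensor absmax quantization (illustrative only)
    scale = np.abs(M).max() / 127.0
    return np.clip(np.round(M / scale), -127, 127).astype(np.int8), scale

ref = X @ W  # full-precision reference

# Pure INT8: the outlier column inflates the scale, crushing small values
Xq, sx = quant_int8(X)
Wq, sw = quant_int8(W)
pure = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)

# Mixed decomposition: INT8 matmul on non-outlier columns,
# full-precision matmul on outlier columns, results summed
out_cols = np.where(np.abs(X).max(axis=0) > 6.0)[0]
keep = np.setdiff1d(np.arange(X.shape[1]), out_cols)
Xq2, sx2 = quant_int8(X[:, keep])
Wq2, sw2 = quant_int8(W[keep, :])
mixed = (Xq2.astype(np.int32) @ Wq2.astype(np.int32)) * (sx2 * sw2) \
        + X[:, out_cols] @ W[out_cols, :]

err_pure = np.abs(pure - ref).mean()
err_mixed = np.abs(mixed - ref).mean()
print(err_pure, err_mixed)  # mixed error is substantially smaller
```

The pure-INT8 path suffers because the single extreme column forces a coarse quantization step for every value; excluding it lets the remaining columns use a much finer step.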
Outlier pooling from `bitsandbytes/autograd/_functions.py:15-48`:
"""
This class pools outlier dimensions across layers.
This is particularly important for small models where outlier features
are less systematic and occur with low frequency.
"""
class GlobalOutlierPooler:
def add_outliers(self, outlier_idx, feature_dim):
if self.model_dim is None:
self.model_dim = feature_dim
if feature_dim != self.model_dim:
return # we do not encode outliers for the 2nd FFN layer
self.outliers.update(outlier_idx.tolist())
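The excerpt omits the class's initialization and singleton machinery. A minimal, self-contained re-creation of the pooling behaviour (a hypothetical `PoolerSketch`, using plain lists instead of tensors and no singleton access; not the real bitsandbytes API) might look like:

```python
class PoolerSketch:
    # Pools outlier dimensions across layers, skipping any layer whose
    # feature dim differs from the first one seen (e.g. the 2nd FFN layer).
    def __init__(self):
        self.model_dim = None
        self.outliers = set()

    def add_outliers(self, outlier_idx, feature_dim):
        if self.model_dim is None:
            self.model_dim = feature_dim
        if feature_dim != self.model_dim:
            return  # different feature distribution: not pooled
        self.outliers.update(outlier_idx)

pooler = PoolerSketch()
pooler.add_outliers([3, 97], 4096)    # attention layer, model dim 4096
pooler.add_outliers([97, 200], 4096)  # another layer, same dim: pooled
pooler.add_outliers([5], 11008)       # 2nd FFN layer, different dim: ignored
print(sorted(pooler.outliers))  # [3, 97, 200]
```

Pooling means an outlier dimension flagged in any one layer is treated as an outlier everywhere, which stabilizes detection when individual layers only flag it intermittently.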
Threshold-based decomposition from `bitsandbytes/autograd/_functions.py:152-171`:
```python
# Handle sparse decomposition
if state.threshold > 0.0:
    state.idx = outlier_cols

    # Mixed Int8 Matmul + Dequant + Bias
    output, subA = torch.ops.bitsandbytes.int8_mixed_scaled_mm(
        A, CA, state.CB, SCA, state.SCB, outlier_cols, bias,
    )
else:
    # Int8 Matmul + Dequant + Bias
    output = torch.ops.bitsandbytes.int8_scaled_mm.default(
        CA, state.CB, SCA, state.SCB, bias=bias, dtype=A.dtype
    )
```
Dual outlier detection from `bitsandbytes/utils.py:25-35`:
```python
# (1) zscore test of std of hidden dimension with zscore=3
outlier_idx = find_outlier_dims(merged, reduction_dim=1, zscore=3)
# (2) magnitude > 6 test
dims = (torch.abs(input[0]) > 6).sum(dim=list(range(len(input[0].shape) - 1)))
outlier_idx2 = torch.where(dims > 0)[0]
outlier_idx = torch.cat([outlier_idx, outlier_idx2]).unique()
```
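The two tests are complementary: the z-score test catches dimensions whose overall spread is anomalous, while the magnitude test catches isolated spikes. A numpy sketch of the combined logic (the `find_outlier_dims_sketch` helper is an assumed analogue of bitsandbytes' `find_outlier_dims`, not its exact implementation):

```python
import numpy as np

def find_outlier_dims_sketch(H, zscore=3.0):
    # z-score test on the per-dimension std of the hidden states
    stds = H.std(axis=0)
    z = (stds - stds.mean()) / stds.std()
    return np.where(z > zscore)[0]

rng = np.random.default_rng(2)
H = rng.normal(0.0, 1.0, size=(256, 128))
H[:, 7] *= 15.0   # high-variance dimension: caught by the z-score test
H[0, 42] = 8.0    # single spike: caught only by the magnitude test

idx_z = find_outlier_dims_sketch(H, zscore=3.0)            # (1) z-score test
idx_mag = np.where((np.abs(H) > 6.0).sum(axis=0) > 0)[0]   # (2) magnitude > 6 test
outlier_idx = np.union1d(idx_z, idx_mag)
print(outlier_idx)
```

Dimension 42 barely moves the per-dimension std, so the z-score test alone would miss it; the union of both tests flags it anyway, mirroring the `torch.cat(...).unique()` merge in the excerpt above.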