Principle: Bitsandbytes LLM.int8() Linear Layer
Metadata
| Field | Value |
|---|---|
| Sources | Paper: LLM.int8(), Repo: bitsandbytes |
| Domains | Quantization, Model_Architecture |
| Last updated | 2026-02-07 14:00 GMT |
Overview
A quantized neural network layer that stores weights in INT8 precision and uses mixed-precision decomposition for outlier features, serving as a drop-in replacement for standard nn.Linear layers in the LLM.int8() inference scheme.
Description
The 8-bit linear layer replaces `torch.nn.Linear` with a layer that stores its weights as `Int8Params`, a custom parameter class that automatically quantizes weights when transferred to a CUDA device.
The key mechanisms are:
- Weight storage as Int8Params: On construction, the layer wraps its weight tensor in an `Int8Params` object. This custom parameter class overrides the `.to(device)` method to trigger row-wise INT8 quantization when the tensor is moved to a GPU.
- Automatic quantization on device transfer: When `Int8Params._quantize()` is called (via `.to(device)`), it uses `int8_vectorwise_quant` to quantize the weight matrix. This produces the quantized weight tensor (`CB`) and per-row scaling factors (`SCB`), which are stored as attributes on the parameter.
- Per-layer state tracking via MatmulLtState: Each layer maintains a `MatmulLtState` dataclass instance that tracks:
  - `CB`: the INT8 quantized weight matrix
  - `SCB`: per-row scaling factors (float32)
  - `threshold`: the outlier detection threshold
  - `has_fp16_weights`: whether FP16 copies are retained
  - `is_training`: current training/inference mode
  - `idx`: outlier column indices (when threshold > 0)
- Optional FP16 weight retention: Unlike 4-bit layers, 8-bit layers can optionally keep FP16 weight copies (`has_fp16_weights=True`). This enables fine-tuning of quantized models because gradients can be computed and applied to the FP16 weights, while the INT8 version is recomputed for the forward pass.
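The quantize-on-transfer pattern can be sketched in plain Python. This is an illustrative stand-in for `Int8Params`, not the real class: the names `Int8ParamsSketch` and `quantize_rowwise` are hypothetical, and lists play the role of tensors.

```python
def quantize_rowwise(weight):
    """Row-wise absmax INT8 quantization: returns (CB, SCB)."""
    CB, SCB = [], []
    for row in weight:
        absmax = max(abs(x) for x in row) or 1.0  # guard against all-zero rows
        CB.append([round(x * 127 / absmax) for x in row])
        SCB.append(absmax)
    return CB, SCB

class Int8ParamsSketch:
    """Mimics the Int8Params idea: quantization is deferred until .to('cuda')."""
    def __init__(self, weight):
        self.data = weight          # full-precision weights
        self.CB = None              # INT8 weights, filled on device transfer
        self.SCB = None             # per-row scales (absmax values)

    def to(self, device):
        if device.startswith("cuda") and self.CB is None:
            self.CB, self.SCB = quantize_rowwise(self.data)
            self.data = None        # drop the full-precision copy
        return self

p = Int8ParamsSketch([[0.1, -0.4], [2.0, 0.5]])
p.to("cuda")
print(p.CB)   # [[32, -127], [127, 32]]
```

The real class does this inside PyTorch's parameter machinery, but the shape of the mechanism is the same: moving to a CUDA device is the trigger, and `CB`/`SCB` live on as attributes afterwards.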
During the forward pass, the layer delegates computation to bnb.matmul(), which dispatches to the MatMul8bitLt autograd function for mixed-precision INT8/FP16 matrix multiplication.
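The decomposition itself can be illustrated with a minimal pure-Python sketch: outlier columns of the input are handled in full precision while the rest go through the INT8 path. Function names are hypothetical, lists stand in for tensors, and as a simplification the row scales here are computed over full rows, outlier columns included.

```python
def absmax_rows(M):
    return [max(abs(v) for v in row) or 1.0 for row in M]

def quantize(M, scales):
    return [[round(v * 127 / s) for v in row] for row, s in zip(M, scales)]

def matmul_int8_decomp(X, Wt, threshold=6.0):
    """y = X @ Wt^T (nn.Linear layout: Wt is (out_features, in_features))."""
    k = len(X[0])
    # columns whose activation magnitude crosses the threshold go to the FP path
    out_cols = [j for j in range(k) if max(abs(row[j]) for row in X) >= threshold]
    reg_cols = [j for j in range(k) if j not in out_cols]
    sx, sw = absmax_rows(X), absmax_rows(Wt)
    Xq, Wq = quantize(X, sx), quantize(Wt, sw)
    y = []
    for i, xrow in enumerate(X):
        yrow = []
        for j, wrow in enumerate(Wt):
            acc = sum(Xq[i][c] * Wq[j][c] for c in reg_cols)   # INT8 path
            val = acc * sx[i] * sw[j] / (127 * 127)            # dequantize
            val += sum(xrow[c] * wrow[c] for c in out_cols)    # FP path
            yrow.append(val)
        y.append(yrow)
    return y

X = [[1.0, 8.0], [0.5, -0.25]]       # column 1 contains an outlier (8.0)
Wt = [[0.2, 0.1], [-0.3, 0.4]]
print(matmul_int8_decomp(X, Wt))     # close to the exact X @ Wt^T
```

The result stays close to the exact product because the large-magnitude column never passes through the INT8 rounding step.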
Usage
The 8-bit linear layer is used when you need mixed-precision inference with good accuracy preservation. It is particularly effective for large language models where approximately 0.1% of features are outliers.
Typical use cases include:
- Inference-only deployment: Set `has_fp16_weights=False` and `threshold=6.0` for maximum memory savings with outlier handling.
- Fine-tuning with quantized weights: Set `has_fp16_weights=True` to retain FP16 weights for gradient computation.
- Memory-constrained environments: When 4-bit quantization is too aggressive but full FP16 does not fit in memory.
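To put a rough number on the memory savings in the inference-only case, the back-of-envelope arithmetic below compares INT8 weights plus one FP32 scale per row against plain FP16 storage. The 4096x4096 layer size is an assumed example, not taken from the source.

```python
# Back-of-envelope comparison for a single weight matrix (weights only).
rows, cols = 4096, 4096                       # assumed layer size
fp16_bytes = rows * cols * 2                  # 2 bytes per FP16 weight
int8_bytes = rows * cols * 1 + rows * 4       # INT8 weights + one FP32 SCB per row
print(round(fp16_bytes / int8_bytes, 3))      # just under 2x
```

The per-row SCB overhead is negligible next to the weight matrix itself, so the savings approach the ideal 2x of halving the per-weight storage.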
Theoretical Basis
The 8-bit linear layer is grounded in row-wise INT8 quantization:
Per-row quantization:
Each row of the weight matrix is quantized independently. For row i:
scale_i = max(|W_i|)   (the row's absolute maximum, stored as SCB[i])
W_int8[i] = round(W[i] * 127 / scale_i)
This row-wise approach preserves more precision than global quantization because each row uses its own dynamic range.
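This precision claim can be checked numerically. The sketch below (pure Python, illustrative names) compares roundtrip quantization error under per-row absmax scales versus a single global scale, for a matrix whose rows have very different magnitudes:

```python
def quant_err(row, scale):
    """Mean absolute roundtrip error quantizing a row with the given absmax scale."""
    deq = [round(v * 127 / scale) * scale / 127 for v in row]
    return sum(abs(a - b) for a, b in zip(row, deq)) / len(row)

W = [[0.01, -0.02, 0.015],   # small-magnitude row
     [5.0, -3.0, 4.0]]       # large-magnitude row

global_scale = max(abs(v) for row in W for v in row)              # 5.0
row_err = sum(quant_err(r, max(abs(v) for v in r)) for r in W)    # per-row scales
glob_err = sum(quant_err(r, global_scale) for r in W)             # one shared scale
print(row_err < glob_err)
```

With the global scale, the small-magnitude row collapses onto just a few INT8 levels; with its own scale it uses the full [-127, 127] range, which is exactly the advantage the text describes.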
MatmulLtState tracking:
The MatmulLtState dataclass maintains per-layer state that persists across forward passes:
- `CB` (quantized weights): `torch.Tensor` with dtype `torch.int8`
- `SCB` (scaling factors): `torch.Tensor` with dtype `torch.float32`, one value per row
- Outlier indices: column indices where feature magnitudes exceed the threshold
The state object avoids redundant re-quantization of weights across forward passes when has_fp16_weights=False, since the quantized representation is computed once during device transfer and then reused.
Dequantization for backward pass:
When gradients are needed, the INT8 weights are dequantized:
W_fp16[i] = W_int8[i] * SCB[i] / 127
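A quick numeric check of this formula, with SCB holding the per-row absolute maximum (pure Python, lists standing in for tensors):

```python
W = [[0.5, -1.5, 0.25]]                              # one FP weight row
SCB = [max(abs(v) for v in W[0])]                    # per-row absmax: 1.5
CB = [[round(v * 127 / SCB[0]) for v in W[0]]]       # INT8 row
W_deq = [[q * SCB[0] / 127 for q in CB[0]]]          # W_fp16[i] = W_int8[i] * SCB[i] / 127
print(W_deq[0])
```

Each recovered entry lands within one quantization step (SCB[i] / 127) of the original, which bounds the error introduced by the INT8 roundtrip.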