Principle: Bitsandbytes LLM.int8() Linear Layer
Metadata
| Field | Value |
|---|---|
| Sources | Paper: LLM.int8(), Repo: bitsandbytes |
| Domains | Quantization, Model_Architecture |
| Last updated | 2026-02-07 14:00 GMT |
Overview
A quantized neural network layer that stores weights in INT8 precision and uses mixed-precision decomposition for outlier features, serving as a drop-in replacement for standard nn.Linear layers in the LLM.int8() inference scheme.
Description
The 8-bit linear layer replaces `torch.nn.Linear` with a layer that stores its weights as `Int8Params`, a custom parameter class that automatically quantizes weights when transferred to a CUDA device.
The key mechanisms are:
- Weight storage as Int8Params: On construction, the layer wraps its weight tensor in an `Int8Params` object. This custom parameter class overrides the `.to(device)` method to trigger row-wise INT8 quantization when the tensor is moved to a GPU.
- Automatic quantization on device transfer: When `Int8Params._quantize()` is called (via `.to(device)`), it uses `int8_vectorwise_quant` to quantize the weight matrix. This produces the quantized weight tensor (`CB`) and per-row scaling factors (`SCB`), which are stored as attributes on the parameter.
- Per-layer state tracking via MatmulLtState: Each layer maintains a `MatmulLtState` dataclass instance that tracks:
  - `CB`: the INT8 quantized weight matrix
  - `SCB`: per-row scaling factors (float32)
  - `threshold`: the outlier detection threshold
  - `has_fp16_weights`: whether FP16 copies are retained
  - `is_training`: current training/inference mode
  - `idx`: outlier column indices (when threshold > 0)
- Optional FP16 weight retention: Unlike 4-bit layers, 8-bit layers can optionally keep FP16 weight copies (`has_fp16_weights=True`). This enables fine-tuning of quantized models because gradients can be computed and applied to the FP16 weights, while the INT8 version is recomputed for the forward pass.
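The quantize-on-transfer pattern can be sketched in plain Python. This is an illustrative stand-in for `Int8Params`, not the real class: the names `Int8ParamsSketch` and `quantize_rowwise` are hypothetical, and lists play the role of tensors.

```python
def quantize_rowwise(weight):
    """Row-wise absmax INT8 quantization: returns (CB, SCB)."""
    CB, SCB = [], []
    for row in weight:
        absmax = max(abs(x) for x in row) or 1.0  # guard against all-zero rows
        CB.append([round(x * 127 / absmax) for x in row])
        SCB.append(absmax)
    return CB, SCB

class Int8ParamsSketch:
    """Mimics the Int8Params idea: quantization is deferred until .to('cuda')."""
    def __init__(self, weight):
        self.data = weight          # full-precision weights
        self.CB = None              # INT8 weights, filled on device transfer
        self.SCB = None             # per-row scales (absmax values)

    def to(self, device):
        if device.startswith("cuda") and self.CB is None:
            self.CB, self.SCB = quantize_rowwise(self.data)
            self.data = None        # drop the full-precision copy
        return self

p = Int8ParamsSketch([[0.1, -0.4], [2.0, 0.5]])
p.to("cuda")
print(p.CB)   # [[32, -127], [127, 32]]
```

The real class does this inside PyTorch's parameter machinery, but the shape of the mechanism is the same: moving to a CUDA device is the trigger, and `CB`/`SCB` live on as attributes afterwards.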
During the forward pass, the layer delegates computation to bnb.matmul(), which dispatches to the MatMul8bitLt autograd function for mixed-precision INT8/FP16 matrix multiplication.
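The decomposition itself can be illustrated with a minimal pure-Python sketch: outlier columns of the input are handled in full precision while the rest go through the INT8 path. Function names are hypothetical, lists stand in for tensors, and as a simplification the row scales here are computed over full rows, outlier columns included.

```python
def absmax_rows(M):
    return [max(abs(v) for v in row) or 1.0 for row in M]

def quantize(M, scales):
    return [[round(v * 127 / s) for v in row] for row, s in zip(M, scales)]

def matmul_int8_decomp(X, Wt, threshold=6.0):
    """y = X @ Wt^T (nn.Linear layout: Wt is (out_features, in_features))."""
    k = len(X[0])
    # columns whose activation magnitude crosses the threshold go to the FP path
    out_cols = [j for j in range(k) if max(abs(row[j]) for row in X) >= threshold]
    reg_cols = [j for j in range(k) if j not in out_cols]
    sx, sw = absmax_rows(X), absmax_rows(Wt)
    Xq, Wq = quantize(X, sx), quantize(Wt, sw)
    y = []
    for i, xrow in enumerate(X):
        yrow = []
        for j, wrow in enumerate(Wt):
            acc = sum(Xq[i][c] * Wq[j][c] for c in reg_cols)   # INT8 path
            val = acc * sx[i] * sw[j] / (127 * 127)            # dequantize
            val += sum(xrow[c] * wrow[c] for c in out_cols)    # FP path
            yrow.append(val)
        y.append(yrow)
    return y

X = [[1.0, 8.0], [0.5, -0.25]]       # column 1 contains an outlier (8.0)
Wt = [[0.2, 0.1], [-0.3, 0.4]]
print(matmul_int8_decomp(X, Wt))     # close to the exact X @ Wt^T
```

The result stays close to the exact product because the large-magnitude column never passes through the INT8 rounding step.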
Usage
The 8-bit linear layer is used when you need mixed-precision inference with good accuracy preservation. It is particularly effective for large language models where approximately 0.1% of features are outliers.
Typical use cases include:
- Inference-only deployment: Set `has_fp16_weights=False` and `threshold=6.0` for maximum memory savings with outlier handling.
- Fine-tuning with quantized weights: Set `has_fp16_weights=True` to retain FP16 weights for gradient computation.
- Memory-constrained environments: When 4-bit quantization is too aggressive but full FP16 does not fit in memory.
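To put a rough number on the memory savings in the inference-only case, the back-of-envelope arithmetic below compares INT8 weights plus one FP32 scale per row against plain FP16 storage. The 4096x4096 layer size is an assumed example, not taken from the source.

```python
# Back-of-envelope comparison for a single weight matrix (weights only).
rows, cols = 4096, 4096                       # assumed layer size
fp16_bytes = rows * cols * 2                  # 2 bytes per FP16 weight
int8_bytes = rows * cols * 1 + rows * 4       # INT8 weights + one FP32 SCB per row
print(round(fp16_bytes / int8_bytes, 3))      # just under 2x
```

The per-row SCB overhead is negligible next to the weight matrix itself, so the savings approach the ideal 2x of halving the per-weight storage.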
Theoretical Basis
The 8-bit linear layer is grounded in row-wise INT8 quantization:
Per-row quantization:
Each row of the weight matrix is quantized independently. For row i:
scale_i = max(|W_i|)   (the row's absolute maximum, stored as SCB[i])
W_int8[i] = round(W[i] * 127 / scale_i)
This row-wise approach preserves more precision than global quantization because each row uses its own dynamic range.
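This precision claim can be checked numerically. The sketch below (pure Python, illustrative names) compares roundtrip quantization error under per-row absmax scales versus a single global scale, for a matrix whose rows have very different magnitudes:

```python
def quant_err(row, scale):
    """Mean absolute roundtrip error quantizing a row with the given absmax scale."""
    deq = [round(v * 127 / scale) * scale / 127 for v in row]
    return sum(abs(a - b) for a, b in zip(row, deq)) / len(row)

W = [[0.01, -0.02, 0.015],   # small-magnitude row
     [5.0, -3.0, 4.0]]       # large-magnitude row

global_scale = max(abs(v) for row in W for v in row)              # 5.0
row_err = sum(quant_err(r, max(abs(v) for v in r)) for r in W)    # per-row scales
glob_err = sum(quant_err(r, global_scale) for r in W)             # one shared scale
print(row_err < glob_err)
```

With the global scale, the small-magnitude row collapses onto just a few INT8 levels; with its own scale it uses the full [-127, 127] range, which is exactly the advantage the text describes.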
MatmulLtState tracking:
The MatmulLtState dataclass maintains per-layer state that persists across forward passes:
- `CB` (quantized weights): `torch.Tensor` with dtype `torch.int8`
- `SCB` (scaling factors): `torch.Tensor` with dtype `torch.float32`, one value per row
- Outlier indices: column indices where feature magnitudes exceed the threshold
The state object avoids redundant re-quantization of weights across forward passes when has_fp16_weights=False, since the quantized representation is computed once during device transfer and then reused.
Dequantization for backward pass:
When gradients are needed, the INT8 weights are dequantized:
W_fp16[i] = W_int8[i] * SCB[i] / 127
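A quick numeric check of this formula, with SCB holding the per-row absolute maximum (pure Python, lists standing in for tensors):

```python
W = [[0.5, -1.5, 0.25]]                              # one FP weight row
SCB = [max(abs(v) for v in W[0])]                    # per-row absmax: 1.5
CB = [[round(v * 127 / SCB[0]) for v in W[0]]]       # INT8 row
W_deq = [[q * SCB[0] / 127 for q in CB[0]]]          # W_fp16[i] = W_int8[i] * SCB[i] / 127
print(W_deq[0])
```

Each recovered entry lands within one quantization step (SCB[i] / 127) of the original, which bounds the error introduced by the INT8 roundtrip.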