Implementation: Bitsandbytes Linear8bitLt
Metadata
| Field | Value |
|---|---|
| Sources | Repo: bitsandbytes, Paper: LLM.int8() |
| Domains | Quantization, Model_Architecture |
| Type | API Doc |
| Last updated | 2026-02-07 14:00 GMT |
Overview
An 8-bit quantized linear layer with LLM.int8() mixed-precision decomposition, provided by the bitsandbytes library.
Description
Linear8bitLt extends torch.nn.Linear to provide an 8-bit quantized linear layer. On construction, the layer wraps its weight in an Int8Params parameter. When the module is moved to a CUDA device via .to(device), the Int8Params._quantize() method calls int8_vectorwise_quant to quantize the weights to INT8 with per-row scaling factors.
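The per-row scaling step can be illustrated with a small NumPy sketch (illustrative only, not the library's CUDA kernel; the function names here are made up): each weight row is mapped to INT8 against its absolute maximum, which is kept as that row's scaling factor.

```python
import numpy as np

np.random.seed(0)

def vectorwise_quant_sketch(W):
    # Illustrative stand-in for per-row INT8 quantization: one absmax per row.
    absmax = np.abs(W).max(axis=1, keepdims=True)            # per-row scale
    W_int8 = np.clip(np.round(W * 127.0 / absmax), -127, 127).astype(np.int8)
    return W_int8, absmax.squeeze(1)                         # CB-like, SCB-like

def dequant_sketch(W_int8, absmax):
    # Invert the mapping: int8 value * row absmax / 127.
    return W_int8.astype(np.float32) * absmax[:, None] / 127.0

W = np.random.randn(4, 8).astype(np.float32)
CB, SCB = vectorwise_quant_sketch(W)
W_hat = dequant_sketch(CB, SCB)

# Round-trip error is at most half a quantization step per row.
assert np.max(np.abs(W - W_hat)) <= SCB.max() / (2 * 127.0) + 1e-6
```

The per-row absmax values play the role of the SCB scaling factors described below; the INT8 matrix plays the role of CB.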
The forward pass calls bnb.matmul(), which dispatches to the MatMul8bitLt autograd function. This function performs the LLM.int8() mixed-precision decomposition: INT8 scaled matmul for non-outlier features and FP16 matmul for outlier features (when threshold > 0).
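The decomposition itself can be sketched in NumPy (a model of the algorithm, not the library's kernels): input feature columns whose absolute maximum exceeds the threshold take the floating-point path, the remaining columns go through the INT8 path with vector-wise scales, and the two partial results are summed.

```python
import numpy as np

np.random.seed(0)
threshold = 6.0

X = np.random.randn(2, 8).astype(np.float32)
X[:, 3] = 20.0                              # plant one outlier feature column
W = np.random.randn(8, 4).astype(np.float32)

# Outlier detection: feature columns whose absmax exceeds the threshold.
outliers = np.abs(X).max(axis=0) > threshold

def quant_rows(A):
    # Per-row absmax INT8 quantization (vector-wise).
    s = np.abs(A).max(axis=1, keepdims=True)
    return np.clip(np.round(A * 127.0 / s), -127, 127).astype(np.int8), s

# FP path: outlier columns of X against the matching rows of W.
fp_part = X[:, outliers] @ W[outliers, :]

# INT8 path: quantize X row-wise and W column-wise, accumulate in int32.
Xq, sx = quant_rows(X[:, ~outliers])
Wq, sw = quant_rows(W[~outliers, :].T)      # rows of W.T = columns of W
acc = Xq.astype(np.int32) @ Wq.T.astype(np.int32)
int8_part = acc.astype(np.float32) * (sx @ sw.T) / (127.0 * 127.0)

result = int8_part + fp_part
assert np.max(np.abs(result - X @ W)) < 0.5  # close to the FP reference
```

Keeping the outlier column in floating point is what preserves accuracy here: routing its large magnitudes through the INT8 path would blow up the per-row scales and crush the precision of every other feature in the same row.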
Key behaviors:
- Quantization on device transfer: Weights are quantized when `.to(cuda_device)` is called, not at construction time.
- State management: A `MatmulLtState` instance tracks the quantized weights (CB), scaling factors (SCB), threshold, and outlier indices.
- FP16 weight mode: When `has_fp16_weights=True`, original FP16 weights are retained for fine-tuning. When `False`, only INT8 weights and scales are stored.
- Bias handling: Bias is cast to match the input dtype before computation.
Code Reference
- Source: bitsandbytes repo
- File: `bitsandbytes/nn/modules.py`, lines 937-1101
- Import: `from bitsandbytes.nn import Linear8bitLt`
- Signature:

```python
class Linear8bitLt(nn.Linear):
    def __init__(
        self,
        input_features: int,
        output_features: int,
        bias=True,
        has_fp16_weights=True,
        threshold=0.0,
        index=None,
        device=None,
    ):
```
I/O Contract
Constructor Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| input_features | int | Yes | -- | Number of input features of the linear layer. |
| output_features | int | Yes | -- | Number of output features of the linear layer. |
| bias | bool | No | True | Whether the linear layer uses a bias term. |
| has_fp16_weights | bool | No | True | If True, retains FP16 weights for fine-tuning. If False, only INT8 weights are stored. |
| threshold | float | No | 0.0 | Outlier detection threshold. Set to 6.0 to enable LLM.int8() mixed-precision decomposition. 0.0 disables outlier handling. |
| index | optional | No | None | Optional index for tensor parallelism. |
| device | optional | No | None | Device for initial parameter allocation. |
Forward Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| x | torch.Tensor | Yes | Input activation tensor. |
Forward Outputs
| Output | Type | Description |
|---|---|---|
| result | torch.Tensor | Output activation tensor with shape (*input_shape[:-1], output_features). |
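The output shape above is the standard linear-layer contract; a quick NumPy sketch (illustrative only) of how the last dimension is mapped:

```python
import numpy as np

batch, seq, in_features, out_features = 2, 5, 64, 32
x = np.random.randn(batch, seq, in_features)
W = np.random.randn(out_features, in_features)  # nn.Linear stores (out, in)

y = x @ W.T
# All leading dims pass through; only the last becomes out_features.
assert y.shape == (batch, seq, out_features)
```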
Usage Examples
Create a Linear8bitLt layer, quantize, and run a forward pass:
```python
import torch
import torch.nn as nn
import bitsandbytes as bnb
from bitsandbytes.nn import Linear8bitLt

# Create an FP16 model and an 8-bit model with matching architecture
fp16_model = nn.Sequential(
    nn.Linear(64, 64),
    nn.Linear(64, 64),
)
int8_model = nn.Sequential(
    Linear8bitLt(64, 64, has_fp16_weights=False, threshold=6.0),
    Linear8bitLt(64, 64, has_fp16_weights=False, threshold=6.0),
)

# Load FP16 weights into the 8-bit model before the device transfer
int8_model.load_state_dict(fp16_model.state_dict())

# Quantization happens on device transfer
int8_model = int8_model.to(0)  # Move to GPU, triggers INT8 quantization

# Forward pass uses mixed-precision INT8/FP16 matmul
x = torch.randn(1, 64, dtype=torch.float16, device="cuda")
output = int8_model(x)
```