Implementation: Bitsandbytes Linear8bitLt
Metadata
| Field | Value |
|---|---|
| Sources | Repo: bitsandbytes, Paper: LLM.int8() |
| Domains | Quantization, Model_Architecture |
| Type | API Doc |
| Last updated | 2026-02-07 14:00 GMT |
Overview
An 8-bit quantized linear layer with LLM.int8() mixed-precision decomposition, provided by the bitsandbytes library.
Description
Linear8bitLt extends torch.nn.Linear to provide an 8-bit quantized linear layer. On construction, the layer wraps its weight in an Int8Params parameter. When the module is moved to a CUDA device via .to(device), the Int8Params._quantize() method calls int8_vectorwise_quant to quantize the weights to INT8 with per-row scaling factors.
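The per-row scaling step can be illustrated with a small NumPy sketch (illustrative only, not the library's CUDA kernel; the function names here are made up): each weight row is mapped to INT8 against its absolute maximum, which is kept as that row's scaling factor.

```python
import numpy as np

np.random.seed(0)

def vectorwise_quant_sketch(W):
    # Illustrative stand-in for per-row INT8 quantization: one absmax per row.
    absmax = np.abs(W).max(axis=1, keepdims=True)            # per-row scale
    W_int8 = np.clip(np.round(W * 127.0 / absmax), -127, 127).astype(np.int8)
    return W_int8, absmax.squeeze(1)                         # CB-like, SCB-like

def dequant_sketch(W_int8, absmax):
    # Invert the mapping: int8 value * row absmax / 127.
    return W_int8.astype(np.float32) * absmax[:, None] / 127.0

W = np.random.randn(4, 8).astype(np.float32)
CB, SCB = vectorwise_quant_sketch(W)
W_hat = dequant_sketch(CB, SCB)

# Round-trip error is at most half a quantization step per row.
assert np.max(np.abs(W - W_hat)) <= SCB.max() / (2 * 127.0) + 1e-6
```

The per-row absmax values play the role of the SCB scaling factors described below; the INT8 matrix plays the role of CB.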
The forward pass calls bnb.matmul(), which dispatches to the MatMul8bitLt autograd function. This function performs the LLM.int8() mixed-precision decomposition: INT8 scaled matmul for non-outlier features and FP16 matmul for outlier features (when threshold > 0).
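The decomposition itself can be sketched in NumPy (a model of the algorithm, not the library's kernels): input feature columns whose absolute maximum exceeds the threshold take the floating-point path, the remaining columns go through the INT8 path with vector-wise scales, and the two partial results are summed.

```python
import numpy as np

np.random.seed(0)
threshold = 6.0

X = np.random.randn(2, 8).astype(np.float32)
X[:, 3] = 20.0                              # plant one outlier feature column
W = np.random.randn(8, 4).astype(np.float32)

# Outlier detection: feature columns whose absmax exceeds the threshold.
outliers = np.abs(X).max(axis=0) > threshold

def quant_rows(A):
    # Per-row absmax INT8 quantization (vector-wise).
    s = np.abs(A).max(axis=1, keepdims=True)
    return np.clip(np.round(A * 127.0 / s), -127, 127).astype(np.int8), s

# FP path: outlier columns of X against the matching rows of W.
fp_part = X[:, outliers] @ W[outliers, :]

# INT8 path: quantize X row-wise and W column-wise, accumulate in int32.
Xq, sx = quant_rows(X[:, ~outliers])
Wq, sw = quant_rows(W[~outliers, :].T)      # rows of W.T = columns of W
acc = Xq.astype(np.int32) @ Wq.T.astype(np.int32)
int8_part = acc.astype(np.float32) * (sx @ sw.T) / (127.0 * 127.0)

result = int8_part + fp_part
assert np.max(np.abs(result - X @ W)) < 0.5  # close to the FP reference
```

Keeping the outlier column in floating point is what preserves accuracy here: routing its large magnitudes through the INT8 path would blow up the per-row scales and crush the precision of every other feature in the same row.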
Key behaviors:
- Quantization on device transfer: Weights are quantized when `.to(cuda_device)` is called, not at construction time.
- State management: A `MatmulLtState` instance tracks the quantized weights (CB), scaling factors (SCB), threshold, and outlier indices.
- FP16 weight mode: When `has_fp16_weights=True`, original FP16 weights are retained for fine-tuning. When `False`, only INT8 weights and scales are stored.
- Bias handling: Bias is cast to match the input dtype before computation.
Code Reference
- Source: bitsandbytes repo
- File: `bitsandbytes/nn/modules.py`, lines 937-1101
- Import: `from bitsandbytes.nn import Linear8bitLt`
- Signature:

```python
class Linear8bitLt(nn.Linear):
    def __init__(
        self,
        input_features: int,
        output_features: int,
        bias=True,
        has_fp16_weights=True,
        threshold=0.0,
        index=None,
        device=None,
    ):
```
I/O Contract
Constructor Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| input_features | int | Yes | -- | Number of input features of the linear layer. |
| output_features | int | Yes | -- | Number of output features of the linear layer. |
| bias | bool | No | True | Whether the linear layer uses a bias term. |
| has_fp16_weights | bool | No | True | If True, retains FP16 weights for fine-tuning. If False, only INT8 weights are stored. |
| threshold | float | No | 0.0 | Outlier detection threshold. Set to 6.0 to enable LLM.int8() mixed-precision decomposition. 0.0 disables outlier handling. |
| index | optional | No | None | Optional index for tensor parallelism. |
| device | optional | No | None | Device for initial parameter allocation. |
Forward Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| x | torch.Tensor | Yes | Input activation tensor. |
Forward Outputs
| Output | Type | Description |
|---|---|---|
| result | torch.Tensor | Output activation tensor with shape (*input_shape[:-1], output_features). |
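The output shape above is the standard linear-layer contract; a quick NumPy sketch (illustrative only) of how the last dimension is mapped:

```python
import numpy as np

batch, seq, in_features, out_features = 2, 5, 64, 32
x = np.random.randn(batch, seq, in_features)
W = np.random.randn(out_features, in_features)  # nn.Linear stores (out, in)

y = x @ W.T
# All leading dims pass through; only the last becomes out_features.
assert y.shape == (batch, seq, out_features)
```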
Usage Examples
Create a Linear8bitLt layer, quantize, and run a forward pass:
```python
import torch
import torch.nn as nn
import bitsandbytes as bnb
from bitsandbytes.nn import Linear8bitLt

# Create an FP16 model and an 8-bit model with matching architecture
fp16_model = nn.Sequential(
    nn.Linear(64, 64),
    nn.Linear(64, 64),
)
int8_model = nn.Sequential(
    Linear8bitLt(64, 64, has_fp16_weights=False, threshold=6.0),
    Linear8bitLt(64, 64, has_fp16_weights=False, threshold=6.0),
)

# Load FP16 weights into the 8-bit model before the device transfer
int8_model.load_state_dict(fp16_model.state_dict())

# Quantization happens on device transfer
int8_model = int8_model.to(0)  # Move to GPU, triggers INT8 quantization

# Forward pass uses mixed-precision INT8/FP16 matmul
x = torch.randn(1, 64, dtype=torch.float16, device="cuda")
output = int8_model(x)
```