
Implementation:Bitsandbytes foundation Bitsandbytes Linear8bitLt

From Leeroopedia


Metadata

| Field        | Value                                    |
|--------------|------------------------------------------|
| Sources      | Repo: bitsandbytes, Paper: LLM.int8()    |
| Domains      | Quantization, Model_Architecture         |
| Type         | API Doc                                  |
| Last updated | 2026-02-07 14:00 GMT                     |

Overview

A concrete tool for 8-bit quantized linear computation with LLM.int8() mixed-precision decomposition, provided by the bitsandbytes library.

Description

Linear8bitLt extends torch.nn.Linear to provide an 8-bit quantized linear layer. On construction, the layer wraps its weight in an Int8Params parameter. When the module is moved to a CUDA device via .to(device), the Int8Params._quantize() method calls int8_vectorwise_quant to quantize the weights to INT8 with per-row scaling factors.
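The per-row scheme can be illustrated with a small NumPy sketch. This is a simplified approximation of the absmax scaling that int8_vectorwise_quant performs, not the library's actual CUDA kernel:

```python
import numpy as np

def vectorwise_quant(W):
    """Per-row absmax INT8 quantization (illustrative sketch only)."""
    absmax = np.abs(W).max(axis=1, keepdims=True)        # one scale basis per row
    scale = np.where(absmax == 0, 1.0, absmax / 127.0)   # avoid divide-by-zero
    W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_int8, scale

W = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
W_int8, scale = vectorwise_quant(W)
W_dequant = W_int8.astype(np.float32) * scale   # round-trip reconstruction
```

Because each row is scaled by its own absolute maximum, the rounding error per element is bounded by half that row's scale.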

The forward pass calls bnb.matmul(), which dispatches to the MatMul8bitLt autograd function. This function performs the LLM.int8() mixed-precision decomposition: INT8 scaled matmul for non-outlier features and FP16 matmul for outlier features (when threshold > 0).
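The decomposition itself can be sketched in NumPy. The helper names below are hypothetical and the INT8 path is simulated with round-trip quantization; the real implementation uses fused CUDA kernels with scales tracked by MatmulLtState:

```python
import numpy as np

def quant_rows(A):
    """Per-row absmax INT8 quantization (illustrative helper, not the real kernel)."""
    scale = np.abs(A).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(A / scale), -127, 127).astype(np.int8), scale

def llm_int8_matmul(X, W, threshold=6.0):
    """Sketch of the LLM.int8() decomposition: outlier feature columns are
    computed in floating point, the rest in INT8 with int32 accumulation."""
    outliers = np.abs(X).max(axis=0) >= threshold   # which input features are outliers
    X_reg, X_out = X * ~outliers, X * outliers      # split activations by feature
    W_reg = W * ~outliers[:, None]                  # zero outlier rows of the weight
    # INT8 path: quantize activations per row and weights per output column,
    # multiply in int32, then rescale back to floating point.
    Xq, sx = quant_rows(X_reg)
    Wq, sw = quant_rows(W_reg.T)
    int8_part = (Xq.astype(np.int32) @ Wq.T.astype(np.int32)) * (sx * sw.T)
    # Floating-point path: only the outlier features.
    fp_part = X_out @ W
    return int8_part + fp_part

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
X[:, 0] *= 100.0                    # force feature 0 to be an outlier
W = rng.standard_normal((8, 5))
out = llm_int8_matmul(X, W, threshold=6.0)
```

Summing the two partial results recovers the full matmul to within INT8 rounding error, which is the key property that lets LLM.int8() keep accuracy despite outlier activations.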

Key behaviors:

  • Quantization on device transfer: Weights are quantized when .to(cuda_device) is called, not at construction time.
  • State management: A MatmulLtState instance tracks the quantized weights (CB), scaling factors (SCB), threshold, and outlier indices.
  • FP16 weight mode: When has_fp16_weights=True, original FP16 weights are retained for fine-tuning. When False, only INT8 weights and scales are stored.
  • Bias handling: Bias is cast to match the input dtype before computation.

Code Reference

  • Source: bitsandbytes repo
  • File: bitsandbytes/nn/modules.py, Lines 937-1101
  • Import:
from bitsandbytes.nn import Linear8bitLt
  • Signature:
class Linear8bitLt(nn.Linear):
    def __init__(
        self,
        input_features: int,
        output_features: int,
        bias=True,
        has_fp16_weights=True,
        threshold=0.0,
        index=None,
        device=None,
    ):

I/O Contract

Constructor Inputs

| Parameter        | Type     | Required | Default | Description |
|------------------|----------|----------|---------|-------------|
| input_features   | int      | Yes      | --      | Number of input features of the linear layer. |
| output_features  | int      | Yes      | --      | Number of output features of the linear layer. |
| bias             | bool     | No       | True    | Whether the linear layer uses a bias term. |
| has_fp16_weights | bool     | No       | True    | If True, retains FP16 weights for fine-tuning. If False, only INT8 weights are stored. |
| threshold        | float    | No       | 0.0     | Outlier detection threshold. Set to 6.0 to enable LLM.int8() mixed-precision decomposition; 0.0 disables outlier handling. |
| index            | optional | No       | None    | Optional index for tensor parallelism. |
| device           | optional | No       | None    | Device for initial parameter allocation. |

Forward Inputs

| Parameter | Type         | Required | Description |
|-----------|--------------|----------|-------------|
| x         | torch.Tensor | Yes      | Input activation tensor. |

Forward Outputs

| Output | Type         | Description |
|--------|--------------|-------------|
| result | torch.Tensor | Output activation tensor with shape (*input_shape[:-1], output_features). |

Usage Examples

Create a Linear8bitLt layer, quantize, and run a forward pass:

import torch
import torch.nn as nn
import bitsandbytes as bnb
from bitsandbytes.nn import Linear8bitLt

# Create an FP16 model and an 8-bit model with matching architecture
fp16_model = nn.Sequential(
    nn.Linear(64, 64),
    nn.Linear(64, 64),
)

int8_model = nn.Sequential(
    Linear8bitLt(64, 64, has_fp16_weights=False, threshold=6.0),
    Linear8bitLt(64, 64, has_fp16_weights=False, threshold=6.0),
)

# Load FP16 weights into the 8-bit model
int8_model.load_state_dict(fp16_model.state_dict())

# Quantization happens on device transfer
int8_model = int8_model.to(0)  # Move to GPU, triggers INT8 quantization

# Forward pass uses mixed-precision INT8/FP16 matmul
x = torch.randn(1, 64, dtype=torch.float16, device="cuda")
output = int8_model(x)
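As a quick sanity check on the memory savings, the weight storage for the 64x64 layers above can be computed directly. This ignores the per-row scale vector (SCB) and the bias, which add a small overhead on top of the INT8 weights:

```python
# Weight storage for one 64x64 linear layer
in_features, out_features = 64, 64
fp16_bytes = in_features * out_features * 2   # 2 bytes per FP16 weight
int8_bytes = in_features * out_features * 1   # 1 byte per INT8 weight
print(fp16_bytes, int8_bytes)                 # weight memory is halved
```

For large models, where the weights dominate memory use, this roughly halves the model's footprint relative to FP16.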
