
Workflow:Bitsandbytes foundation Bitsandbytes 8bit LLM Int8 Inference

From Leeroopedia


Knowledge Sources
Domains: LLMs, Quantization, Inference
Last Updated: 2026-02-07 14:00 GMT

Overview

End-to-end process for loading a large language model in 8-bit precision using the LLM.int8() algorithm and running inference with mixed-precision outlier decomposition.

Description

This workflow implements the LLM.int8() algorithm for memory-efficient inference. The method uses vector-wise INT8 quantization for the majority of weight features, while detecting and separating outlier features (those exceeding a configurable threshold) for full-precision FP16 computation. This mixed-precision decomposition preserves model quality while halving memory requirements compared to FP16. Weights are quantized lazily when transferred to GPU, and the quantized weights plus row-wise scaling factors are stored persistently.

Usage

Execute this workflow when you need to run inference on large language models in roughly half the memory of FP16 without degrading output quality. LLM.int8() is preferred over 4-bit quantization when higher accuracy is required, or when the model will be run with has_fp16_weights=True for mixed training-and-inference scenarios.

Execution Steps

Step 1: Configure 8-bit Quantization

Set up the quantization configuration for 8-bit mode. The key parameters are the load_in_8bit flag and the optional outlier threshold. When the threshold is set to a positive value (e.g., 6.0), the LLM.int8() mixed-precision decomposition is activated, routing outlier features through FP16 matmul while the remaining features use INT8.

Key considerations:

  • Setting threshold=0.0 disables outlier decomposition (pure INT8 quantization)
  • A threshold of 6.0 is the recommended default for activating mixed-precision decomposition
  • The has_fp16_weights parameter controls whether original FP16 weights are retained alongside quantized versions
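As a concrete illustration, the configuration above can be expressed through the Hugging Face transformers integration of bitsandbytes. This is a sketch only: BitsAndBytesConfig is the transformers-side API (not part of bitsandbytes itself), and the parameter values shown simply restate the defaults discussed above.

```python
from transformers import BitsAndBytesConfig

# 8-bit quantization config: a positive llm_int8_threshold activates the
# LLM.int8() mixed-precision outlier decomposition.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,         # recommended default; 0.0 = pure INT8
    llm_int8_has_fp16_weight=False, # True keeps FP16 weights alongside INT8
)
```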

Step 2: Load Model with 8-bit Configuration

Load the pretrained model, replacing standard nn.Linear layers with Linear8bitLt layers. Each layer's weights are wrapped in Int8Params parameter objects. The MatmulLtState dataclass is initialized per-layer to track the outlier threshold, quantization buffers, and training state.

Key considerations:

  • The model loader substitutes Linear layers with Linear8bitLt automatically when load_in_8bit=True
  • A load_state_dict pre-hook (maybe_rearrange_weight) ensures compatibility with different weight storage formats
  • Max memory constraints can be specified per GPU to control layer distribution
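Assuming the transformers integration again (the model id and memory limits below are illustrative placeholders, not part of this workflow), loading with the 8-bit configuration and per-device memory caps looks roughly like:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

# from_pretrained substitutes nn.Linear with bitsandbytes Linear8bitLt
# layers when an 8-bit quantization_config is supplied.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",                       # example model id
    quantization_config=quant_config,
    device_map="auto",                         # distribute layers across devices
    max_memory={0: "10GiB", "cpu": "30GiB"},   # illustrative per-device caps
)
```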

Step 3: Transfer to Device (Triggers INT8 Quantization)

When weights are moved to a GPU device, Int8Params.to() triggers quantization. The FP16 weights are quantized using vector-wise quantization: each row is scaled to fit in the INT8 range ([-127, 127]) using per-row absmax scaling factors. The quantized INT8 weights (CB) and scaling factors (SCB) are stored, and the original FP16 weights are discarded (unless has_fp16_weights=True).

Key considerations:

  • Quantization uses int8_vectorwise_quant which computes per-row scaling factors
  • The CB (quantized weights) and SCB (scaling factors) are initially stored in the Int8Params object
  • On first forward pass, init_8bit_state() transfers CB/SCB from the weight to the MatmulLtState
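The row-wise quantization performed in this step can be sketched in plain NumPy. This illustrates the scheme, not the bitsandbytes CUDA implementation; the names CB and SCB mirror the terminology above but the functions are defined here for clarity.

```python
import numpy as np

def int8_vectorwise_quant(W: np.ndarray):
    """Quantize each row of W to INT8 using per-row absmax scaling."""
    SCB = np.abs(W).max(axis=1)             # row-wise scaling factors (absmax)
    SCB = np.where(SCB == 0, 1.0, SCB)      # guard against all-zero rows
    CB = np.clip(np.round(W / SCB[:, None] * 127), -127, 127).astype(np.int8)
    return CB, SCB

def dequantize(CB: np.ndarray, SCB: np.ndarray):
    """Recover an FP approximation; per-element error is at most SCB/254."""
    return CB.astype(np.float32) * SCB[:, None] / 127
```

Each row's largest-magnitude element maps to exactly ±127, so the full INT8 range is used per row.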

Step 4: Prepare Input Tokens

Tokenize the input text and move the token IDs to the model's device. This step is identical across quantization methods.

Key considerations:

  • Ensure the tokenizer matches the model architecture
  • Input tensors must reside on the same device as the model

Step 5: Run Forward Pass (Mixed-Precision Matmul)

During the forward pass, each Linear8bitLt layer performs the LLM.int8() matmul. The input activations are quantized to INT8, then multiplied with the stored INT8 weights. If the outlier threshold is active, the MatMul8bitLt autograd function first identifies outlier dimensions in the input (features with absolute values exceeding the threshold), routes those through a separate FP16 matmul path, and combines the results. The output is the sum of the INT8 matmul result (for normal features) and the FP16 matmul result (for outlier features).

Key considerations:

  • The bnb.matmul function dispatches to the appropriate backend (CUDA native or fallback)
  • Outlier detection occurs on the activation tensor, not the weights
  • The GlobalOutlierPooler can pool outlier dimensions across layers for consistency
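The decomposition in this step can be sketched end to end in NumPy. This is a simplified illustration of the algorithm, not the actual CUDA kernels: activations are quantized per row, weights per column, and any input feature dimension whose magnitude exceeds the threshold bypasses quantization through a full-precision matmul.

```python
import numpy as np

def _vectorwise_quant(x, axis):
    absmax = np.abs(x).max(axis=axis, keepdims=True)
    absmax = np.where(absmax == 0, 1.0, absmax)
    q = np.clip(np.round(x / absmax * 127), -127, 127).astype(np.int8)
    return q, absmax

def llm_int8_matmul(A, W, threshold=6.0):
    """Mixed-precision A @ W: INT8 for regular features, FP for outliers."""
    outliers = np.abs(A).max(axis=0) > threshold   # outlier dims of the input
    regular = ~outliers
    # INT8 path: quantize A row-wise and W column-wise, multiply, rescale
    Aq, a_scale = _vectorwise_quant(A[:, regular], axis=1)
    Wq, w_scale = _vectorwise_quant(W[regular, :], axis=0)
    int8_part = (Aq.astype(np.int32) @ Wq.astype(np.int32)).astype(np.float32)
    int8_part *= (a_scale * w_scale) / (127.0 * 127.0)
    # FP path: outlier dimensions are computed in full precision
    fp_part = A[:, outliers] @ W[outliers, :]
    return int8_part + fp_part
```

Note that outlier detection looks at the activation tensor A, matching the consideration above: a single large activation value routes its entire feature dimension through the FP path.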

Step 6: Generate Output

Use the model's generate method to produce output tokens. Each autoregressive step triggers the mixed-precision forward pass through all Linear8bitLt layers.

Key considerations:

  • Output quality is preserved compared to FP16 thanks to the outlier decomposition
  • Memory savings of approximately 50% compared to FP16 inference
  • The generated token IDs are decoded back to text using the tokenizer
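Putting steps 4 through 6 together, a hedged end-to-end sketch using the transformers API (the model id and prompt are placeholders; the loading call repeats the earlier step so the snippet stands alone):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # example model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0),
    device_map="auto",
)

# Step 4: tokenize and move inputs onto the model's device
inputs = tokenizer("The key idea behind LLM.int8() is", return_tensors="pt").to(model.device)

# Steps 5-6: each generated token runs the mixed-precision forward pass
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```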

Execution Diagram

GitHub URL

Workflow Repository