Workflow:Bitsandbytes foundation Bitsandbytes 4bit QLoRA Inference

From Leeroopedia


Knowledge Sources
Domains LLMs, Quantization, Inference
Last Updated 2026-02-07 14:00 GMT

Overview

End-to-end process for loading a large language model in 4-bit precision (NF4 or FP4) and running memory-efficient inference using bitsandbytes quantization.

Description

This workflow covers the complete path from a pretrained model on disk (or Hugging Face Hub) to generating text in 4-bit quantized mode. The 4-bit quantization algorithm from QLoRA uses block-wise NormalFloat (NF4) or floating-point (FP4) quantization to compress model weights to 4 bits per parameter, reducing GPU memory by approximately 75% compared to FP16. Weights are quantized lazily on first device transfer and dequantized on-the-fly during each forward pass. The workflow supports optional double quantization (compressing the quantization statistics themselves) and configurable compute dtypes for the dequantized matmul operations.
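The memory reduction can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it ignores unquantized embeddings, layer norms, and activation memory, and omits double quantization (which would shrink the absmax overhead further).

```python
# Rough memory estimate for a 7B-parameter model (illustrative only;
# ignores unquantized embeddings/norms and activation memory).

def fp16_gib(n_params: int) -> float:
    # FP16: 2 bytes per parameter
    return n_params * 2 / 2**30

def nf4_gib(n_params: int, blocksize: int = 64) -> float:
    # NF4: 4 bits (0.5 bytes) per parameter, plus one fp32 absmax per block
    # (without double quantization, which would compress the absmax values).
    return (n_params * 0.5 + (n_params / blocksize) * 4) / 2**30

n = 7_000_000_000
print(f"FP16: {fp16_gib(n):.1f} GiB")  # ~13.0 GiB
print(f"NF4:  {nf4_gib(n):.1f} GiB")   # ~3.7 GiB, roughly a 72% reduction
```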

Usage

Execute this workflow when you need to run inference on a large language model (7B+ parameters) but have limited GPU memory (e.g., less than 16GB VRAM). This is the standard path for deploying quantized models on consumer GPUs and is fully compatible with Hugging Face Transformers pipelines and with torch.compile.

Execution Steps

Step 1: Configure Quantization Parameters

Define the quantization configuration specifying 4-bit mode. This includes selecting the quantization type (NF4 for better quality or FP4 for simpler representation), the compute dtype used during dequantized matmul operations (bfloat16 recommended for modern GPUs), whether to use double quantization to further compress quantization statistics, and the storage dtype for the packed quantized weights.

Key considerations:

  • NF4 is preferred over FP4 for most use cases as it better represents normally-distributed weight values
  • Setting compute_dtype to bfloat16 provides the best speed and numerical stability on Ampere+ GPUs
  • Double quantization adds a small overhead but further reduces memory by quantizing the per-block absmax statistics
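With Hugging Face Transformers, these choices map onto a BitsAndBytesConfig. A minimal sketch, assuming transformers and bitsandbytes are installed:

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization config: NF4 levels, bf16 compute, double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # "nf4" (recommended) or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype for dequantized matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the absmax stats
)
```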

Step 2: Load Model with Quantization

Load the pretrained model from the Hugging Face Hub or local path, passing the quantization configuration. The model loader replaces standard nn.Linear layers with Linear4bit layers, wrapping weights in Params4bit parameter objects. At this stage, weights remain in their original precision on CPU.

Key considerations:

  • The device_map="auto" option distributes layers across available GPUs and CPU as needed
  • The torch_dtype parameter sets the dtype for non-quantized parameters (embeddings, layer norms)
  • Weights are NOT quantized during loading; quantization is deferred to device transfer
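A loading sketch using the Transformers API (the model ID is a placeholder; substitute any causal LM checkpoint you have access to; this requires a download and sufficient device memory):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Placeholder checkpoint; any causal LM on the Hub or a local path works.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,   # swaps nn.Linear for Linear4bit
    device_map="auto",                # distribute layers across devices
    torch_dtype=torch.bfloat16,       # dtype for non-quantized params
)
```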

Step 3: Transfer to Device (Triggers Quantization)

When the model (or individual layers) is moved to a GPU via .to(device) or device_map placement, the Params4bit.to() method triggers lazy quantization. Each weight tensor is quantized block-wise: the tensor is divided into fixed-size blocks (default 64 elements), each block is normalized by its absolute maximum, and values are mapped to the nearest 4-bit quantization level. The resulting packed weights and QuantState metadata (absmax scales, block size, nested quantization state) are stored.

Key considerations:

  • Quantization is a one-time operation triggered on first device transfer
  • A block size of 64 is the default on CUDA; ROCm uses 128 (due to differing warp/wavefront sizes)
  • The QuantState object preserves all metadata needed for dequantization
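The block-wise algorithm can be simulated in plain NumPy. This is a simplified illustration, not the bitsandbytes kernels (which pack two 4-bit indices per byte and run on the accelerator); the 16 NF4 levels below are rounded values of the code table from the QLoRA reference implementation:

```python
import numpy as np

# The 16 NF4 quantization levels (rounded; from the QLoRA reference code table).
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def quantize_block(block: np.ndarray):
    """Map one block of weights to 4-bit NF4 indices plus its absmax scale."""
    absmax = np.abs(block).max()
    scaled = block / absmax                       # normalize into [-1, 1]
    idx = np.abs(scaled[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax

def dequantize_block(idx: np.ndarray, absmax: float) -> np.ndarray:
    """Recover approximate weights from the indices and stored absmax."""
    return NF4_LEVELS[idx] * absmax

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)        # one 64-element block
idx, absmax = quantize_block(w)
w_hat = dequantize_block(idx, absmax)
print("max abs error:", np.abs(w - w_hat).max())
```

In the real library, the `idx` array corresponds to the packed weight storage and `absmax` to the per-block scale kept in QuantState.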

Step 4: Prepare Input Tokens

Tokenize the input text using the model's corresponding tokenizer. The tokenized input IDs must be placed on the same device as the model for inference.

Key considerations:

  • Ensure the tokenizer matches the model (same vocabulary and special tokens)
  • Move input tensors to the model's device before running forward pass
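A tokenization sketch (the checkpoint name is a placeholder and must match the model loaded above; `model` is the quantized model from the loading step):

```python
from transformers import AutoTokenizer

# Must be the tokenizer that ships with the same checkpoint as the model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("The capital of France is", return_tensors="pt")
# Move input tensors to the model's device before the forward pass.
inputs = {k: v.to(model.device) for k, v in inputs.items()}
```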

Step 5: Run Forward Pass (Dequantize and Matmul)

During the forward pass, each Linear4bit layer dequantizes its weights on-the-fly. The packed 4-bit weights are unpacked, scaled using the stored absmax values, and cast to the compute dtype. The dequantized weights are then used in a standard matrix multiplication with the input activations. The result is cast back to the input dtype. This dequantization happens every forward pass; no full-precision weights are stored persistently.

Key considerations:

  • Dequantization is performed by the backend-specific implementation (CUDA, Triton, CPU, or default PyTorch)
  • The matmul_4bit function dispatches to the appropriate backend via torch.library custom ops
  • Memory usage stays low because only one layer's weights are dequantized at a time during the forward pass
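The dtype flow inside a Linear4bit forward can be sketched in pure PyTorch. This simulates only the cast → matmul → cast-back sequence; the weight here stands in for an already-dequantized tensor (the real kernels unpack the 4-bit storage and apply the absmax scales first). Float32 compute is used so the sketch runs on CPU; on Ampere+ GPUs bfloat16 is the recommended compute dtype:

```python
import torch

def linear4bit_forward_sketch(x: torch.Tensor, w_dequantized: torch.Tensor,
                              compute_dtype=torch.float32) -> torch.Tensor:
    """Mimic the dtype flow of a 4-bit linear layer: cast inputs and the
    (already dequantized) weight to the compute dtype, matmul, then cast
    the result back to the input dtype."""
    input_dtype = x.dtype
    out = x.to(compute_dtype) @ w_dequantized.to(compute_dtype).T
    return out.to(input_dtype)

x = torch.randn(2, 8, dtype=torch.bfloat16)   # activations
w = torch.randn(4, 8)                         # stand-in "dequantized" weight
y = linear4bit_forward_sketch(x, w)
print(y.dtype)  # torch.bfloat16
```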

Step 6: Generate Output

Use the model's generate method (or manual autoregressive loop) to produce output tokens. Each generation step triggers a forward pass through the quantized model, repeating the dequantize-matmul cycle for every Linear4bit layer.

Key considerations:

  • Generation speed depends on compute_dtype; bfloat16 is fastest on modern hardware
  • torch.compile can be applied to the model for additional performance optimization
  • The generated token IDs are decoded back to text using the tokenizer
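An end-to-end generation sketch, assuming the quantized `model` and matching `tokenizer` from the earlier steps are in scope (generation parameters are illustrative):

```python
import torch

# `model` and `tokenizer` come from the loading and tokenization steps above.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)

with torch.no_grad():
    # Each decoding step runs a forward pass through the quantized model,
    # repeating the dequantize-matmul cycle in every Linear4bit layer.
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```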
