Workflow:Huggingface Transformers Model Quantization
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Quantization, Inference, Optimization |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
End-to-end process for loading and running pretrained models with reduced-precision (quantized) weights to reduce memory usage while largely preserving model quality.
Description
This workflow covers loading pretrained Transformer models with quantized weights using one of the roughly 20 supported quantization backends. The primary focus is bitsandbytes 4-bit quantization (NF4/FP4), which shrinks the quantized weights to approximately one quarter of their float16 size. The process includes configuring quantization parameters, loading the model with on-the-fly weight conversion, running inference with quantized weights, and optionally combining quantization with PEFT adapters for memory-efficient fine-tuning. Supported backends include bitsandbytes, GPTQ, AWQ, GGML/GGUF, TorchAO, HQQ, and others.
Usage
Execute this workflow when you need to run inference or fine-tuning on large language models but have limited GPU memory. Quantization is appropriate when a model's unquantized weights exceed available VRAM. For example, a 7B-parameter model needs about 14 GB for its float16 weights alone, which is more than many consumer GPUs provide. The trade-off is a typically small reduction in output quality in exchange for significant memory savings.
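The arithmetic behind this decision can be sketched in a few lines. This is a rough estimate only: the helper name `weight_memory_gib` and the 7B parameter count are illustrative assumptions, and the figure covers weights alone (no activations, KV cache, or quantization constants).

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Illustrative 7B-parameter model at three weight precisions.
NUM_PARAMS = 7e9
estimates = {
    "float16": weight_memory_gib(NUM_PARAMS, 16),
    "int8": weight_memory_gib(NUM_PARAMS, 8),
    "4-bit": weight_memory_gib(NUM_PARAMS, 4),
}

for name, gib in estimates.items():
    print(f"{name:>8}: {gib:5.1f} GiB")
```

For this example the estimates come to roughly 13 GiB, 6.5 GiB, and 3.3 GiB. Real footprints are somewhat higher because some layers stay unquantized and inference needs working memory.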
Execution Steps
Step 1: Select Quantization Backend
Choose the quantization backend appropriate for your use case. bitsandbytes is the most common choice for on-the-fly quantization during model loading. The pre-quantized formats (GPTQ, AWQ, GGML/GGUF) instead require a checkpoint that was already quantized offline. Weigh quantization quality, memory savings, and inference speed against each other.
Key considerations:
- bitsandbytes: Best for on-the-fly quantization, supports 4-bit and 8-bit
- GPTQ: Requires pre-quantized model weights, good inference speed
- AWQ: Activation-aware quantization, requires pre-quantized weights
- GGML/GGUF: Compatible with llama.cpp ecosystem
- TorchAO: Native PyTorch quantization, supports various data types
Step 2: Configure Quantization Parameters
Create a quantization configuration object specifying the quantization scheme, data type, and optimization options. For bitsandbytes 4-bit, configure the quantization type (NF4 recommended), compute dtype, and whether to use double quantization for additional memory savings.
Key considerations:
- NF4 (NormalFloat4) quantization type generally produces better quality than FP4
- Double quantization also quantizes the quantization constants themselves, saving roughly 0.4 additional bits per parameter
- Compute dtype (float16 or bfloat16) affects the precision of intermediate computations
- Relative to float16 weights, 4-bit provides ~4x memory reduction; 8-bit provides ~2x with less quality loss
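The options above might be expressed as follows. This is a minimal sketch: the preset table `QUANT_PRESETS` and the helper `make_bnb_config` are hypothetical names introduced here, while `BitsAndBytesConfig` and its keyword arguments come from the transformers library. The compute dtype is given as a string, which the config resolves to the matching torch dtype.

```python
# Hypothetical preset table; keys are illustrative, values are
# keyword arguments accepted by transformers.BitsAndBytesConfig.
QUANT_PRESETS = {
    "nf4": {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",        # NormalFloat4
        "bnb_4bit_use_double_quant": True,   # quantize the constants too
        "bnb_4bit_compute_dtype": "bfloat16",
    },
    "fp4": {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "fp4",
        "bnb_4bit_compute_dtype": "bfloat16",
    },
    "int8": {
        "load_in_8bit": True,
    },
}


def make_bnb_config(preset: str = "nf4"):
    """Build a BitsAndBytesConfig from one of the presets above."""
    # Imported lazily so this module loads even where transformers
    # and bitsandbytes are not installed.
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**QUANT_PRESETS[preset])
```

Calling `make_bnb_config("nf4")` returns the object to pass as `quantization_config` when loading the model. Use `"float16"` as the compute dtype on GPUs without bfloat16 support.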
Step 3: Load Model with Quantization
Load the pretrained model using AutoModelForCausalLM.from_pretrained() with the quantization configuration. The model weights are quantized on-the-fly during loading. Use device_map="auto" for automatic placement across available devices.
Key considerations:
- Quantization happens during model loading, not as a separate step
- device_map="auto" distributes layers across the available devices based on free memory (this relies on the accelerate package)
- Linear layers are replaced with quantized equivalents (e.g., bnb.nn.Linear4bit)
- The original model architecture is preserved; only weight representation changes
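A loading call along these lines could look like the sketch below. The wrapper `load_4bit_model` is a hypothetical helper, and the model identifier is left to the caller; it assumes torch, transformers, bitsandbytes, and accelerate are installed and that a CUDA GPU is available.

```python
def load_4bit_model(model_id: str):
    """Load a causal LM with NF4 weights quantized on the fly."""
    # Heavy dependencies are imported lazily so this module can be
    # imported on a machine without the GPU stack installed.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,  # triggers quantization at load
        device_map="auto",                 # automatic device placement
    )
    return model, tokenizer
```

Usage would be `model, tokenizer = load_4bit_model("<hub-id-or-local-path>")`, where the argument is whichever checkpoint you intend to run.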
Step 4: Verify Quantization
Confirm that quantization was applied correctly by checking memory usage, verifying that linear layers are quantized, and validating that the model produces reasonable outputs.
Key considerations:
- Check that the weight memory is approximately 1/4 of the float16 model (for 4-bit); expect slightly more, since embeddings and normalization layers are not quantized
- Verify quantized layer types in the model's module list
- Run a sample inference to confirm output quality
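The first two checks can be scripted roughly as follows. The helper names are hypothetical; the sketch relies only on `named_modules()` (available on any torch module) and, when present, `get_memory_footprint()` from transformers' `PreTrainedModel`. It assumes the bitsandbytes 4-bit layer class is named `Linear4bit`, as mentioned in Step 3.

```python
from collections import Counter


def summarize_layer_types(model) -> Counter:
    """Count submodules by class name."""
    return Counter(type(module).__name__ for _, module in model.named_modules())


def check_quantized(model, expected_layer: str = "Linear4bit") -> dict:
    """Report how many layers were quantized and the memory footprint."""
    counts = summarize_layer_types(model)
    report = {
        "quantized_layers": counts.get(expected_layer, 0),
        # A few plain Linear layers (for example the output head) may
        # legitimately remain unquantized.
        "plain_linear_layers": counts.get("Linear", 0),
    }
    if hasattr(model, "get_memory_footprint"):
        report["footprint_gib"] = model.get_memory_footprint() / 2**30
    return report
```

A result with zero quantized layers, or a footprint close to the float16 size, suggests the quantization config was not applied.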
Step 5: Run Inference
Use the quantized model for inference exactly as you would a full-precision model. The quantized weights are dequantized on-the-fly during computation, making the API transparent to the user.
Key considerations:
- Inference API is identical to non-quantized models
- Dequantization happens automatically during the forward pass
- Batch size may need to be adjusted based on remaining memory
- Generation parameters (max_new_tokens, temperature, etc.) work as normal
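As an illustration, generation with a quantized model uses the same calls as with any other model. The wrapper `generate_text` and its default sampling values are assumptions made for this sketch; `model` and `tokenizer` are taken to be the objects returned by `from_pretrained()` in Step 3.

```python
def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Generate a continuation for `prompt` and return it as text."""
    # Tokenize and move the inputs to the device holding the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,      # required for temperature to take effect
        temperature=0.7,
    )
    # generate() returns a batch of sequences; decode the first one.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

If memory runs out during generation, lowering the batch size or `max_new_tokens` is usually the first thing to try.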
Step 6: Optional Fine-tuning with PEFT
Combine quantization with Parameter-Efficient Fine-Tuning by adding LoRA adapters on top of the frozen quantized base model. Only the small adapter weights are trained, enabling fine-tuning of large models on limited hardware.
Key considerations:
- Quantized base weights remain frozen; only adapter weights are trained
- This combination (QLoRA) enables fine-tuning of 7B+ models on consumer GPUs
- Adapter weights are saved separately from the quantized base model
- See the PEFT Adapter Integration workflow for detailed adapter training steps
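A minimal QLoRA-style setup might look like the sketch below. The helper names and the LoRA hyperparameters (`r`, `lora_alpha`, `lora_dropout`) are illustrative assumptions, not recommended values; `LoraConfig`, `get_peft_model`, and `prepare_model_for_kbit_training` come from the peft library, and `target_modules="all-linear"` assumes a reasonably recent peft release.

```python
def add_lora_adapters(model, r: int = 16, lora_alpha: int = 32,
                      lora_dropout: float = 0.05):
    """Wrap a quantized base model with trainable LoRA adapters."""
    # Imported lazily so this module loads without peft installed.
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Freezes the base weights and prepares the model for k-bit training.
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_config)


def trainable_fraction(model) -> float:
    """Share of parameter elements that require gradients.

    Approximate for 4-bit models, whose packed weights report fewer
    elements than the logical parameter count.
    """
    total = trainable = 0
    for param in model.parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable / total if total else 0.0
```

After training, calling `save_pretrained()` on the wrapped model writes only the adapter weights, which matches the point above about adapters being stored separately from the base model.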