Workflow:Huggingface Transformers Model Quantization

From Leeroopedia
Knowledge Sources
Domains: LLMs, Quantization, Inference, Optimization
Last Updated: 2026-02-13 20:00 GMT

Overview

End-to-end process for loading and running pretrained models with reduced-precision (quantized) weights, cutting memory usage while largely preserving model quality.

Description

This workflow covers loading pretrained Transformer models with quantized weights using one of the 20 supported quantization backends, which include bitsandbytes, GPTQ, AWQ, GGML/GGUF, TorchAO, and HQQ. The primary focus is bitsandbytes 4-bit quantization (NF4/FP4), which stores weights in roughly one quarter of the memory that float16 weights need. The process consists of configuring quantization parameters, loading the model with on-the-fly weight conversion, running inference with the quantized weights, and optionally combining quantization with PEFT adapters for memory-efficient fine-tuning.

Usage

Use this workflow when you need to run inference on, or fine-tune, large language models with limited GPU memory. Quantization is appropriate when a model's unquantized weights exceed available VRAM (e.g., running a 7B+ model on a single consumer GPU with less than 24 GB of VRAM). The trade-off is a small reduction in output quality in exchange for significant memory savings.
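The memory argument above can be checked with simple arithmetic. The sketch below is a rough estimate, assuming that weight storage dominates and ignoring activations, the KV cache, and quantization metadata; the function name weight_memory_gib is made up for this illustration.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given parameter count and bit width."""
    return num_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    # Compare storage for a 7B-parameter model at common precisions.
    for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label:>8}: {weight_memory_gib(7e9, bits):.1f} GiB")
```

Under these assumptions a 7B model needs about 13 GiB in float16 and a little over 3 GiB at 4 bits, which is why 4-bit loading can fit such a model on a GPU where the float16 weights would not.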

Execution Steps

Step 1: Select Quantization Backend

Choose the quantization backend appropriate for your use case. bitsandbytes is the most common choice for on-the-fly quantization during model loading. The GPTQ, AWQ, and GGML/GGUF formats instead rely on checkpoints that were quantized offline ahead of time. Weigh quantization quality, memory savings, and inference speed against each other when choosing.

Key considerations:

  • bitsandbytes: Best for on-the-fly quantization, supports 4-bit and 8-bit
  • GPTQ: Requires pre-quantized model weights, good inference speed
  • AWQ: Activation-aware quantization, requires pre-quantized weights
  • GGML/GGUF: Compatible with llama.cpp ecosystem
  • TorchAO: Native PyTorch quantization, supports various data types
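The considerations above can be condensed into a small lookup table. This is an illustrative sketch only: the BACKENDS dictionary and the on_the_fly_backends helper are hypothetical and not part of Transformers, while the config class names refer to classes exported by the transformers package. GGUF files are an exception in that they are passed through the gguf_file argument of from_pretrained rather than a config class.

```python
# Hypothetical summary table for the backends listed above.
# "needs_prequantized" means the checkpoint must already be quantized offline.
BACKENDS = {
    "bitsandbytes": {"config_class": "BitsAndBytesConfig", "needs_prequantized": False},
    "gptq": {"config_class": "GPTQConfig", "needs_prequantized": True},
    "awq": {"config_class": "AwqConfig", "needs_prequantized": True},
    # GGUF checkpoints are selected with from_pretrained(..., gguf_file=...).
    "gguf": {"config_class": None, "needs_prequantized": True},
    "torchao": {"config_class": "TorchAoConfig", "needs_prequantized": False},
}


def on_the_fly_backends() -> list[str]:
    """Return the backends that can quantize an unquantized checkpoint at load time."""
    return sorted(name for name, info in BACKENDS.items() if not info["needs_prequantized"])


if __name__ == "__main__":
    print(on_the_fly_backends())
```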

Step 2: Configure Quantization Parameters

Create a quantization configuration object specifying the quantization scheme, data type, and optimization options. For bitsandbytes 4-bit, configure the quantization type (NF4 recommended), compute dtype, and whether to use double quantization for additional memory savings.
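A minimal sketch of such a configuration for bitsandbytes 4-bit, assuming the transformers, torch, and bitsandbytes packages are installed. The NF4_KWARGS name and the build_nf4_config helper are made up for this example, and bfloat16 is an assumed compute dtype (use "float16" on GPUs without bfloat16 support).

```python
# Keyword arguments for BitsAndBytesConfig; values mirror the recommendations in this step.
NF4_KWARGS = {
    "load_in_4bit": True,                  # quantize linear-layer weights to 4 bits
    "bnb_4bit_quant_type": "nf4",          # NormalFloat4; "fp4" is the alternative
    "bnb_4bit_use_double_quant": True,     # also quantize the quantization constants
    "bnb_4bit_compute_dtype": "bfloat16",  # dtype name as a string; a torch.dtype also works
}


def build_nf4_config():
    """Build the config object; imported lazily so this file loads without transformers."""
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**NF4_KWARGS)
```

The returned object is what gets passed as quantization_config when loading the model.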

Key considerations:

  • NF4 (NormalFloat4) quantization type generally produces better quality than FP4
  • Double quantization adds a second level of quantization to the quantization constants
  • Compute dtype (float16 or bfloat16) affects the precision of intermediate computations
  • Relative to float16 weights, 4-bit provides ~4x memory reduction; 8-bit provides ~2x with less quality loss
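The effect of double quantization can be estimated with a short calculation. The sketch below assumes the scheme described in the QLoRA paper: one 32-bit constant per block of 64 weights, and, with double quantization, 8-bit constants plus one 32-bit second-level constant per 256 first-level constants. These block sizes are assumptions and may not match every bitsandbytes version; the function name is hypothetical.

```python
def effective_bits_per_param(
    double_quant: bool, block_size: int = 64, nested_block_size: int = 256
) -> float:
    """Storage cost per weight in bits, including quantization constants."""
    weight_bits = 4.0
    if not double_quant:
        # One 32-bit constant shared by each block of weights.
        return weight_bits + 32 / block_size
    # 8-bit first-level constants, plus 32-bit second-level constants
    # shared across nested_block_size first-level constants.
    return weight_bits + 8 / block_size + 32 / (block_size * nested_block_size)


if __name__ == "__main__":
    print(effective_bits_per_param(double_quant=False))
    print(effective_bits_per_param(double_quant=True))
```

Under these assumptions, double quantization lowers the cost from 4.5 to roughly 4.13 bits per weight, a saving of a few hundred megabytes on a 7B-parameter model.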

Step 3: Load Model with Quantization

Load the pretrained model using AutoModelForCausalLM.from_pretrained() with the quantization configuration. The model weights are quantized on-the-fly during loading. Use device_map="auto" for automatic placement across available devices.

Key considerations:

  • Quantization happens during model loading, not as a separate step
  • device_map="auto" distributes layers across GPUs based on available memory
  • Linear layers are replaced with quantized equivalents (e.g., bnb.nn.Linear4bit)
  • The original model architecture is preserved; only weight representation changes
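The loading step might look like the following sketch. It assumes transformers, accelerate, and bitsandbytes are installed and a CUDA GPU is available. The load_quantized helper and its model_id parameter are hypothetical; pass any causal language model identifier you have access to.

```python
def load_quantized(model_id: str):
    """Load a causal LM with bitsandbytes 4-bit weights and its tokenizer."""
    # Imported lazily so the module can be imported without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,  # weights are quantized while loading
        device_map="auto",                 # place layers on the available devices
    )
    return model, tokenizer
```

No separate conversion call is needed afterwards; the returned model already holds quantized linear layers.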

Step 4: Verify Quantization

Confirm that quantization was applied correctly by checking memory usage, verifying that linear layers are quantized, and validating that the model produces reasonable outputs.

Key considerations:

  • Check that memory usage is roughly 1/4 of the float16 model for 4-bit (typically slightly more, since some layers such as embeddings are not quantized)
  • Verify quantized layer types in the model's module list
  • Run a sample inference to confirm output quality
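The first two checks can be sketched as a small helper. It relies on get_memory_footprint(), which Transformers models provide, and on modules(), which every PyTorch module provides. The summarize_quantization name is made up here, and matching layers by class name (Linear4bit, Linear) is a simplification that assumes the bitsandbytes 4-bit backend.

```python
from collections import Counter


def summarize_quantization(model) -> dict:
    """Report the memory footprint and how many linear layers were replaced."""
    # Count modules by class name so bitsandbytes does not need to be imported here.
    layer_types = Counter(type(module).__name__ for module in model.modules())
    return {
        "footprint_gib": model.get_memory_footprint() / 2**30,
        "linear4bit_layers": layer_types.get("Linear4bit", 0),
        "plain_linear_layers": layer_types.get("Linear", 0),
    }
```

A successfully quantized model should report a large number of Linear4bit layers and only a few plain Linear layers (for example, an output head that was left unquantized).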

Step 5: Run Inference

Use the quantized model for inference exactly as you would a full-precision model. The quantized weights are dequantized on-the-fly during computation, making the API transparent to the user.

Key considerations:

  • Inference API is identical to non-quantized models
  • Dequantization happens automatically during the forward pass
  • Batch size may need to be adjusted based on remaining memory
  • Generation parameters (max_new_tokens, temperature, etc.) work as normal
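A minimal generation sketch, assuming model and tokenizer are a Transformers causal language model and its matching tokenizer that have already been loaded. The generate_text helper is hypothetical, and nothing in it is specific to quantization.

```python
def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Generate a continuation of the prompt with greedy decoding."""
    # Tokenize and move the input tensors to the device that holds the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # generate() returns a batch; decode the first (and only) sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Sampling parameters such as temperature only take effect when do_sample=True is also passed to generate().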

Step 6: Optional Fine-tuning with PEFT

Combine quantization with Parameter-Efficient Fine-Tuning by adding LoRA adapters on top of the frozen quantized base model. Only the small adapter weights are trained, enabling fine-tuning of large models on limited hardware.

Key considerations:

  • Quantized base weights remain frozen; only adapter weights are trained
  • This combination (QLoRA) enables fine-tuning of 7B+ models on consumer GPUs
  • Adapter weights are saved separately from the quantized base model
  • See the PEFT Adapter Integration workflow for detailed adapter training steps
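Attaching adapters to a quantized model could look like the sketch below, assuming the peft package is installed. The LORA_KWARGS values are illustrative defaults rather than tuned settings, and the target module names q_proj and v_proj are an assumption that fits Llama-style architectures; other models name their projection layers differently. The add_lora_adapters helper is hypothetical.

```python
# Illustrative LoRA hyperparameters; adjust the rank and target modules per model.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],  # assumed Llama-style layer names
    "task_type": "CAUSAL_LM",
}


def add_lora_adapters(quantized_model):
    """Wrap a quantized base model with trainable LoRA adapters."""
    # Imported lazily so this file loads without peft installed.
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Prepares the frozen quantized model for stable k-bit training.
    model = prepare_model_for_kbit_training(quantized_model)
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

Calling print_trainable_parameters() on the returned model shows how small the trainable fraction is, and save_pretrained() on it writes only the adapter weights.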
