Principle: Axolotl Quantized Model Loading
| Knowledge Sources | |
|---|---|
| Domains | Model_Loading, Quantization, Memory_Optimization |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A model loading technique that applies quantization (4-bit or 8-bit) during weight loading to reduce GPU memory consumption while preserving model quality for fine-tuning.
Description
Quantized Model Loading is the process of loading pre-trained model weights with reduced numerical precision. Instead of loading weights as 16-bit or 32-bit floating point values, the model is loaded in 4-bit (NF4 or FP4) or 8-bit integer format. This dramatically reduces GPU memory requirements, enabling fine-tuning of large models (7B-70B+ parameters) on consumer hardware.
This technique is central to QLoRA (Quantized Low-Rank Adaptation), which combines 4-bit NormalFloat quantization with LoRA adapters to achieve near full-precision fine-tuning quality at a fraction of the memory cost. The quantized base model weights are frozen, and only the small LoRA adapter weights are trained in higher precision.
In Axolotl, the ModelLoader class handles quantized loading by configuring BitsAndBytesConfig from the YAML config, applying it during model instantiation via HuggingFace Transformers, and performing post-load patching for optimized training.
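In config terms, this maps to a handful of YAML keys. A minimal sketch of a QLoRA-style setup (the model name and hyperparameter values are illustrative, not recommendations):

```yaml
base_model: meta-llama/Llama-2-7b-hf
load_in_4bit: true        # quantize base weights to 4-bit on load
adapter: qlora            # train LoRA adapters on the frozen 4-bit base
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true  # attach adapters to all linear layers
bf16: true                # higher-precision compute dtype for training
```

Axolotl translates these keys into a `BitsAndBytesConfig` that is passed to the HuggingFace `from_pretrained` call.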
Usage
Use quantized model loading when:
- Fine-tuning models that exceed available GPU VRAM at full precision
- Using QLoRA or LoRA with quantized base models
- Training on consumer GPUs (e.g., RTX 3090/4090 with 24GB VRAM)
- Prioritizing memory efficiency over raw training speed
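To see why the consumer-GPU case works, a back-of-envelope estimate of the VRAM needed for base-model weights alone is useful (illustrative only; it excludes activations, optimizer state, LoRA adapters, and CUDA overhead):

```python
# Rough VRAM estimate for base-model weights at a given precision.
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for model weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

n = 7e9  # a 7B-parameter model
print(f"fp16:  {weight_memory_gb(n, 16):.1f} GiB")  # ~13.0 GiB
print(f"4-bit: {weight_memory_gb(n, 4):.1f} GiB")   # ~3.3 GiB
```

At 4 bits, even a 7B model's weights fit comfortably in a 24GB card, leaving headroom for activations and adapter training.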
Theoretical Basis
Quantization reduces the bit-width used to store model parameters.
NormalFloat4 (NF4) quantization uses a non-uniform quantization scheme optimized for normally-distributed neural network weights:
- Compute quantiles of a standard normal distribution for 16 levels
- Map each weight to its nearest quantile
- Store as 4-bit indices with per-block scaling factors
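The three steps above can be sketched in plain Python. This is a conceptual toy, not the exact bitsandbytes NF4 codebook (which uses a specific asymmetric quantile construction); the block size and level placement here are illustrative:

```python
# Toy NF4-style quantizer: 16 levels from normal-distribution quantiles,
# per-block absmax scaling, weights stored as 4-bit indices.
from statistics import NormalDist

def nf4_levels() -> list[float]:
    # Quantiles of N(0, 1) at evenly spaced probabilities, rescaled to [-1, 1].
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

def quantize_block(weights: list[float]) -> tuple[list[int], float]:
    """Map each weight to its nearest level; return 4-bit indices + scale."""
    levels = nf4_levels()
    scale = max(abs(w) for w in weights) or 1.0  # per-block absmax constant
    idx = [min(range(16), key=lambda i: abs(w / scale - levels[i]))
           for w in weights]
    return idx, scale

def dequantize_block(idx: list[int], scale: float) -> list[float]:
    levels = nf4_levels()
    return [levels[i] * scale for i in idx]
```

Because the levels are dense near zero, where normally-distributed weights concentrate, the average rounding error is lower than with uniformly spaced 4-bit levels.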
Double Quantization further compresses the quantization constants themselves: the per-block 32-bit absmax scaling factors are quantized to 8-bit values with a second level of blocking, shrinking the constants' memory overhead from roughly 0.5 to about 0.13 bits per parameter.
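The saving is easy to verify with the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 constants per second-level block):

```python
# Quantization-constant overhead per weight, before and after
# double quantization (block sizes follow the QLoRA paper).
BLOCK = 64    # weights per first-level quantization block
BLOCK2 = 256  # first-level constants per second-level block

# Single quantization: one fp32 absmax constant per 64-weight block.
single = 32 / BLOCK                         # 0.5 bits per parameter

# Double quantization: 8-bit constants, plus one fp32 scale per 256 constants.
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)  # ~0.127 bits per parameter

print(f"saved ~{single - double:.3f} bits per parameter")  # ~0.373
```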
Key trade-offs:
- 4-bit NF4: ~4x memory reduction, minimal quality loss with LoRA
- 8-bit INT8: ~2x memory reduction, near-lossless
- GPTQ: Post-training quantization, requires calibration dataset