Principle:Axolotl ai cloud Axolotl Model Loading Quantized

From Leeroopedia


Knowledge Sources
Domains Model_Loading, Quantization, Memory_Optimization
Last Updated 2026-02-06 23:00 GMT

Overview

A technique that applies quantization (4-bit or 8-bit) as pre-trained weights are loaded, reducing GPU memory consumption while preserving model quality for fine-tuning.

Description

Quantized Model Loading is the process of loading pre-trained model weights with reduced numerical precision. Instead of loading weights as 16-bit or 32-bit floating point values, the model is loaded in 4-bit (NF4 or FP4) or 8-bit integer format. This dramatically reduces GPU memory requirements, enabling fine-tuning of large models (7B-70B+ parameters) on consumer hardware.

This technique is central to QLoRA (Quantized Low-Rank Adaptation), which combines 4-bit NormalFloat quantization with LoRA adapters to achieve near full-precision fine-tuning quality at a fraction of the memory cost. The quantized base model weights are frozen, and only the small LoRA adapter weights are trained in higher precision.
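The asymmetry between the frozen quantized base and the small trained adapters can be made concrete with a quick parameter count. A minimal sketch, where the layer size and LoRA rank below are illustrative assumptions rather than values from this page:

```python
# Hypothetical illustration: trainable LoRA parameters vs. frozen base weights
# for a single d_out x d_in linear layer (sizes are assumptions).

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

def full_param_count(d_in: int, d_out: int) -> int:
    return d_in * d_out

d = 4096  # typical hidden size for a 7B model (assumption)
r = 16    # a common LoRA rank

full = full_param_count(d, d)      # 16,777,216 weights, frozen in 4-bit
lora = lora_param_count(d, d, r)   # 131,072 weights, trained in higher precision

print(f"trainable fraction: {lora / full:.4%}")  # under 1% of the layer
```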

In Axolotl, the ModelLoader class handles quantized loading by configuring BitsAndBytesConfig from the YAML config, applying it during model instantiation via HuggingFace Transformers, and performing post-load patching for optimized training.
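As a rough sketch of that flow (not Axolotl's actual implementation; the mapping logic is a simplified assumption, though `load_in_4bit`/`load_in_8bit` are real Axolotl YAML keys and the output keys mirror transformers' `BitsAndBytesConfig` arguments):

```python
# Simplified sketch of translating parsed YAML options into kwargs
# for transformers' BitsAndBytesConfig. Axolotl's real logic differs.

def build_bnb_kwargs(cfg: dict) -> dict:
    """Map a parsed YAML config dict to BitsAndBytesConfig-style kwargs."""
    if cfg.get("load_in_4bit"):
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",          # NormalFloat4
            "bnb_4bit_use_double_quant": True,     # quantize the quant constants too
            "bnb_4bit_compute_dtype": "bfloat16",  # matmuls run in higher precision
        }
    if cfg.get("load_in_8bit"):
        return {"load_in_8bit": True}
    return {}  # full-precision loading

kwargs = build_bnb_kwargs({"load_in_4bit": True})
# These kwargs would then feed BitsAndBytesConfig(**kwargs), passed to
# AutoModelForCausalLM.from_pretrained(...) during model instantiation.
```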

Usage

Use quantized model loading when:

  • Fine-tuning models that exceed available GPU VRAM at full precision
  • Using QLoRA or LoRA with quantized base models
  • Training on consumer GPUs (e.g., RTX 3090/4090 with 24GB VRAM)
  • Memory efficiency is more important than raw training speed
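A back-of-the-envelope weight-memory estimate shows why the 24 GB consumer-GPU case works (weights only; activations, optimizer state, and framework overhead are extra):

```python
# Rough weight-memory estimate for a 7B-parameter model at different precisions.

def weight_gb(n_params: float, bits: int) -> float:
    """Bytes needed for the weights alone, in gigabytes."""
    return n_params * bits / 8 / 1e9

n = 7e9
for name, bits in [("fp16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{name}: {weight_gb(n, bits):.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, nf4: 3.5 GB -- only the 4-bit variant
# leaves comfortable headroom for training on a 24 GB card.
```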

Theoretical Basis

Quantization reduces the bit-width of model parameters:

W_quantized = quantize(W_fp16, dtype = nf4)

NormalFloat4 (NF4) quantization uses a non-uniform quantization scheme optimized for normally-distributed neural network weights:

  1. Compute quantiles of a standard normal distribution for 16 levels
  2. Map each weight to its nearest quantile
  3. Store as 4-bit indices with per-block scaling factors
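The three steps can be sketched in a toy implementation. Note this is an approximation: the real NF4 codebook from the QLoRA paper is constructed somewhat differently, while the sketch below simply takes 16 evenly spaced quantiles of N(0, 1):

```python
# Toy NF4-style quantization following the three steps above (approximate codebook).
from statistics import NormalDist

# Step 1: 16 quantile-based levels of a standard normal, rescaled to [-1, 1].
raw = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
m = max(abs(x) for x in raw)
levels = [x / m for x in raw]  # the 4-bit codebook

def quantize_block(block: list[float]) -> tuple[list[int], float]:
    """Steps 2-3: scale by per-block absmax, map each weight to its nearest level."""
    scale = max(abs(w) for w in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / scale - levels[i])) for w in block]
    return idx, scale  # 4-bit indices plus one scaling factor per block

def dequantize_block(idx: list[int], scale: float) -> list[float]:
    return [levels[i] * scale for i in idx]

weights = [0.12, -0.05, 0.33, -0.41]      # one tiny "block" (real block size is 64)
idx, scale = quantize_block(weights)
approx = dequantize_block(idx, scale)     # close to the originals, at 4 bits each
```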

Double Quantization further compresses the quantization constants themselves. With block size 64 and 4-bit weights:

memory = N × 4 + (N / 64) × 32 bits (single quantization)
memory = N × 4 + (N / 64) × 8 bits (double quantization)
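Plugging numbers into the formula for a 7B model (this sketch ignores the small second-level constants that double quantization itself introduces):

```python
# Memory in bits for N 4-bit weights plus per-block quantization constants,
# with block size 64. Second-level constants from double quantization omitted.

def quant_bits(n: int, double: bool) -> float:
    const = 8 if double else 32  # bits per per-block scaling constant
    return n * 4 + (n / 64) * const

n = 7_000_000_000
single_gb = quant_bits(n, double=False) / 8 / 1e9
double_gb = quant_bits(n, double=True) / 8 / 1e9
print(f"single: {single_gb:.2f} GB, double: {double_gb:.2f} GB")
# The constants shrink from 0.5 to 0.125 bits per parameter on average.
```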

Key trade-offs:

  • 4-bit NF4: ~4x memory reduction, minimal quality loss with LoRA
  • 8-bit INT8: ~2x memory reduction, near-lossless
  • GPTQ: Post-training quantization, requires calibration dataset

Related Pages

Implemented By
