Principle:Axolotl ai cloud Axolotl Model Loading Quantized

From Leeroopedia


Knowledge Sources
Domains Model_Loading, Quantization, Memory_Optimization
Last Updated 2026-02-06 23:00 GMT

Overview

A technique that applies quantization (4-bit or 8-bit) as pre-trained weights are loaded, reducing GPU memory consumption while preserving model quality for fine-tuning.

Description

Quantized Model Loading is the process of loading pre-trained model weights with reduced numerical precision. Instead of loading weights as 16-bit or 32-bit floating point values, the model is loaded in 4-bit (NF4 or FP4) or 8-bit integer format. This dramatically reduces GPU memory requirements, enabling fine-tuning of large models (7B-70B+ parameters) on consumer hardware.

This technique is central to QLoRA (Quantized Low-Rank Adaptation), which combines 4-bit NormalFloat quantization with LoRA adapters to achieve near full-precision fine-tuning quality at a fraction of the memory cost. The quantized base model weights are frozen, and only the small LoRA adapter weights are trained in higher precision.
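The asymmetry between the frozen quantized base and the small trained adapters can be made concrete with a quick parameter count. A minimal sketch, where the layer size and LoRA rank below are illustrative assumptions rather than values from this page:

```python
# Hypothetical illustration: trainable LoRA parameters vs. frozen base weights
# for a single d_out x d_in linear layer (sizes are assumptions).

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

def full_param_count(d_in: int, d_out: int) -> int:
    return d_in * d_out

d = 4096  # typical hidden size for a 7B model (assumption)
r = 16    # a common LoRA rank

full = full_param_count(d, d)      # 16,777,216 weights, frozen in 4-bit
lora = lora_param_count(d, d, r)   # 131,072 weights, trained in higher precision

print(f"trainable fraction: {lora / full:.4%}")  # under 1% of the layer
```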

In Axolotl, the ModelLoader class handles quantized loading by configuring BitsAndBytesConfig from the YAML config, applying it during model instantiation via HuggingFace Transformers, and performing post-load patching for optimized training.
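As a rough sketch of that flow (not Axolotl's actual implementation; the mapping logic is a simplified assumption, though `load_in_4bit`/`load_in_8bit` are real Axolotl YAML keys and the output keys mirror transformers' `BitsAndBytesConfig` arguments):

```python
# Simplified sketch of translating parsed YAML options into kwargs
# for transformers' BitsAndBytesConfig. Axolotl's real logic differs.

def build_bnb_kwargs(cfg: dict) -> dict:
    """Map a parsed YAML config dict to BitsAndBytesConfig-style kwargs."""
    if cfg.get("load_in_4bit"):
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",          # NormalFloat4
            "bnb_4bit_use_double_quant": True,     # quantize the quant constants too
            "bnb_4bit_compute_dtype": "bfloat16",  # matmuls run in higher precision
        }
    if cfg.get("load_in_8bit"):
        return {"load_in_8bit": True}
    return {}  # full-precision loading

kwargs = build_bnb_kwargs({"load_in_4bit": True})
# These kwargs would then feed BitsAndBytesConfig(**kwargs), passed to
# AutoModelForCausalLM.from_pretrained(...) during model instantiation.
```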

Usage

Use quantized model loading when:

  • Fine-tuning models that exceed available GPU VRAM at full precision
  • Using QLoRA or LoRA with quantized base models
  • Training on consumer GPUs (e.g., RTX 3090/4090 with 24GB VRAM)
  • Memory efficiency is more important than raw training speed
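A back-of-the-envelope weight-memory estimate shows why the 24 GB consumer-GPU case works (weights only; activations, optimizer state, and framework overhead are extra):

```python
# Rough weight-memory estimate for a 7B-parameter model at different precisions.

def weight_gb(n_params: float, bits: int) -> float:
    """Bytes needed for the weights alone, in gigabytes."""
    return n_params * bits / 8 / 1e9

n = 7e9
for name, bits in [("fp16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{name}: {weight_gb(n, bits):.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, nf4: 3.5 GB -- only the 4-bit variant
# leaves comfortable headroom for training on a 24 GB card.
```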

Theoretical Basis

Quantization reduces the bit-width of model parameters:

W_quantized = quantize(W_fp16, dtype = nf4)

NormalFloat4 (NF4) quantization uses a non-uniform quantization scheme optimized for normally-distributed neural network weights:

  1. Compute quantiles of a standard normal distribution for 16 levels
  2. Map each weight to its nearest quantile
  3. Store as 4-bit indices with per-block scaling factors
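The three steps can be sketched in a toy implementation. Note this is an approximation: the real NF4 codebook from the QLoRA paper is constructed somewhat differently, while the sketch below simply takes 16 evenly spaced quantiles of N(0, 1):

```python
# Toy NF4-style quantization following the three steps above (approximate codebook).
from statistics import NormalDist

# Step 1: 16 quantile-based levels of a standard normal, rescaled to [-1, 1].
raw = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
m = max(abs(x) for x in raw)
levels = [x / m for x in raw]  # the 4-bit codebook

def quantize_block(block: list[float]) -> tuple[list[int], float]:
    """Steps 2-3: scale by per-block absmax, map each weight to its nearest level."""
    scale = max(abs(w) for w in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / scale - levels[i])) for w in block]
    return idx, scale  # 4-bit indices plus one scaling factor per block

def dequantize_block(idx: list[int], scale: float) -> list[float]:
    return [levels[i] * scale for i in idx]

weights = [0.12, -0.05, 0.33, -0.41]      # one tiny "block" (real block size is 64)
idx, scale = quantize_block(weights)
approx = dequantize_block(idx, scale)     # close to the originals, at 4 bits each
```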

Double Quantization further compresses the quantization constants themselves. With block size 64 and 4-bit weights:

memory = N × 4 + (N / 64) × 32 bits (single quantization)
memory = N × 4 + (N / 64) × 8 bits (double quantization)
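Plugging numbers into the formula for a 7B model (this sketch ignores the small second-level constants that double quantization itself introduces):

```python
# Memory in bits for N 4-bit weights plus per-block quantization constants,
# with block size 64. Second-level constants from double quantization omitted.

def quant_bits(n: int, double: bool) -> float:
    const = 8 if double else 32  # bits per per-block scaling constant
    return n * 4 + (n / 64) * const

n = 7_000_000_000
single_gb = quant_bits(n, double=False) / 8 / 1e9
double_gb = quant_bits(n, double=True) / 8 / 1e9
print(f"single: {single_gb:.2f} GB, double: {double_gb:.2f} GB")
# The constants shrink from 0.5 to 0.125 bits per parameter on average.
```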

Key trade-offs:

  • 4-bit NF4: ~4x memory reduction, minimal quality loss with LoRA
  • 8-bit INT8: ~2x memory reduction, near-lossless
  • GPTQ: Post-training quantization, requires calibration dataset

Related Pages

Implemented By
