
Implementation:Axolotl ai cloud Axolotl ModelLoader Load

From Leeroopedia


Knowledge Sources

  • Domains: Model_Loading, Quantization
  • Last Updated: 2026-02-06 23:00 GMT

Overview

A concrete tool, provided by the Axolotl framework, for loading pre-trained language models with optional quantization.

Description

The ModelLoader class handles the complete model loading pipeline in Axolotl. It configures quantization (4-bit NF4, 8-bit INT8, GPTQ), sets up device mapping for multi-GPU training, applies model-specific patches (flash attention, RoPE scaling), and instantiates the model via Hugging Face's AutoModelForCausalLM. The load() method orchestrates the full pipeline and returns the model along with an optional PeftConfig.

Key responsibilities include:

  • Configuring BitsAndBytesConfig for quantized loading
  • Setting up device maps for model parallelism
  • Applying monkey patches for optimized training
  • Handling model architecture-specific quirks (embedding resizing, dtype fixes)
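The responsibilities above can be sketched as a single config-to-kwargs mapping. The following is a hypothetical, simplified illustration, not Axolotl's actual implementation (the real logic spans src/axolotl/loaders/model.py); the helper name and the dict-based quantization settings are invented for this sketch.

```python
def build_load_kwargs(cfg: dict) -> dict:
    """Sketch: map YAML-style config flags to model-loading kwargs.

    Hypothetical helper for illustration only; ModelLoader performs
    a far more involved version of this mapping internally.
    """
    kwargs = {}

    # Quantization: 4-bit (NF4 by default) takes precedence over 8-bit.
    if cfg.get("load_in_4bit"):
        kwargs["quantization_config"] = {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": cfg.get("quant_type", "nf4"),
            "bnb_4bit_compute_dtype": "bfloat16" if cfg.get("bf16") else "float16",
        }
    elif cfg.get("load_in_8bit"):
        kwargs["quantization_config"] = {"load_in_8bit": True}

    # Device placement: let the loader shard across GPUs unless overridden.
    kwargs["device_map"] = cfg.get("device_map", "auto")
    return kwargs
```

In the real loader these settings end up in a transformers BitsAndBytesConfig rather than a plain dict, and patches are applied before the model is instantiated.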

Usage

Use this implementation when loading a causal language model for QLoRA/LoRA fine-tuning. The ModelLoader handles all quantization configuration automatically based on the YAML config.
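As an illustration, a minimal QLoRA config of the kind ModelLoader consumes might look like the following. Field names follow Axolotl's documented YAML schema; the model name and values are placeholders, not recommendations:

```yaml
base_model: meta-llama/Llama-3.2-1B
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
bf16: auto
flash_attention: true
```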

Code Reference

Source Location

  • Repository: axolotl
  • File: src/axolotl/loaders/model.py
  • Lines: L67-883 (class), L98-144 (init), L162-191 (load method), L515-597 (quantization config), L698-815 (build model)

Signature

class ModelLoader:
    """Load pretrained models with quantization and patching support."""

    def __init__(
        self,
        cfg: DictDefault,
        tokenizer: PreTrainedTokenizerBase,
        *,
        inference: bool = False,
        reference_model: bool = False,
        **kwargs,
    ):
        """
        Args:
            cfg: Training configuration dictionary.
            tokenizer: Pre-loaded tokenizer instance.
            inference: Whether loading for inference (disables training optimizations).
            reference_model: Whether loading as DPO reference model.
            **kwargs: Additional keyword arguments.
        """

    def load(self) -> tuple[PreTrainedModel | PeftModelForCausalLM, PeftConfig | None]:
        """Load and configure the model.

        Returns:
            Tuple of (model instance with quantization applied, PeftConfig or None).
        """

Import

from axolotl.loaders.model import ModelLoader

I/O Contract

Inputs

  • cfg (DictDefault, required): Config with base_model, load_in_4bit, load_in_8bit, quant_type, bf16/fp16, flash_attention, device_map, etc.
  • tokenizer (PreTrainedTokenizerBase, required): Pre-loaded tokenizer, used for embedding resizing.
  • inference (bool, default False): Load for inference only (skips training optimizations).
  • reference_model (bool, default False): Load as a DPO reference model.

Outputs

  • model (PreTrainedModel or PeftModelForCausalLM): Loaded model with quantization and patches applied.
  • peft_config (PeftConfig or None): PEFT configuration if an adapter was loaded from a checkpoint; otherwise None.

Usage Examples

Loading a QLoRA Model

from axolotl.loaders.model import ModelLoader
from axolotl.loaders.tokenizer import load_tokenizer
from axolotl.utils.dict import DictDefault

# Config specifying 4-bit NF4 quantization
cfg = DictDefault({
    "base_model": "meta-llama/Llama-3.2-1B",
    "load_in_4bit": True,
    "quant_type": "nf4",
})

tokenizer = load_tokenizer(cfg)
loader = ModelLoader(cfg, tokenizer)
model, peft_config = loader.load()

print(model.dtype)  # compute dtype, e.g. torch.float16
print(model.config.quantization_config)  # BitsAndBytesConfig

Loading for Inference

loader = ModelLoader(cfg, tokenizer, inference=True)
model, _ = loader.load()

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
