
Implementation:Axolotl ai cloud Axolotl ModelLoader Load

From Leeroopedia


Knowledge Sources

  • Domains: Model_Loading, Quantization
  • Last Updated: 2026-02-06 23:00 GMT

Overview

A concrete tool, provided by the Axolotl framework, for loading pre-trained language models with optional quantization.

Description

The ModelLoader class handles the complete model loading pipeline in Axolotl. It configures quantization (4-bit NF4, 8-bit INT8, GPTQ), sets up device mapping for multi-GPU training, applies model-specific patches (flash attention, RoPE scaling), and instantiates the model via Hugging Face's AutoModelForCausalLM. The load() method orchestrates the full pipeline and returns the model along with an optional PeftConfig.

Key responsibilities include:

  • Configuring BitsAndBytesConfig for quantized loading
  • Setting up device maps for model parallelism
  • Applying monkey patches for optimized training
  • Handling model architecture-specific quirks (embedding resizing, dtype fixes)
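The responsibilities above can be sketched as a single config-to-kwargs mapping. The following is a hypothetical, simplified illustration, not Axolotl's actual implementation (the real logic spans src/axolotl/loaders/model.py); the helper name and the dict-based quantization settings are invented for this sketch.

```python
def build_load_kwargs(cfg: dict) -> dict:
    """Sketch: map YAML-style config flags to model-loading kwargs.

    Hypothetical helper for illustration only; ModelLoader performs
    a far more involved version of this mapping internally.
    """
    kwargs = {}

    # Quantization: 4-bit (NF4 by default) takes precedence over 8-bit.
    if cfg.get("load_in_4bit"):
        kwargs["quantization_config"] = {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": cfg.get("quant_type", "nf4"),
            "bnb_4bit_compute_dtype": "bfloat16" if cfg.get("bf16") else "float16",
        }
    elif cfg.get("load_in_8bit"):
        kwargs["quantization_config"] = {"load_in_8bit": True}

    # Device placement: let the loader shard across GPUs unless overridden.
    kwargs["device_map"] = cfg.get("device_map", "auto")
    return kwargs
```

In the real loader these settings end up in a transformers BitsAndBytesConfig rather than a plain dict, and patches are applied before the model is instantiated.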

Usage

Use this implementation when loading a causal language model for QLoRA/LoRA fine-tuning. The ModelLoader handles all quantization configuration automatically based on the YAML config.
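As an illustration, a minimal QLoRA config of the kind ModelLoader consumes might look like the following. Field names follow Axolotl's documented YAML schema; the model name and values are placeholders, not recommendations:

```yaml
base_model: meta-llama/Llama-3.2-1B
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
bf16: auto
flash_attention: true
```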

Code Reference

Source Location

  • Repository: axolotl
  • File: src/axolotl/loaders/model.py
  • Lines: L67-883 (class), L98-144 (init), L162-191 (load method), L515-597 (quantization config), L698-815 (build model)

Signature

class ModelLoader:
    """Load pretrained models with quantization and patching support."""

    def __init__(
        self,
        cfg: DictDefault,
        tokenizer: PreTrainedTokenizerBase,
        *,
        inference: bool = False,
        reference_model: bool = False,
        **kwargs,
    ):
        """
        Args:
            cfg: Training configuration dictionary.
            tokenizer: Pre-loaded tokenizer instance.
            inference: Whether loading for inference (disables training optimizations).
            reference_model: Whether loading as DPO reference model.
            **kwargs: Additional keyword arguments.
        """

    def load(self) -> tuple[PreTrainedModel | PeftModelForCausalLM, PeftConfig | None]:
        """Load and configure the model.

        Returns:
            Tuple of (model instance with quantization applied, PeftConfig or None).
        """

Import

from axolotl.loaders.model import ModelLoader

I/O Contract

Inputs

  • cfg (DictDefault, required): Config with base_model, load_in_4bit, load_in_8bit, quant_type, bf16/fp16, flash_attention, device_map, etc.
  • tokenizer (PreTrainedTokenizerBase, required): Pre-loaded tokenizer, used for embedding resizing.
  • inference (bool, default False): Load for inference only (skips training optimizations).
  • reference_model (bool, default False): Load as a DPO reference model.

Outputs

  • model (PreTrainedModel or PeftModelForCausalLM): Loaded model with quantization and patches applied.
  • peft_config (PeftConfig or None): PEFT configuration if an adapter was loaded from a checkpoint; otherwise None.

Usage Examples

Loading a QLoRA Model

from axolotl.loaders.model import ModelLoader
from axolotl.loaders.tokenizer import load_tokenizer
from axolotl.utils.dict import DictDefault

# Config specifying 4-bit NF4 quantization
cfg = DictDefault({
    "base_model": "meta-llama/Llama-3.2-1B",
    "load_in_4bit": True,
    "quant_type": "nf4",
})

tokenizer = load_tokenizer(cfg)
loader = ModelLoader(cfg, tokenizer)
model, peft_config = loader.load()

print(model.dtype)  # compute dtype, e.g. torch.float16
print(model.config.quantization_config)  # BitsAndBytesConfig

Loading for Inference

loader = ModelLoader(cfg, tokenizer, inference=True)
model, _ = loader.load()

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
