Implementation: Axolotl ModelLoader Load Multimodal (axolotl-ai-cloud/axolotl)
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Vision_Language, Model_Loading |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A concrete mechanism in the Axolotl framework for loading vision-language models, with the model class selected automatically from the model type.
Description
This implementation reuses the ModelLoader class but with multimodal-aware model class selection. The _set_auto_model_loader method checks if the model's type is present in MULTIMODAL_AUTO_MODEL_MAPPING (defined in constants.py). If found, it uses AutoModelForImageTextToText instead of the default AutoModelForCausalLM. The mapping includes all standard HuggingFace vision-language model types plus custom entries for newer models like "lfm2-vl" and "voxtral".
The MULTIMODAL_AUTO_MODEL_MAPPING is a dictionary built from HuggingFace's MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES with additional entries for models not yet in the official mapping.
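The construction pattern can be sketched with plain dictionaries. This is an illustrative stand-in only: the real constants.py uses transformers' MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES and the AutoModelForImageTextToText class, while the values below are plain strings so the sketch is self-contained.

```python
# Stand-in for transformers' MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
# (a small subset, values shown as strings for illustration).
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = {
    "llava": "LlavaForConditionalGeneration",
    "qwen2_vl": "Qwen2VLForConditionalGeneration",
}

# Start from the upstream mapping, then add entries for model types
# that transformers does not yet list, such as "lfm2-vl".
MULTIMODAL_AUTO_MODEL_MAPPING = dict(MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES)
MULTIMODAL_AUTO_MODEL_MAPPING["lfm2-vl"] = "AutoModelForImageTextToText"

print("lfm2-vl" in MULTIMODAL_AUTO_MODEL_MAPPING)  # True
```

Copying the upstream mapping and layering custom entries on top keeps the registry in sync with transformers while still supporting models ahead of the official release cycle.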
Usage
Used automatically when ModelLoader detects a multimodal model. No separate invocation needed; the multimodal path is selected based on model configuration.
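The selection step can be sketched as follows. This is a simplified stand-in for _set_auto_model_loader, not the actual implementation: class names are represented as strings so the sketch runs without transformers installed.

```python
# Illustrative registry of multimodal model types (strings stand in for
# the real transformers classes).
MULTIMODAL_AUTO_MODEL_MAPPING = {
    "llava": "LlavaForConditionalGeneration",
    "qwen2_vl": "Qwen2VLForConditionalGeneration",
}

def select_auto_model_class(model_type: str) -> str:
    # Multimodal model types get the image-text-to-text auto class;
    # everything else falls back to the causal-LM auto class.
    if model_type in MULTIMODAL_AUTO_MODEL_MAPPING:
        return "AutoModelForImageTextToText"
    return "AutoModelForCausalLM"

print(select_auto_model_class("llava"))  # AutoModelForImageTextToText
print(select_auto_model_class("llama"))  # AutoModelForCausalLM
```

Keying the dispatch on model_type (read from the model config) is what lets the multimodal path activate without any user-facing flag.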
Code Reference
Source Location
- Repository: axolotl
- Files: src/axolotl/loaders/model.py, src/axolotl/loaders/constants.py
- Lines: model.py L162-191 (load), L434-444 (_set_auto_model_loader); constants.py L1-18 (MULTIMODAL_AUTO_MODEL_MAPPING)
Signature
```python
# ModelLoader._set_auto_model_loader (selects the auto model class).
# Called internally during ModelLoader.load().
def _set_auto_model_loader(self):
    """Set the auto model class based on model type.

    If model_type is in MULTIMODAL_AUTO_MODEL_MAPPING, uses
    AutoModelForImageTextToText; otherwise uses AutoModelForCausalLM.
    """

# MULTIMODAL_AUTO_MODEL_MAPPING (model type registry)
MULTIMODAL_AUTO_MODEL_MAPPING = dict(MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES)
MULTIMODAL_AUTO_MODEL_MAPPING["lfm2-vl"] = AutoModelForImageTextToText
# Plus VoxtralForConditionalGeneration for "voxtral", when available
```
Import
```python
from axolotl.loaders.model import ModelLoader
from axolotl.loaders.constants import MULTIMODAL_AUTO_MODEL_MAPPING
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictDefault | Yes | Config with base_model pointing to a vision-language model (e.g., "meta-llama/Llama-3.2-11B-Vision") |
| tokenizer | PreTrainedTokenizerBase | Yes | Pre-loaded tokenizer |
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | Vision-language model instance (loaded via AutoModelForImageTextToText) |
| peft_config | PeftConfig or None | PEFT config if adapter was loaded |
Usage Examples
Loading a Vision-Language Model
```python
from axolotl.loaders.model import ModelLoader
from axolotl.loaders.tokenizer import load_tokenizer

# cfg.base_model = "meta-llama/Llama-3.2-11B-Vision"
# cfg.is_multimodal is auto-detected from model_type
tokenizer = load_tokenizer(cfg)
loader = ModelLoader(cfg, tokenizer)
model, peft_config = loader.load()

# Model loaded via AutoModelForImageTextToText
print(type(model))  # the architecture's vision-language class,
                    # e.g. MllamaForConditionalGeneration for Llama-3.2-Vision
# The model exposes a vision backbone; the attribute name
# (vision_tower, vision_model, ...) varies by architecture.
```
Checking Multimodal Support
```python
from axolotl.loaders.constants import MULTIMODAL_AUTO_MODEL_MAPPING

# Check whether a model type is multimodal
print("llava" in MULTIMODAL_AUTO_MODEL_MAPPING)     # True
print("llama" in MULTIMODAL_AUTO_MODEL_MAPPING)     # False (text-only)
print("qwen2_vl" in MULTIMODAL_AUTO_MODEL_MAPPING)  # True
```