Implementation: Axolotl ModelLoader Load Multimodal (axolotl-ai-cloud/axolotl)
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Vision_Language, Model_Loading |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A concrete mechanism in the Axolotl framework for loading vision-language models, with the model class selected automatically from the model type.
Description
This implementation reuses the ModelLoader class but with multimodal-aware model class selection. The _set_auto_model_loader method checks if the model's type is present in MULTIMODAL_AUTO_MODEL_MAPPING (defined in constants.py). If found, it uses AutoModelForImageTextToText instead of the default AutoModelForCausalLM. The mapping includes all standard HuggingFace vision-language model types plus custom entries for newer models like "lfm2-vl" and "voxtral".
The MULTIMODAL_AUTO_MODEL_MAPPING is a dictionary built from HuggingFace's MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES with additional entries for models not yet in the official mapping.
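The construction pattern can be sketched with plain dictionaries. This is an illustrative stand-in only: the real constants.py uses transformers' MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES and the AutoModelForImageTextToText class, while the values below are plain strings so the sketch is self-contained.

```python
# Stand-in for transformers' MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
# (a small subset, values shown as strings for illustration).
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = {
    "llava": "LlavaForConditionalGeneration",
    "qwen2_vl": "Qwen2VLForConditionalGeneration",
}

# Start from the upstream mapping, then add entries for model types
# that transformers does not yet list, such as "lfm2-vl".
MULTIMODAL_AUTO_MODEL_MAPPING = dict(MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES)
MULTIMODAL_AUTO_MODEL_MAPPING["lfm2-vl"] = "AutoModelForImageTextToText"

print("lfm2-vl" in MULTIMODAL_AUTO_MODEL_MAPPING)  # True
```

Copying the upstream mapping and layering custom entries on top keeps the registry in sync with transformers while still supporting models ahead of the official release cycle.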
Usage
Used automatically when ModelLoader detects a multimodal model. No separate invocation needed; the multimodal path is selected based on model configuration.
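The selection step can be sketched as follows. This is a simplified stand-in for _set_auto_model_loader, not the actual implementation: class names are represented as strings so the sketch runs without transformers installed.

```python
# Illustrative registry of multimodal model types (strings stand in for
# the real transformers classes).
MULTIMODAL_AUTO_MODEL_MAPPING = {
    "llava": "LlavaForConditionalGeneration",
    "qwen2_vl": "Qwen2VLForConditionalGeneration",
}

def select_auto_model_class(model_type: str) -> str:
    # Multimodal model types get the image-text-to-text auto class;
    # everything else falls back to the causal-LM auto class.
    if model_type in MULTIMODAL_AUTO_MODEL_MAPPING:
        return "AutoModelForImageTextToText"
    return "AutoModelForCausalLM"

print(select_auto_model_class("llava"))  # AutoModelForImageTextToText
print(select_auto_model_class("llama"))  # AutoModelForCausalLM
```

Keying the dispatch on model_type (read from the model config) is what lets the multimodal path activate without any user-facing flag.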
Code Reference
Source Location
- Repository: axolotl
- Files: src/axolotl/loaders/model.py, src/axolotl/loaders/constants.py
- Lines: model.py L162-191 (load), L434-444 (_set_auto_model_loader); constants.py L1-18 (MULTIMODAL_AUTO_MODEL_MAPPING)
Signature
```python
# ModelLoader._set_auto_model_loader (selects the auto model class).
# Called internally during ModelLoader.load().
def _set_auto_model_loader(self):
    """Set the auto model class based on model type.

    If model_type is in MULTIMODAL_AUTO_MODEL_MAPPING, uses
    AutoModelForImageTextToText; otherwise uses AutoModelForCausalLM.
    """

# MULTIMODAL_AUTO_MODEL_MAPPING (model type registry)
MULTIMODAL_AUTO_MODEL_MAPPING = dict(MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES)
MULTIMODAL_AUTO_MODEL_MAPPING["lfm2-vl"] = AutoModelForImageTextToText
# Plus VoxtralForConditionalGeneration for "voxtral", when available
```
Import
```python
from axolotl.loaders.model import ModelLoader
from axolotl.loaders.constants import MULTIMODAL_AUTO_MODEL_MAPPING
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictDefault | Yes | Config with base_model pointing to a vision-language model (e.g., "meta-llama/Llama-3.2-11B-Vision") |
| tokenizer | PreTrainedTokenizerBase | Yes | Pre-loaded tokenizer |
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | Vision-language model instance (loaded via AutoModelForImageTextToText) |
| peft_config | PeftConfig or None | PEFT config if adapter was loaded |
Usage Examples
Loading a Vision-Language Model
```python
from axolotl.loaders.model import ModelLoader
from axolotl.loaders.tokenizer import load_tokenizer

# cfg.base_model = "meta-llama/Llama-3.2-11B-Vision"
# cfg.is_multimodal is auto-detected from model_type
tokenizer = load_tokenizer(cfg)
loader = ModelLoader(cfg, tokenizer)
model, peft_config = loader.load()

# Model loaded via AutoModelForImageTextToText
print(type(model))  # the architecture's vision-language class,
                    # e.g. MllamaForConditionalGeneration for Llama-3.2-Vision
# The model exposes a vision backbone; the attribute name
# (vision_tower, vision_model, ...) varies by architecture.
```
Checking Multimodal Support
```python
from axolotl.loaders.constants import MULTIMODAL_AUTO_MODEL_MAPPING

# Check whether a model type is multimodal
print("llava" in MULTIMODAL_AUTO_MODEL_MAPPING)     # True
print("llama" in MULTIMODAL_AUTO_MODEL_MAPPING)     # False (text-only)
print("qwen2_vl" in MULTIMODAL_AUTO_MODEL_MAPPING)  # True
```