Workflow: Axolotl Multimodal Vision Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Multimodal, Vision_Language |
| Last Updated | 2026-02-06 22:00 GMT |
Overview
End-to-end process for fine-tuning vision-language models (VLMs) on image-text instruction data using Axolotl's multimodal training pipeline with LoRA adapters.
Description
This workflow covers fine-tuning of multimodal models that process both images and text, such as Llama 3.2 Vision, Qwen2-VL, Pixtral, InternVL, SmolVLM2, and LLaVA. The process extends the standard LoRA fine-tuning workflow with multimodal-specific steps: loading a processor (which handles both image preprocessing and text tokenization), configuring vision-specific chat templates, handling image data within conversation datasets, and targeting LoRA adapters to appropriate model layers (language model, cross-attention, or vision encoder). Axolotl also supports audio modalities (Voxtral, Gemma-3n) through the same pipeline.
Usage
Execute this workflow when you have a dataset containing image-text (or audio-text) instruction pairs and need to adapt a pre-trained vision-language model to understand domain-specific visual content, follow visual instructions more accurately, or learn new visual question-answering capabilities. Typical use cases include document understanding, medical image analysis, visual inspection, and custom visual assistants.
Execution Steps
Step 1: Configuration for Multimodal Training
Create a YAML configuration file specifying the vision-language base model, multimodal dataset, processor type, and vision-specific settings. The configuration must include `processor_type: AutoProcessor` to handle combined image and text preprocessing, a vision-appropriate chat template, and dataset settings for multimodal content.
Key considerations:
- Set `processor_type: AutoProcessor` for multimodal models
- Use the model-specific chat template (e.g., `llama3_2_vision`, `qwen2_vl`)
- Set `skip_prepare_dataset: true` for vision datasets (on-the-fly processing)
- Set `remove_unused_columns: false` to preserve image data columns
- Disable `sample_packing`, as it is not compatible with multimodal inputs
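Taken together, these settings yield a config along the lines of the following minimal sketch (the base model and dataset path are illustrative; substitute your own):

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct
processor_type: AutoProcessor
chat_template: qwen2_vl

# Multimodal datasets are processed on the fly rather than pre-tokenized
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft  # illustrative image-text dataset
    type: chat_template
    split: train
```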
Step 2: Dataset and Processor Loading
Load the multimodal dataset and initialize the processor. The processor combines the vision encoder's image preprocessor with the language model's tokenizer. Datasets include image URLs, file paths, or in-memory images alongside text conversations; the processor handles image resizing, normalization, and pixel-value extraction in addition to text tokenization.
Key considerations:
- Datasets must contain image references (URLs, file paths, or PIL images)
- The processor handles both image preprocessing and text tokenization
- Vision-chat datasets follow conversational format with embedded image tokens
- On-the-fly preprocessing is used instead of pre-tokenization
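For reference, a single conversation record in the common multimodal chat format looks roughly like this (field names follow the Hugging Face chat convention; exact schemas vary by dataset):

```yaml
messages:
  - role: user
    content:
      - type: image
        image: https://example.com/sample-invoice.png  # URL, file path, or PIL image
      - type: text
        text: What is the total amount on this invoice?
  - role: assistant
    content:
      - type: text
        text: The invoice total is $1,240.50.
```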
Step 3: Vision Language Model Loading
Load the pre-trained vision-language model with the appropriate model class. Axolotl detects multimodal architectures and uses `AutoModelForImageTextToText` or the model-specific loader. The model consists of a vision encoder, a projection layer, and a language model backbone.
Key considerations:
- Model loading uses multimodal-specific auto classes
- Vision encoder weights can be frozen or trainable depending on configuration
- Use SDPA attention (`sdp_attention: true`) for vision models (Flash Attention 2 may not support cross-attention)
- Memory requirements are higher due to the vision encoder component
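The corresponding config fragment (a sketch; dtype and quantization choices depend on your hardware):

```yaml
sdp_attention: true    # SDPA; Flash Attention 2 may not support cross-attention
flash_attention: false
bf16: true             # mixed precision offsets some of the vision encoder's memory overhead
```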
Step 4: Multimodal LoRA Adapter Configuration
Configure and inject LoRA adapters targeting the appropriate layers of the vision-language model. For VLMs, adapters are typically applied to the language model's attention and MLP layers, and optionally to the cross-attention layers that bridge vision and language. A regex pattern can be used to precisely target specific layer types.
Key considerations:
- Use regex patterns in `lora_target_modules` for precise layer targeting
- Target language model layers: `self_attn` and `mlp` projections
- Optionally target cross-attention layers: `cross_attn` projections
- Vision encoder layers are typically excluded from LoRA to preserve visual features
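A LoRA section along these lines; the regex is a sketch based on the Llama 3.2 Vision module naming and must be adapted to your architecture's actual module names:

```yaml
adapter: lora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05

# Match self-attention, cross-attention, and MLP projections in the language
# model only; vision tower modules deliberately fall outside this pattern.
lora_target_modules: 'language_model.model.layers.[\d]+.(self_attn|cross_attn|mlp).(q|k|v|o|gate|up|down)_proj'
```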
Step 5: Multimodal Training Execution
Execute the training loop with multimodal batch handling. The trainer processes batches containing both pixel values (image tensors) and input IDs (text tokens), computing the language modeling loss only on text generation tokens (not image tokens). The vision encoder processes images into feature vectors that are projected into the language model's embedding space.
Key considerations:
- Batch size is typically smaller due to image tensor memory requirements
- Gradient checkpointing is essential for memory management
- Image resolution affects both memory and training quality
- Loss is computed only on text generation tokens, not image placeholder tokens
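Training settings that typically need adjusting for multimodal runs (the values shown are illustrative starting points, not recommendations from the Axolotl docs):

```yaml
micro_batch_size: 1              # image tensors dominate memory; keep per-device batches small
gradient_accumulation_steps: 8   # recover effective batch size through accumulation
gradient_checkpointing: true     # essential for fitting the vision encoder in memory
learning_rate: 2e-5
num_epochs: 1
```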
Step 6: Model Saving
Save the trained adapter weights, tokenizer, and processor configuration. For multimodal models, the processor configuration is also saved to ensure correct image preprocessing during inference. The adapter can optionally be merged into the base model.
Key considerations:
- Processor configuration is saved alongside the model and tokenizer
- Adapter weights remain small even for multimodal models
- Verify the processor configuration is correct for deployment
- Test inference with sample images before deployment
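The output-related settings are minimal; merging is a separate post-training step (Axolotl provides a `merge-lora` CLI subcommand, though the exact flags below are assumptions to verify against the CLI docs):

```yaml
output_dir: ./outputs/vlm-lora   # adapter weights, tokenizer, and processor config land here
save_safetensors: true

# Optional merge into the base model after training, e.g.:
#   axolotl merge-lora config.yaml --lora-model-dir=./outputs/vlm-lora
```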