Workflow: Axolotl Multimodal Vision Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Multimodal, Vision_Language |
| Last Updated | 2026-02-06 22:00 GMT |
Overview
End-to-end process for fine-tuning vision-language models (VLMs) on image-text instruction data using Axolotl's multimodal training pipeline with LoRA adapters.
Description
This workflow covers fine-tuning of multimodal models that process both images and text, such as Llama 3.2 Vision, Qwen2-VL, Pixtral, InternVL, SmolVLM2, and LLaVA. The process extends the standard LoRA fine-tuning workflow with multimodal-specific steps: loading a processor (which handles both image preprocessing and text tokenization), configuring vision-specific chat templates, handling image data within conversation datasets, and targeting LoRA adapters to appropriate model layers (language model, cross-attention, or vision encoder). Axolotl also supports audio modalities (Voxtral, Gemma-3n) through the same pipeline.
Usage
Execute this workflow when you have a dataset containing image-text (or audio-text) instruction pairs and need to adapt a pre-trained vision-language model to understand domain-specific visual content, follow visual instructions more accurately, or learn new visual question-answering capabilities. Typical use cases include document understanding, medical image analysis, visual inspection, and custom visual assistants.
Execution Steps
Step 1: Configuration for Multimodal Training
Create a YAML configuration file specifying the vision-language base model, multimodal dataset, processor type, and vision-specific settings. The configuration must include `processor_type: AutoProcessor` to handle combined image and text preprocessing, a vision-appropriate chat template, and dataset settings for multimodal content.
Key considerations:
- Set `processor_type: AutoProcessor` for multimodal models
- Use the model-specific chat template (e.g., `llama3_2_vision`, `qwen2_vl`)
- Set `skip_prepare_dataset: true` for vision datasets (on-the-fly processing)
- Set `remove_unused_columns: false` to preserve image data columns
- Disable `sample_packing`, as it is not compatible with multimodal inputs
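Taken together, these settings yield a config along the lines of the following minimal sketch (the base model and dataset path are illustrative; substitute your own):

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct
processor_type: AutoProcessor
chat_template: qwen2_vl

# Multimodal datasets are processed on the fly rather than pre-tokenized
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft  # illustrative image-text dataset
    type: chat_template
    split: train
```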
Step 2: Dataset and Processor Loading
Load the multimodal dataset and initialize the processor. The processor combines the vision encoder's image preprocessor with the language model's tokenizer. Datasets include image URLs, file paths, or in-memory images alongside text conversations; the processor handles image resizing, normalization, and pixel-value extraction in addition to text tokenization.
Key considerations:
- Datasets must contain image references (URLs, file paths, or PIL images)
- The processor handles both image preprocessing and text tokenization
- Vision-chat datasets follow conversational format with embedded image tokens
- On-the-fly preprocessing is used instead of pre-tokenization
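For reference, a single conversation record in the common multimodal chat format looks roughly like this (field names follow the Hugging Face chat convention; exact schemas vary by dataset):

```yaml
messages:
  - role: user
    content:
      - type: image
        image: https://example.com/sample-invoice.png  # URL, file path, or PIL image
      - type: text
        text: What is the total amount on this invoice?
  - role: assistant
    content:
      - type: text
        text: The invoice total is $1,240.50.
```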
Step 3: Vision Language Model Loading
Load the pre-trained vision-language model with the appropriate model class. Axolotl detects multimodal architectures and uses `AutoModelForImageTextToText` or the model-specific loader. The model consists of a vision encoder, a projection layer, and a language model backbone.
Key considerations:
- Model loading uses multimodal-specific auto classes
- Vision encoder weights can be frozen or trainable depending on configuration
- Use SDPA attention (`sdp_attention: true`) for vision models (Flash Attention 2 may not support cross-attention)
- Memory requirements are higher due to the vision encoder component
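The corresponding config fragment (a sketch; dtype and quantization choices depend on your hardware):

```yaml
sdp_attention: true    # SDPA; Flash Attention 2 may not support cross-attention
flash_attention: false
bf16: true             # mixed precision offsets some of the vision encoder's memory overhead
```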
Step 4: Multimodal LoRA Adapter Configuration
Configure and inject LoRA adapters targeting the appropriate layers of the vision-language model. For VLMs, adapters are typically applied to the language model's attention and MLP layers, and optionally to the cross-attention layers that bridge vision and language. A regex pattern can be used to precisely target specific layer types.
Key considerations:
- Use regex patterns in `lora_target_modules` for precise layer targeting
- Target language model layers: `self_attn` and `mlp` projections
- Optionally target cross-attention layers: `cross_attn` projections
- Vision encoder layers are typically excluded from LoRA to preserve visual features
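A LoRA section along these lines; the regex is a sketch based on the Llama 3.2 Vision module naming and must be adapted to your architecture's actual module names:

```yaml
adapter: lora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05

# Match self-attention, cross-attention, and MLP projections in the language
# model only; vision tower modules deliberately fall outside this pattern.
lora_target_modules: 'language_model.model.layers.[\d]+.(self_attn|cross_attn|mlp).(q|k|v|o|gate|up|down)_proj'
```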
Step 5: Multimodal Training Execution
Execute the training loop with multimodal batch handling. The trainer processes batches containing both pixel values (image tensors) and input IDs (text tokens), computing the language modeling loss only on text generation tokens (not image tokens). The vision encoder processes images into feature vectors that are projected into the language model's embedding space.
Key considerations:
- Batch size is typically smaller due to image tensor memory requirements
- Gradient checkpointing is essential for memory management
- Image resolution affects both memory and training quality
- Loss is computed only on text generation tokens, not image placeholder tokens
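Training settings that typically need adjusting for multimodal runs (the values shown are illustrative starting points, not recommendations from the Axolotl docs):

```yaml
micro_batch_size: 1              # image tensors dominate memory; keep per-device batches small
gradient_accumulation_steps: 8   # recover effective batch size through accumulation
gradient_checkpointing: true     # essential for fitting the vision encoder in memory
learning_rate: 2e-5
num_epochs: 1
```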
Step 6: Model Saving
Save the trained adapter weights, tokenizer, and processor configuration. For multimodal models, the processor configuration is also saved to ensure correct image preprocessing during inference. The adapter can optionally be merged into the base model.
Key considerations:
- Processor configuration is saved alongside the model and tokenizer
- Adapter weights remain small even for multimodal models
- Verify the processor configuration is correct for deployment
- Test inference with sample images before deployment
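The output-related settings are minimal; merging is a separate post-training step (Axolotl provides a `merge-lora` CLI subcommand, though the exact flags below are assumptions to verify against the CLI docs):

```yaml
output_dir: ./outputs/vlm-lora   # adapter weights, tokenizer, and processor config land here
save_safetensors: true

# Optional merge into the base model after training, e.g.:
#   axolotl merge-lora config.yaml --lora-model-dir=./outputs/vlm-lora
```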