Workflow:Unslothai Unsloth Vision Model Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Vision, VLMs, Fine_Tuning |
| Last Updated | 2026-02-07 09:00 GMT |
Overview
End-to-end process for fine-tuning vision-language models (VLMs) on multimodal datasets using Unsloth's optimized pipeline with LoRA adapters and optional vision RL.
Description
This workflow extends Unsloth's training capabilities to vision-language models such as Qwen2-VL, Qwen2.5-VL, Llava-Next, Pixtral, Gemma 3, and Mistral Ministral 3 (Vision). It uses a specialized FastVisionModel loader that handles multimodal architecture initialization, processor loading (instead of tokenizer-only), and vision-specific quantization. The workflow supports both supervised fine-tuning on image-text pairs and vision reinforcement learning (GRPO/GSPO) for training models to reason about visual content. The same LoRA injection and training optimization techniques from the text-only pipeline apply, with additional handling for image preprocessing and vision encoder management.
Key capabilities:
- Support for multiple VLM architectures (Qwen2-VL, Llava, Pixtral, Gemma 3 Vision, Ministral 3 Vision)
- LoRA fine-tuning of vision-language models with optimized VRAM usage
- Vision RL (GRPO/GSPO) for training visual reasoning capabilities
- Proper handling of multimodal processors (image + text)
- OCR benchmark evaluation for vision model quality validation
Usage
Execute this workflow when you have an image-text dataset (e.g., visual question answering, OCR, image captioning, document understanding) and need to adapt a vision-language model to your domain. This is appropriate when working with models that accept both image and text inputs and produce text outputs.
Execution Steps
Step 1: Multimodal Data Preparation
Prepare the training dataset with image-text pairs in conversation format. Each example contains message dicts with image references and text prompts. Images can be provided as URLs, file paths, or base64-encoded data. Apply the vision model's processor to handle both image preprocessing (resizing, normalization) and text tokenization.
Key considerations:
- Format data as multi-turn conversations with image and text content
- Use the model's native image processing through the AutoProcessor
- Handle mixed single-image and multi-image examples
- Process vision info to extract image inputs from conversation format
Step 2: Vision Model Loading
Initialize the vision-language model through Unsloth's FastVisionModel loader, which extends the standard loader with vision-specific handling. The loader configures the multimodal architecture, loads the processor (which includes both tokenizer and image processor), applies quantization to the language model backbone while preserving the vision encoder, and sets up attention optimizations.
Key considerations:
- FastVisionModel uses AutoProcessor instead of AutoTokenizer
- Vision encoder parameters are typically frozen during LoRA fine-tuning
- Quantization applies to the language model backbone, not the vision encoder
- Some VLMs require trust_remote_code=True for custom architectures
Step 3: LoRA Adapter Injection
Inject LoRA adapters into the language model portion of the VLM. The adapter targets are the same attention and feedforward projection layers as in text-only fine-tuning. The vision encoder remains frozen, and only the language model's LoRA weights are trained, along with any cross-attention layers that connect the vision and language components.
Key considerations:
- Target modules typically match text-only fine-tuning (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
- Vision encoder weights remain frozen
- Cross-attention layers may also receive LoRA adapters depending on the architecture
- Use gradient checkpointing to manage VRAM with large vision encoders
Step 4: Vision SFT Training
Execute supervised fine-tuning on the image-text dataset using TRL's SFTTrainer with Unsloth's optimizations. The training loop processes batched image-text pairs, computing loss only on the text generation tokens while handling variable-size image inputs through the processor's collation.
Key considerations:
- Batch size may need to be smaller than text-only training due to vision encoder memory
- Use DataCollatorForSeq2Seq for proper sequence padding
- Monitor training loss for convergence
- Optionally use train_on_responses_only for instruction-tuned VLMs
Step 5: Model Saving and Evaluation
Save the fine-tuned vision model and evaluate its performance on vision benchmarks. The merge process handles the multimodal architecture, preserving both the vision encoder and the merged language model. Evaluation can include OCR benchmarks (WER/CER) for document understanding models or VQA accuracy for question-answering models.
Key considerations:
- Save supports merged SafeTensors and GGUF export
- OCR benchmark evaluation validates text extraction quality
- Hub upload preserves the full multimodal architecture
- Sharded SafeTensors for large VLMs (7B+)