Workflow:Unslothai Unsloth Vision Model Finetuning

Knowledge Sources	Unsloth Unsloth Docs Vision Fine-tuning Vision RL
Domains	LLMs, Vision, VLMs, Fine_Tuning
Last Updated	2026-02-07 09:00 GMT

Overview

End-to-end process for fine-tuning vision-language models (VLMs) on multimodal datasets using Unsloth's optimized pipeline with LoRA adapters and optional vision RL.

Description

This workflow extends Unsloth's training capabilities to vision-language models such as Qwen2-VL, Qwen2.5-VL, Llava-Next, Pixtral, Gemma 3, and Mistral Ministral 3 (Vision). It uses a specialized FastVisionModel loader that handles multimodal architecture initialization, processor loading (instead of tokenizer-only), and vision-specific quantization. The workflow supports both supervised fine-tuning on image-text pairs and vision reinforcement learning (GRPO/GSPO) for training models to reason about visual content. The same LoRA injection and training optimization techniques from the text-only pipeline apply, with additional handling for image preprocessing and vision encoder management.

Key capabilities:

Support for multiple VLM architectures (Qwen2-VL, Llava, Pixtral, Gemma 3 Vision, Ministral 3 Vision)
LoRA fine-tuning of vision-language models with optimized VRAM usage
Vision RL (GRPO/GSPO) for training visual reasoning capabilities
Proper handling of multimodal processors (image + text)
OCR benchmark evaluation for vision model quality validation

Usage

Execute this workflow when you have an image-text dataset (e.g., visual question answering, OCR, image captioning, document understanding) and need to adapt a vision-language model to your domain. This is appropriate when working with models that accept both image and text inputs and produce text outputs.

Execution Steps

Step 1: Multimodal Data Preparation

Prepare the training dataset with image-text pairs in conversation format. Each example contains message dicts with image references and text prompts. Images can be provided as URLs, file paths, or base64-encoded data. Apply the vision model's processor to handle both image preprocessing (resizing, normalization) and text tokenization.

Key considerations:

Format data as multi-turn conversations with image and text content
Use the model's native image processing through the AutoProcessor
Handle mixed single-image and multi-image examples
Process vision info to extract image inputs from conversation format

Step 2: Vision Model Loading

Initialize the vision-language model through Unsloth's FastVisionModel loader, which extends the standard loader with vision-specific handling. The loader configures the multimodal architecture, loads the processor (which includes both tokenizer and image processor), applies quantization to the language model backbone while preserving the vision encoder, and sets up attention optimizations.

Key considerations:

FastVisionModel uses AutoProcessor instead of AutoTokenizer
Vision encoder parameters are typically frozen during LoRA fine-tuning
Quantization applies to the language model backbone, not the vision encoder
Some VLMs require trust_remote_code=True for custom architectures

Step 3: LoRA Adapter Injection

Inject LoRA adapters into the language model portion of the VLM. The adapter targets are the same attention and feedforward projection layers as in text-only fine-tuning. The vision encoder remains frozen, and only the language model's LoRA weights are trained, along with any cross-attention layers that connect the vision and language components.

Key considerations:

Target modules typically match text-only fine-tuning (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
Vision encoder weights remain frozen
Cross-attention layers may also receive LoRA adapters depending on the architecture
Use gradient checkpointing to manage VRAM with large vision encoders

Step 4: Vision SFT Training

Execute supervised fine-tuning on the image-text dataset using TRL's SFTTrainer with Unsloth's optimizations. The training loop processes batched image-text pairs, computing loss only on the text generation tokens while handling variable-size image inputs through the processor's collation.

Key considerations:

Batch size may need to be smaller than text-only training due to vision encoder memory
Use DataCollatorForSeq2Seq for proper sequence padding
Monitor training loss for convergence
Optionally use train_on_responses_only for instruction-tuned VLMs

Step 5: Model Saving and Evaluation

Save the fine-tuned vision model and evaluate its performance on vision benchmarks. The merge process handles the multimodal architecture, preserving both the vision encoder and the merged language model. Evaluation can include OCR benchmarks (WER/CER) for document understanding models or VQA accuracy for question-answering models.

Key considerations:

Save supports merged SafeTensors and GGUF export
OCR benchmark evaluation validates text extraction quality
Hub upload preserves the full multimodal architecture
Sharded SafeTensors for large VLMs (7B+)

Execution Diagram

GitHub URL

Workflow Repository