Workflow:Haotian liu LLaVA LoRA Finetuning

Knowledge Sources	LLaVA, LLaVA LoRA Guide, LLaVA Custom Data Guide
Domains	LLMs, Fine_Tuning, Multimodal, PEFT
Last Updated	2026-02-13 23:00 GMT

Overview

Parameter-efficient fine-tuning of a pre-trained LLaVA model using Low-Rank Adaptation (LoRA) or QLoRA for task-specific customization on limited hardware.

Description

This workflow adapts a pre-trained LLaVA checkpoint to a specific task or domain using LoRA (Low-Rank Adaptation). Instead of updating all model parameters, LoRA freezes the base weights and injects small trainable rank-decomposition matrices into the language model's attention and feedforward layers. Because only these small matrices receive gradients and optimizer state, GPU memory requirements and training time drop, and the base model's general capabilities are largely preserved.
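To make the savings concrete, the sketch below counts the trainable parameters LoRA adds to a single linear layer and compares that with the full weight matrix. The layer size of 4096 is an illustrative assumption, not a value read from any LLaVA checkpoint, and the function names are hypothetical.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    LoRA learns B (d_out x r) and A (r x d_in); the frozen weight W is untouched.
    """
    return r * (d_out + d_in)


def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in the full weight matrix of the same layer."""
    return d_out * d_in


# Illustrative sizes only (assumed, not taken from a real checkpoint).
d = 4096
r = 128

lora = lora_param_count(d, d, r)
full = full_param_count(d, d)
print(f"LoRA: {lora:,} params vs full: {full:,} ({lora / full:.2%})")
```

With these assumed sizes, the adapter holds 6.25% of the parameters of the layer it wraps. A smaller rank lowers that ratio further, at the cost of adaptation capacity.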

Two variants are supported:

LoRA: Standard low-rank adaptation with 16-bit precision. Uses DeepSpeed ZeRO-3 for distributed training.
QLoRA: Combines 4-bit NF4 quantization of the base model with LoRA adapters, which makes it feasible to train 13B+ models on a single GPU. Uses DeepSpeed ZeRO-2.

After training, LoRA weights can optionally be merged back into the base model to produce a standalone checkpoint.

Usage

Execute this workflow when:

You have a pre-trained LLaVA checkpoint and want to adapt it to a specific task or domain
You have limited GPU resources (single GPU with QLoRA, or multi-GPU with LoRA)
You have a custom dataset in the LLaVA conversation JSON format
You want to preserve the option of reverting to the base model (LoRA adapters are separate)

Execution Steps

Step 1: Dataset Preparation

Convert your custom dataset into LLaVA's expected JSON format. Each sample requires an id (unique identifier), image (relative path to the image file), and conversations (a list of human/gpt turn pairs). The first human message should include the <image> token placeholder to indicate where the visual input is inserted.

Key considerations:

Use the standard conversation format: {"from": "human", "value": "<image>\nYour question"} and {"from": "gpt", "value": "The answer"}
Image paths are relative to the image_folder argument
Multi-turn conversations are supported within a single sample
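The format described above can be sketched as a small builder plus a sanity check. The helper names (make_sample, validate_sample), the sample id, and the image path are hypothetical and exist only for illustration. The checks cover just the rules stated in this step.

```python
import json


def make_sample(sample_id, image_path, turns):
    """Build one LLaVA-style sample from a list of (question, answer) pairs."""
    conversations = []
    for i, (question, answer) in enumerate(turns):
        # Only the first human turn carries the <image> placeholder.
        prefix = "<image>\n" if i == 0 else ""
        conversations.append({"from": "human", "value": prefix + question})
        conversations.append({"from": "gpt", "value": answer})
    return {"id": sample_id, "image": image_path, "conversations": conversations}


def validate_sample(sample):
    """Return a list of problems; an empty list means the sample looks well-formed."""
    problems = []
    for key in ("id", "image", "conversations"):
        if key not in sample:
            problems.append(f"missing key: {key}")
    convs = sample.get("conversations", [])
    if len(convs) == 0 or len(convs) % 2 != 0:
        problems.append("conversations must be non-empty human/gpt pairs")
    for i, turn in enumerate(convs):
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected from={expected!r}")
    if convs and "<image>" not in convs[0].get("value", ""):
        problems.append("first human turn lacks the <image> placeholder")
    return problems


# Hypothetical two-turn sample; the image path is relative to image_folder.
sample = make_sample(
    "0001",
    "images/0001.jpg",
    [("What is shown?", "A red bicycle."), ("What color is it?", "Red.")],
)
dataset_json = json.dumps([sample], indent=2)  # the dataset file is a JSON list of samples
print(dataset_json)
```

Running validate_sample over every entry before training is a cheap way to catch missing keys or a misplaced image token early.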

Step 2: Configure Training Parameters

Select between LoRA and QLoRA based on available hardware, and configure hyperparameters. Key decisions include the LoRA rank (r) and alpha scaling factor, which together control the capacity and strength of the adaptation. A separate learning rate can be set for the multimodal projector so that it trains at a different rate than the LoRA adapters.

Key considerations:

LoRA default configuration: r=128, alpha=256, with DeepSpeed ZeRO-3
QLoRA adds --bits 4 for 4-bit NF4 quantization, with DeepSpeed ZeRO-2
Projector learning rate (mm_projector_lr) is typically set to 2e-5, ten times lower than the adapter learning rate of 2e-4
For task-specific finetuning, start from a released LLaVA checkpoint such as liuhaotian/llava-v1.5-13b
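The choices above can be sketched as a small command builder. It is an illustration, not a complete launch script. The entry-point path (llava/train/train_mem.py) and the DeepSpeed config paths are assumptions based on the LLaVA repository layout. The flag list is a subset, since the repository's own scripts pass further arguments such as the vision tower and batch sizes. The function name and the data paths are hypothetical.

```python
import shlex


def build_train_command(data_path, image_folder, output_dir,
                        use_qlora=False,
                        base_model="liuhaotian/llava-v1.5-13b"):
    """Assemble a partial DeepSpeed launch command for LoRA or QLoRA."""
    # Assumed config locations: ZeRO-3 for LoRA, ZeRO-2 for QLoRA.
    zero_config = "./scripts/zero2.json" if use_qlora else "./scripts/zero3.json"
    cmd = [
        "deepspeed", "llava/train/train_mem.py",  # assumed entry point
        "--lora_enable", "True",
        "--lora_r", "128",
        "--lora_alpha", "256",
        "--mm_projector_lr", "2e-5",
        "--learning_rate", "2e-4",
        "--deepspeed", zero_config,
        "--model_name_or_path", base_model,
        "--data_path", data_path,
        "--image_folder", image_folder,
        "--output_dir", output_dir,
    ]
    if use_qlora:
        # QLoRA differs only by 4-bit quantization of the base model.
        cmd += ["--bits", "4"]
    return cmd


# Hypothetical paths, shown for illustration only.
lora_cmd = build_train_command("data/train.json", "data/images", "checkpoints/task-lora")
qlora_cmd = build_train_command("data/train.json", "data/images", "checkpoints/task-qlora",
                                use_qlora=True)
print(shlex.join(lora_cmd))
print(shlex.join(qlora_cmd))
```

Building the command as a list keeps the LoRA and QLoRA variants in one place and makes the difference between them explicit.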

Step 3: Launch LoRA Training

Execute the training script via DeepSpeed, which loads the pre-trained LLaVA model, injects LoRA adapters into the language model layers, and trains on the custom dataset. The training pipeline uses the custom LLaVA Trainer with modality-length-grouped sampling for efficient batching of mixed image/text data.

What happens:

The base LLaVA model is loaded with its vision tower and projector
LoRA adapters are injected into the language model (attention q/k/v/o and MLP layers)
Only the LoRA parameters and the multimodal projector are trainable
Training uses lazy preprocessing for memory-efficient data loading
Output: LoRA adapter weights, non-LoRA trainable weights (projector), and model config
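The idea behind modality-length-grouped sampling can be illustrated with a simplified sketch. It is not the LLaVA Trainer's actual sampler. It assumes a convention in which image samples have positive lengths and text-only samples have negative lengths, and it omits the distributed and randomization details of a real implementation. The function name is hypothetical.

```python
import random


def modality_grouped_batches(lengths, batch_size, seed=0):
    """Group sample indices into batches that never mix modalities.

    lengths: positive for image samples, negative for text-only samples
             (assumed convention for this sketch).
    """
    rng = random.Random(seed)
    multimodal = [i for i, n in enumerate(lengths) if n > 0]
    text_only = [i for i, n in enumerate(lengths) if n < 0]

    batches = []
    for group in (multimodal, text_only):
        # Sort by absolute length so each batch holds similar-length samples,
        # which reduces padding.
        ordered = sorted(group, key=lambda i: abs(lengths[i]), reverse=True)
        batches += [ordered[k:k + batch_size]
                    for k in range(0, len(ordered), batch_size)]

    # Shuffle whole batches so the modalities are interleaved during training.
    rng.shuffle(batches)
    return batches


# Made-up token lengths: four image samples and three text-only samples.
lengths = [120, -30, 95, -45, 200, -10, 150]
for batch in modality_grouped_batches(lengths, batch_size=2):
    print(batch, [lengths[i] for i in batch])
```

Keeping each batch within one modality and one length range avoids padding short text-only samples up to the length of image-bearing ones.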

Step 4: Merge LoRA Weights (Optional)

Merge the trained LoRA adapter weights back into the base model to create a standalone checkpoint that can be loaded without the PEFT library. This step requires specifying both the LoRA checkpoint path and the original base model path.

What happens:

The base model is loaded and LoRA weights are applied via PEFT
Non-LoRA trainable weights (projector, embeddings) are loaded separately
merge_and_unload() folds the low-rank matrices back into the full-rank weights
The merged model and tokenizer are saved as a standard HuggingFace checkpoint
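The folding step can be checked on a toy example. The sketch below uses plain Python lists instead of the PEFT library, and assumes the standard LoRA scaling of alpha / r. It shows that applying the merged weight gives the same result as applying the base weight and the adapter separately. The matrices and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, b, a, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged full-rank weight."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 layer with a rank-1 adapter (values chosen for illustration).
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
B = [[1.0], [2.0]]            # 2 x r
A = [[0.5, -0.5]]             # r x 2
alpha, r = 2, 1
x = [[3.0], [1.0]]            # input as a column vector

merged = merge_lora(W, B, A, alpha, r)
y_merged = matmul(merged, x)

# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_unmerged = [[bo[0] + (alpha / r) * ao[0]] for bo, ao in zip(base_out, adapter_out)]

print(merged, y_merged, y_unmerged)
```

Because both paths agree, the merged checkpoint needs no adapter code at inference time, which is why it can be loaded without PEFT.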

Step 5: Validate Finetuned Model

Load the finetuned model (either as LoRA adapter + base or as merged checkpoint) and run test inferences to verify task-specific behavior. For LoRA models served without merging, the --model-base argument must be provided at inference time.

Key considerations:

Unmerged LoRA models require --model-base pointing to the original base model
Merged models can be loaded directly like any standard LLaVA checkpoint
Test on representative examples from the target task to verify adaptation quality
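The difference between the two loading modes can be sketched as a command builder for a quick interactive test. The module name llava.serve.cli and the --model-path and --image-file flags are assumptions based on the LLaVA repository's command-line interface. The checkpoint and image paths are hypothetical, and the function name is illustrative.

```python
import shlex


def build_cli_command(model_path, image_file, model_base=None):
    """Assemble an inference command for a merged or unmerged checkpoint."""
    cmd = [
        "python", "-m", "llava.serve.cli",  # assumed CLI entry point
        "--model-path", model_path,
        "--image-file", image_file,
    ]
    if model_base is not None:
        # Unmerged LoRA adapters need the original base model as well.
        cmd += ["--model-base", model_base]
    return cmd


# Hypothetical paths, shown for illustration only.
unmerged_cmd = build_cli_command(
    "checkpoints/task-lora", "samples/test.jpg",
    model_base="liuhaotian/llava-v1.5-13b",
)
merged_cmd = build_cli_command("checkpoints/task-merged", "samples/test.jpg")

print(shlex.join(unmerged_cmd))
print(shlex.join(merged_cmd))
```

Running the same test image and prompts through both commands is a simple way to confirm that merging did not change the model's behavior.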
