Workflow:OpenGVLab InternVL LoRA Finetuning

Knowledge Sources	InternVL InternVL Chat README PEFT LoRA
Domains	VLMs, Fine_Tuning, PEFT, Multimodal
Last Updated	2026-02-07 14:00 GMT

Overview

End-to-end process for parameter-efficient fine-tuning of InternVL models using Low-Rank Adaptation (LoRA) on custom multimodal datasets.

Description

This workflow enables domain-specific adaptation of InternVL models with minimal GPU requirements by injecting trainable low-rank adapter matrices into the language model while freezing all other parameters. Only the LoRA adapter weights are updated during training, reducing memory consumption and training time by orders of magnitude compared to full fine-tuning. After training, the LoRA adapters can be merged back into the base model to produce a standalone checkpoint with no inference overhead.

Usage

Execute this workflow when you need to adapt an InternVL model to domain-specific data but have limited GPU resources (as few as 2 GPUs with 80GB VRAM). This is the recommended starting point for custom data adaptation. LoRA training is faster, requires less memory, and produces smaller checkpoint files. Use full fine-tuning instead only when LoRA quality is insufficient for your task.

Execution Steps

Step 1: Prepare Training Data

Format your custom dataset into JSONL files following the InternVL conversation schema. The data format is identical to full fine-tuning: each line contains a JSON object with optional image/video paths and a conversations array with alternating human and gpt turns. Create a JSON meta-file referencing the JSONL shards with sampling weights.

Key considerations:

The data format is identical to the full fine-tuning workflow
Each conversation entry needs from (human/gpt) and value fields
Multi-image and video inputs are supported
Use the provided json2jsonl and jsonl2jsonl tools to convert and clean existing data

Step 2: Configure LoRA Parameters

Select the LoRA rank and target modules. The default configuration applies LoRA with rank 16 to the LLM component only. Optionally, LoRA can also be applied to the vision backbone. All original model parameters remain frozen, and only the small adapter matrices are trained.

Key considerations:

Default LoRA rank is 16 (controls adapter size and expressiveness)
LoRA is applied to the LLM by default; the vision backbone and MLP remain frozen
Optional backbone LoRA can be enabled for adapting visual features
Drop path rate is set to 0.0 (no stochastic depth) since most parameters are frozen
Requires only 2 GPUs compared to 8 for full fine-tuning

Step 3: Load Model and Inject Adapters

Load the pre-trained InternVL model and apply LoRA adapters using the PEFT library. The model components are configured as follows: vision backbone frozen, MLP projector frozen, LLM frozen but with LoRA adapters injected. The adapter weights are the only trainable parameters.

Key considerations:

The PEFT library handles LoRA injection automatically based on the configuration
Trainable parameters are typically less than 1% of total model parameters
DeepSpeed ZeRO Stage 1 is sufficient since memory requirements are much lower
The base model weights are kept in the original precision (bfloat16)

Step 4: Train LoRA Adapters

Launch distributed training with the same HuggingFace Trainer infrastructure used for full fine-tuning. The training loop only updates the LoRA adapter weights. Loss computation, data loading, and gradient handling work identically to the full fine-tuning path.

Key considerations:

Training is significantly faster due to fewer trainable parameters
Per-device batch size of 4 with gradient accumulation of 2 on 2 GPUs (total batch 16)
Learning rate of 4e-5 with cosine scheduler
Checkpoint saves include only the adapter weights (small files)

Step 5: Merge LoRA Adapters

After training, merge the LoRA adapter weights back into the base model to produce a standalone checkpoint. The merge tool loads the model with adapters and calls the PEFT merge_and_unload method, producing a full model that requires no adapter overhead at inference time.

Key considerations:

Merging is performed using the provided merge_lora.py tool
Both LLM LoRA and optional backbone LoRA can be merged in a single operation
The merged model is saved with the tokenizer in HuggingFace format
After merging, the model behaves identically to a fully fine-tuned model

Step 6: Validate Merged Model

Load the merged model and verify it produces correct outputs. The merged model can be used with the standard InternVLChatModel inference API without any adapter-specific code.

Key considerations:

Test on representative samples from your domain
Compare outputs against the base model to verify adaptation
The merged model is compatible with all InternVL inference and evaluation tools

Execution Diagram

GitHub URL

Workflow Repository