Workflow:OpenGVLab InternVL LoRA Finetuning
| Knowledge Sources | |
|---|---|
| Domains | VLMs, Fine_Tuning, PEFT, Multimodal |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
End-to-end process for parameter-efficient fine-tuning of InternVL models using Low-Rank Adaptation (LoRA) on custom multimodal datasets.
Description
This workflow enables domain-specific adaptation of InternVL models with minimal GPU requirements by injecting trainable low-rank adapter matrices into the language model while freezing all other parameters. Only the LoRA adapter weights are updated during training, reducing memory consumption and training time by orders of magnitude compared to full fine-tuning. After training, the LoRA adapters can be merged back into the base model to produce a standalone checkpoint with no inference overhead.
Usage
Execute this workflow when you need to adapt an InternVL model to domain-specific data but have limited GPU resources (as few as 2 GPUs with 80GB VRAM). This is the recommended starting point for custom data adaptation. LoRA training is faster, requires less memory, and produces smaller checkpoint files. Use full fine-tuning instead only when LoRA quality is insufficient for your task.
Execution Steps
Step 1: Prepare Training Data
Format your custom dataset into JSONL files following the InternVL conversation schema. The data format is identical to full fine-tuning: each line contains a JSON object with optional image/video paths and a conversations array with alternating human and gpt turns. Create a JSON meta-file referencing the JSONL shards with sampling weights.
Key considerations:
- The data format is identical to the full fine-tuning workflow
- Each conversation entry needs from (human/gpt) and value fields
- Multi-image and video inputs are supported
- Use the provided json2jsonl and jsonl2jsonl tools to convert and clean existing data
Step 2: Configure LoRA Parameters
Select the LoRA rank and target modules. The default configuration applies LoRA with rank 16 to the LLM component only. Optionally, LoRA can also be applied to the vision backbone. All original model parameters remain frozen, and only the small adapter matrices are trained.
Key considerations:
- Default LoRA rank is 16 (controls adapter size and expressiveness)
- LoRA is applied to the LLM by default; the vision backbone and MLP remain frozen
- Optional backbone LoRA can be enabled for adapting visual features
- Drop path rate is set to 0.0 (no stochastic depth) since most parameters are frozen
- Requires only 2 GPUs compared to 8 for full fine-tuning
Step 3: Load Model and Inject Adapters
Load the pre-trained InternVL model and apply LoRA adapters using the PEFT library. The model components are configured as follows: vision backbone frozen, MLP projector frozen, LLM frozen but with LoRA adapters injected. The adapter weights are the only trainable parameters.
Key considerations:
- The PEFT library handles LoRA injection automatically based on the configuration
- Trainable parameters are typically less than 1% of total model parameters
- DeepSpeed ZeRO Stage 1 is sufficient since memory requirements are much lower
- The base model weights are kept in the original precision (bfloat16)
Step 4: Train LoRA Adapters
Launch distributed training with the same HuggingFace Trainer infrastructure used for full fine-tuning. The training loop only updates the LoRA adapter weights. Loss computation, data loading, and gradient handling work identically to the full fine-tuning path.
Key considerations:
- Training is significantly faster due to fewer trainable parameters
- Per-device batch size of 4 with gradient accumulation of 2 on 2 GPUs (total batch 16)
- Learning rate of 4e-5 with cosine scheduler
- Checkpoint saves include only the adapter weights (small files)
Step 5: Merge LoRA Adapters
After training, merge the LoRA adapter weights back into the base model to produce a standalone checkpoint. The merge tool loads the model with adapters and calls the PEFT merge_and_unload method, producing a full model that requires no adapter overhead at inference time.
Key considerations:
- Merging is performed using the provided merge_lora.py tool
- Both LLM LoRA and optional backbone LoRA can be merged in a single operation
- The merged model is saved with the tokenizer in HuggingFace format
- After merging, the model behaves identically to a fully fine-tuned model
Step 6: Validate Merged Model
Load the merged model and verify it produces correct outputs. The merged model can be used with the standard InternVLChatModel inference API without any adapter-specific code.
Key considerations:
- Test on representative samples from your domain
- Compare outputs against the base model to verify adaptation
- The merged model is compatible with all InternVL inference and evaluation tools