Principle: OpenGVLab InternVL Vision Encoder LoRA
| Knowledge Sources | |
|---|---|
| Domains | Parameter_Efficient_Finetuning, Computer_Vision |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Application of Low-Rank Adaptation (LoRA) to the vision encoder (InternViT) component of a vision-language model, enabling parameter-efficient fine-tuning of visual representations.
Description
While LoRA is most commonly applied to the language model, InternVL also supports applying LoRA to the vision encoder (InternViT). This is useful when the vision encoder needs task-specific adaptation (e.g., for medical imaging or remote sensing) but full fine-tuning is too expensive.
Vision encoder LoRA targets the attention and MLP layers of InternViT:
- Attention: attn.qkv, attn.proj
- MLP: mlp.fc1, mlp.fc2
This is controlled by the use_backbone_lora argument in ModelArguments. When the argument is set to a positive integer (interpreted as the LoRA rank), low-rank adapters are injected into the matching vision encoder layers; a value of 0 leaves the encoder untouched.
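The selection logic can be sketched as follows. The target suffixes and the use_backbone_lora gating come from the description above; the flat list of module names is a simplified stand-in for walking a real model's module tree, not InternVL's actual implementation:

```python
# Simplified sketch of how use_backbone_lora could gate adapter injection.
# Target suffixes come from the section above; the flat name list below is a
# stand-in for iterating over a real nn.Module tree with named_modules().

VIT_TARGET_SUFFIXES = ("attn.qkv", "attn.proj", "mlp.fc1", "mlp.fc2")

def select_lora_targets(module_names, use_backbone_lora):
    """Return module names that would receive LoRA adapters of rank
    `use_backbone_lora`; a value of 0 disables vision-encoder LoRA."""
    if use_backbone_lora <= 0:
        return []
    return [name for name in module_names
            if name.endswith(VIT_TARGET_SUFFIXES)]

names = [
    "vision_model.encoder.layers.0.attn.qkv",
    "vision_model.encoder.layers.0.attn.proj",
    "vision_model.encoder.layers.0.mlp.fc1",
    "vision_model.encoder.layers.0.mlp.fc2",
    "vision_model.encoder.layers.0.norm1",   # normalization layer: not a target
]
print(select_lora_targets(names, use_backbone_lora=16))
```

In the real model the matched modules would then be wrapped with rank-16 LoRA adapters while their base weights stay frozen.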
Usage
Use vision encoder LoRA when you need to adapt the visual representation to a specific domain while keeping most parameters frozen. It is less common than LLM LoRA and is typically used alongside it for domain-specific adaptation.
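A launch fragment for the combined setup might look like the following. Only use_backbone_lora is documented above; the script path, model name, and the use_llm_lora flag are illustrative assumptions:

```shell
# Illustrative fine-tuning launch (script path, model name, and
# --use_llm_lora are assumptions; --use_backbone_lora is the documented
# ModelArguments field, here enabling rank-16 adapters in InternViT).
torchrun --nproc_per_node=8 internvl/train/internvl_chat_finetune.py \
  --model_name_or_path OpenGVLab/InternVL2-8B \
  --use_backbone_lora 16 \
  --use_llm_lora 16
```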
Theoretical Basis
The same low-rank formulation as LLM LoRA applies, W' = W + BA with trainable low-rank factors B and A, but to vision transformer layers instead of language model layers.
Target modules for InternViT:
- attn.qkv: Combined query/key/value projection
- attn.proj: Output projection
- mlp.fc1: First MLP layer
- mlp.fc2: Second MLP layer
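To make the parameter savings concrete, here is a back-of-the-envelope count for a single combined qkv projection. The hidden size of 1024 is an illustrative assumption, not a quoted InternViT configuration:

```python
# Rough parameter count for LoRA on a combined qkv projection (d -> 3d).
# The hidden size d is an assumed illustrative value, not InternViT's config.
d, r = 1024, 16              # hidden size, LoRA rank
full = d * (3 * d)           # frozen qkv weight matrix: d x 3d
lora = r * d + (3 * d) * r   # factor A: r x d, factor B: 3d x r
print(full, lora, round(100 * lora / full, 2))  # trainable share in percent
```

At rank 16 the adapter trains roughly 2% of the parameters of the frozen qkv weight it modifies, which is why backbone LoRA stays cheap even when applied to every listed module in every encoder layer.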