Principle: OpenGVLab InternVL Vision Encoder LoRA
| Knowledge Sources | |
|---|---|
| Domains | Parameter_Efficient_Finetuning, Computer_Vision |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Application of Low-Rank Adaptation (LoRA) to the vision encoder (InternViT) component of a vision-language model, enabling parameter-efficient fine-tuning of visual representations.
Description
While LoRA is most commonly applied to the language model, InternVL also supports applying LoRA to the vision encoder (InternViT). This is useful when the vision encoder needs task-specific adaptation (e.g., for medical imaging or remote sensing) but full fine-tuning is too expensive.
Vision encoder LoRA targets the attention and MLP layers of InternViT:
- Attention: attn.qkv, attn.proj
- MLP: mlp.fc1, mlp.fc2
This is controlled by the use_backbone_lora argument in ModelArguments. When the argument is set to a positive integer (interpreted as the LoRA rank), low-rank adapters are injected into the matching vision encoder layers; a value of 0 leaves the encoder untouched.
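The selection logic can be sketched as follows. The target suffixes and the use_backbone_lora gating come from the description above; the flat list of module names is a simplified stand-in for walking a real model's module tree, not InternVL's actual implementation:

```python
# Simplified sketch of how use_backbone_lora could gate adapter injection.
# Target suffixes come from the section above; the flat name list below is a
# stand-in for iterating over a real nn.Module tree with named_modules().

VIT_TARGET_SUFFIXES = ("attn.qkv", "attn.proj", "mlp.fc1", "mlp.fc2")

def select_lora_targets(module_names, use_backbone_lora):
    """Return module names that would receive LoRA adapters of rank
    `use_backbone_lora`; a value of 0 disables vision-encoder LoRA."""
    if use_backbone_lora <= 0:
        return []
    return [name for name in module_names
            if name.endswith(VIT_TARGET_SUFFIXES)]

names = [
    "vision_model.encoder.layers.0.attn.qkv",
    "vision_model.encoder.layers.0.attn.proj",
    "vision_model.encoder.layers.0.mlp.fc1",
    "vision_model.encoder.layers.0.mlp.fc2",
    "vision_model.encoder.layers.0.norm1",   # normalization layer: not a target
]
print(select_lora_targets(names, use_backbone_lora=16))
```

In the real model the matched modules would then be wrapped with rank-16 LoRA adapters while their base weights stay frozen.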
Usage
Use vision encoder LoRA when you need to adapt the visual representation to a specific domain while keeping most parameters frozen. It is less common than LLM LoRA and is typically used alongside it for domain-specific adaptation.
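A launch fragment for the combined setup might look like the following. Only use_backbone_lora is documented above; the script path, model name, and the use_llm_lora flag are illustrative assumptions:

```shell
# Illustrative fine-tuning launch (script path, model name, and
# --use_llm_lora are assumptions; --use_backbone_lora is the documented
# ModelArguments field, here enabling rank-16 adapters in InternViT).
torchrun --nproc_per_node=8 internvl/train/internvl_chat_finetune.py \
  --model_name_or_path OpenGVLab/InternVL2-8B \
  --use_backbone_lora 16 \
  --use_llm_lora 16
```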
Theoretical Basis
The same low-rank formulation as LLM LoRA applies, W' = W + BA with trainable low-rank factors B and A, but to vision transformer layers instead of language model layers.
Target modules for InternViT:
- attn.qkv: Combined query/key/value projection
- attn.proj: Output projection
- mlp.fc1: First MLP layer
- mlp.fc2: Second MLP layer
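To make the parameter savings concrete, here is a back-of-the-envelope count for a single combined qkv projection. The hidden size of 1024 is an illustrative assumption, not a quoted InternViT configuration:

```python
# Rough parameter count for LoRA on a combined qkv projection (d -> 3d).
# The hidden size d is an assumed illustrative value, not InternViT's config.
d, r = 1024, 16              # hidden size, LoRA rank
full = d * (3 * d)           # frozen qkv weight matrix: d x 3d
lora = r * d + (3 * d) * r   # factor A: r x d, factor B: 3d x r
print(full, lora, round(100 * lora / full, 2))  # trainable share in percent
```

At rank 16 the adapter trains roughly 2% of the parameters of the frozen qkv weight it modifies, which is why backbone LoRA stays cheap even when applied to every listed module in every encoder layer.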