Principle:OpenGVLab InternVL Vision Language Projection

Knowledge Sources	OpenGVLab_InternVL
Domains	Multimodal Models, Vision-Language, LLaVA
Last Updated	2026-02-07 14:00 GMT

Overview

The vision-language projection principle defines the strategy for bridging the dimensionality gap between vision encoder hidden states and the language model embedding space using configurable projection modules.

Description

In vision-language models, the vision encoder and language model typically operate in different embedding spaces with different dimensionalities. The mm_projector bridges this gap by transforming vision features into the language model's input space. This principle supports multiple projection strategies:

Linear projection: A single linear layer -- the simplest and most parameter-efficient approach, used in LLaVA v1.0.
MLP projection (mlpNx_gelu): An N-layer MLP with GELU activations between the linear layers, adding nonlinearity for richer feature transformation. A LayerNorm can optionally be placed in front of the MLP to normalize its input. The 2-layer variant (mlp2x_gelu) is standard in LLaVA v1.5.
Identity projection: A passthrough that assumes vision and language dimensions already match.
Dual-branch projection (TwoMLP): Separate MLPs for ViT patch features and query features (from InternVL-14B), concatenated to produce a combined representation.
Residual projection: A residual block with LayerNorm for gradient-friendly feature refinement.

The choice of projector type is controlled by a single configuration string, making it an easily configurable hyperparameter.
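As a rough illustration of how one configuration string can select a projector, the sketch below maps a type string to a plain list of layer specifications. The function name plan_projector, the tuple format, and the exact string patterns are assumptions made for this example, not the repository's API. Only the linear, mlpNx_gelu, and identity cases are covered; the dual-branch and residual variants are omitted.

```python
import re


def plan_projector(projector_type, vision_dim, llm_dim):
    """Translate a projector type string into a list of layer specs.

    Hypothetical helper for illustration only. Each spec is a tuple such as
    ("linear", in_dim, out_dim) or ("gelu",).
    """
    if projector_type == "linear":
        # A single linear map from vision width to language-model width.
        return [("linear", vision_dim, llm_dim)]

    if projector_type == "identity":
        # A passthrough only makes sense when both widths already match.
        if vision_dim != llm_dim:
            raise ValueError("identity projection requires matching dimensions")
        return []

    match = re.fullmatch(r"mlp(\d+)x_gelu", projector_type)
    if match:
        depth = int(match.group(1))
        if depth < 1:
            raise ValueError("MLP depth must be at least 1")
        # The first layer changes the width; later layers stay at llm_dim,
        # with a GELU between consecutive linear layers.
        layers = [("linear", vision_dim, llm_dim)]
        for _ in range(depth - 1):
            layers.append(("gelu",))
            layers.append(("linear", llm_dim, llm_dim))
        return layers

    raise ValueError(f"unknown projector type: {projector_type}")
```

In a real model each spec would be turned into the corresponding framework module. Keeping the parsing step separate from module construction is a design choice of this sketch that makes the string handling easy to test in isolation.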

Usage

Apply this principle when designing the connection layer between a vision encoder and a language model, choosing the projector type based on the desired expressiveness-efficiency trade-off.
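To make the efficiency side of this trade-off concrete, the following sketch counts trainable parameters for a linear projector versus an N-layer MLP projector. The helper names and the example widths (1024 for the vision encoder, 4096 for the language model) are hypothetical, and the count assumes every linear layer has a bias term and that GELU contributes no parameters.

```python
def linear_params(d_in, d_out):
    """Parameters of one linear layer: weight matrix plus bias vector."""
    return d_in * d_out + d_out


def projector_params(vision_dim, llm_dim, depth):
    """Parameter count for a depth-layer MLP projector (depth=1 is linear).

    Assumes the first layer maps vision_dim -> llm_dim and every later
    layer maps llm_dim -> llm_dim, as in the mlpNx_gelu description.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    total = linear_params(vision_dim, llm_dim)
    total += (depth - 1) * linear_params(llm_dim, llm_dim)
    return total


# Hypothetical widths chosen only for illustration.
linear_count = projector_params(1024, 4096, depth=1)  # 4,198,400
mlp2x_count = projector_params(1024, 4096, depth=2)   # 20,979,712
```

Under these assumed widths, the 2-layer MLP has roughly five times as many parameters as the linear projector, because the second layer is a square llm_dim by llm_dim map.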

Theoretical Basis

The projection layer serves as the "connector" between pre-trained, modality-specific encoders, following the paradigm established by BLIP-2 and LLaVA. Results reported in this line of work suggest that even a simple linear projection can be effective when the vision encoder is strong, while MLP projectors add nonlinear capacity that yields marginal improvements in certain scenarios.

Related Pages

Implementation:OpenGVLab_InternVL_Build_Vision_Projector
