Principle:OpenGVLab InternVL Vision Language Projection
| Knowledge Sources | |
|---|---|
| Domains | Multimodal Models, Vision-Language, LLaVA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The vision-language projection principle defines the strategy for bridging the dimensionality gap between vision encoder hidden states and the language model embedding space using configurable projection modules.
Description
In vision-language models, the vision encoder and language model typically operate in different embedding spaces with different dimensionalities. The mm_projector bridges this gap by transforming vision features into the language model's input space. This principle supports multiple projection strategies:
- Linear projection: A single linear layer, the simplest and most parameter-efficient approach, used in LLaVA v1.0.
- MLP projection (mlpNx_gelu): An N-layer MLP with GELU activations between the linear layers, adding nonlinear capacity for richer feature transformation. A LayerNorm can optionally precede the MLP to normalize its input. The 2-layer variant (mlp2x_gelu) is standard in LLaVA v1.5.
- Identity projection: A passthrough that assumes vision and language dimensions already match.
- Dual-branch projection (TwoMLP): Separate MLPs for ViT patch features and query features (from InternVL-14B), concatenated to produce a combined representation.
- Residual projection: A residual block with LayerNorm for gradient-friendly feature refinement.
The projector type is selected by a single configuration string, so switching strategies is a configuration change rather than a code change.
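The mapping from configuration string to layer stack can be sketched as follows. This is a minimal illustration, not the repository's builder: the function name `plan_projector`, the `ln_` prefix used for the LayerNorm variant, and the tuple-based layer descriptions are all assumptions made for this example. It returns a description of the layers instead of constructing them, so it runs without a deep learning framework.

```python
import re


def plan_projector(projector_type, vision_dim, llm_dim):
    """Describe the layer stack for a projector config string.

    Hypothetical helper: layers are returned as plain tuples so the
    parsing logic can be inspected without building real modules.
    """
    if projector_type == "identity":
        # A passthrough only makes sense when the dimensions already match.
        if vision_dim != llm_dim:
            raise ValueError("identity projection requires matching dimensions")
        return [("identity",)]

    if projector_type == "linear":
        return [("linear", vision_dim, llm_dim)]

    # "mlpNx_gelu", optionally prefixed with "ln_" (assumed spelling).
    match = re.fullmatch(r"(ln_)?mlp(\d+)x_gelu", projector_type)
    if match:
        depth = int(match.group(2))
        if depth < 1:
            raise ValueError("MLP depth must be at least 1")
        layers = []
        if match.group(1):
            layers.append(("layernorm", vision_dim))
        # The first layer changes dimensionality; later layers stay in LLM space.
        layers.append(("linear", vision_dim, llm_dim))
        for _ in range(depth - 1):
            layers.append(("gelu",))
            layers.append(("linear", llm_dim, llm_dim))
        return layers

    raise ValueError(f"unknown projector type: {projector_type}")


if __name__ == "__main__":
    # Illustrative dimensions, not taken from any specific checkpoint.
    for layer in plan_projector("mlp2x_gelu", 1024, 4096):
        print(layer)
```

The dual-branch and residual variants are omitted here because their exact layer layout is not specified above.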
Usage
Apply this principle when designing the connection layer between a vision encoder and a language model, choosing the projector type based on the desired expressiveness-efficiency trade-off.
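To make the efficiency side of this trade-off concrete, the sketch below counts trainable parameters for a linear projector and an N-layer MLP projector. The helper names and the dimensions (1024 for the vision encoder, 4096 for the language model) are assumptions chosen for illustration; the counts assume every linear layer has a bias and that GELU contributes no parameters.

```python
def linear_params(d_in, d_out):
    """Weights plus bias of one fully connected layer."""
    return d_in * d_out + d_out


def mlp_params(d_in, d_out, depth):
    """Parameters of a depth-layer MLP projector.

    The first layer maps d_in -> d_out; each remaining layer maps
    d_out -> d_out. Activations add no parameters.
    """
    return linear_params(d_in, d_out) + (depth - 1) * linear_params(d_out, d_out)


if __name__ == "__main__":
    vision_dim, llm_dim = 1024, 4096  # assumed, for illustration only
    print("linear    :", linear_params(vision_dim, llm_dim))
    print("mlp2x_gelu:", mlp_params(vision_dim, llm_dim, depth=2))
```

Under these assumed dimensions the 2-layer MLP has roughly five times the parameters of the linear projector (20,979,712 versus 4,198,400), which is still small next to the encoders it connects.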
Theoretical Basis
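The projection layer serves as the "connector" between pre-trained, modality-specific encoders, following the paradigm established by BLIP-2 and LLaVA. The original LLaVA showed that a single linear projection can be effective when the vision encoder is strong, and LLaVA v1.5 reported further gains from replacing it with a 2-layer MLP. The size of that improvement depends on the setting, so projector depth is best treated as a hyperparameter to validate rather than a guaranteed win.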