Principle:OpenGVLab InternVL Vision Language Projection

From Leeroopedia


Knowledge Sources
Domains: Multimodal Models, Vision-Language, LLaVA
Last Updated: 2026-02-07 14:00 GMT

Overview

The vision-language projection principle describes how to bridge the dimensionality gap between a vision encoder's hidden states and a language model's embedding space using a configurable projection module.

Description

In vision-language models, the vision encoder and the language model typically operate in different embedding spaces with different dimensionalities. The projection module (mm_projector) bridges this gap by transforming vision features into the language model's input space. This principle supports multiple projection strategies:

  • Linear projection: A single linear layer -- the simplest and most parameter-efficient approach, used in LLaVA v1.0.
  • MLP projection (mlpNx_gelu): An N-layer MLP with GELU activations, providing nonlinear expressiveness for richer feature transformation. Optionally prefixed with LayerNorm for input normalization. The 2-layer variant (mlp2x_gelu) is standard in LLaVA v1.5.
  • Identity projection: A passthrough that assumes vision and language dimensions already match.
  • Dual-branch projection (TwoMLP): Separate MLPs for ViT patch features and query features (from InternVL-14B), concatenated to produce a combined representation.
  • Residual projection: A residual block with LayerNorm for gradient-friendly feature refinement.

The projector type is selected by a single configuration string, so it can be changed like any other hyperparameter without modifying model code.
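
The dispatch on a configuration string described above can be sketched in a framework-agnostic way. This is a minimal sketch under stated assumptions, not the InternVL implementation: the function name plan_projector and the tuple layout are hypothetical, only the linear, identity, and mlpNx_gelu types are covered, and every hidden layer is assumed to use the language model width. In a PyTorch model, each tuple would typically map to nn.Linear, nn.GELU, or nn.Identity.

```python
import re


def plan_projector(projector_type, vision_dim, text_dim):
    """Return an ordered list of (layer_kind, in_dim, out_dim) tuples.

    Hypothetical helper for illustration; it only plans the layers and
    does not build any tensors or modules.
    """
    if projector_type == "linear":
        return [("linear", vision_dim, text_dim)]

    if projector_type == "identity":
        # A passthrough is only valid when both widths already match.
        if vision_dim != text_dim:
            raise ValueError("identity projector needs vision_dim == text_dim")
        return [("identity", vision_dim, text_dim)]

    # "mlpNx_gelu": N linear layers with a GELU between consecutive layers.
    match = re.fullmatch(r"mlp(\d+)x_gelu", projector_type)
    if match:
        depth = int(match.group(1))
        if depth < 1:
            raise ValueError("MLP depth must be at least 1")
        layers = [("linear", vision_dim, text_dim)]
        for _ in range(depth - 1):
            # Assumption: hidden layers keep the language model width.
            layers.append(("gelu", text_dim, text_dim))
            layers.append(("linear", text_dim, text_dim))
        return layers

    raise ValueError(f"Unknown projector type: {projector_type}")


# Illustrative widths only; real values depend on the chosen encoder and LLM.
print(plan_projector("mlp2x_gelu", 1024, 4096))
```

Keeping the parsing separate from module construction makes it easy to validate a configuration string before any weights are allocated.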

Usage

Apply this principle when designing the connection layer between a vision encoder and a language model, choosing the projector type based on the desired expressiveness-efficiency trade-off.
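
To make the efficiency side of this trade-off concrete, the following sketch counts trainable parameters for a linear projector and for deeper MLP projectors. It is a rough illustration under assumptions: the helper names are hypothetical, the widths 1024 and 4096 are example values rather than figures from InternVL, every linear layer has a bias, and hidden layers use the language model width.

```python
def linear_params(in_dim, out_dim):
    """Weights plus biases of one fully connected layer."""
    return in_dim * out_dim + out_dim


def mlp_projector_params(depth, vision_dim, text_dim):
    """Parameters of a depth-layer MLP projector (depth=1 is a plain linear layer).

    GELU activations have no parameters, so only linear layers are counted.
    """
    total = linear_params(vision_dim, text_dim)
    total += (depth - 1) * linear_params(text_dim, text_dim)
    return total


# Example widths (assumed for illustration only).
vision_dim, text_dim = 1024, 4096

for depth in (1, 2, 3):
    count = mlp_projector_params(depth, vision_dim, text_dim)
    print(f"depth={depth}: {count:,} parameters")
```

With these example widths, the single linear layer has about 4.2M parameters and the 2-layer MLP about 21.0M, so the added expressiveness comes at roughly five times the projector size. An identity projection adds no parameters at all.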

Theoretical Basis

The projection layer serves as the "connector" between pre-trained modality-specific encoders, following the paradigm established by BLIP-2 and LLaVA. Reported results suggest that even a single linear projection can be effective when the vision encoder is strong (as in LLaVA v1.0), while MLP projectors (as in LLaVA v1.5) provide modest improvements in some settings.
