Implementation:OpenGVLab InternVL Build Vision Projector

From Leeroopedia


Knowledge Sources
Domains: Multimodal Models, Vision-Language, LLaVA
Last Updated: 2026-02-07 14:00 GMT

Overview

Factory function and supporting modules for building the vision-to-language projection layer (mm_projector) that bridges vision encoder outputs to the language model embedding space in LLaVA.

Description

The build_vision_projector() factory function returns the projection module selected by the mm_projector_type configuration string. The projector types and supporting modules covered on this page are:

  • "linear" -- A single nn.Linear layer mapping from mm_hidden_size to hidden_size
  • "mlpNx_gelu" -- An N-layer MLP with GELU activations, where N is parsed from the type string (e.g., "mlp2x_gelu"); optionally prefixed with LayerNorm when "ln" appears in the type string
  • "identity" -- An IdentityMap passthrough module that returns input unchanged
  • "two_mlp" -- A TwoMLP module designed for InternVL-14B that uses separate MLP branches for ViT features (hardcoded 3200-dim input) and query features (mm_hidden_size input), concatenating them to produce 576+96=672 tokens
  • SimpleResBlock -- A residual block with LayerNorm, two linear layers, and GELU activation
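The token arithmetic in the "two_mlp" entry (576 + 96 = 672) can be illustrated with a minimal sketch. TwoBranchProjector is a hypothetical stand-in, not the TwoMLP class itself: the two-argument forward, the two-layer depth of each branch, the 768-dim query width, and the small hidden size are all assumptions made to keep the example short.

```python
import torch
import torch.nn as nn


class TwoBranchProjector(nn.Module):
    """Hypothetical stand-in for TwoMLP: one MLP per feature stream."""

    def __init__(self, vit_dim, query_dim, hidden_size):
        super().__init__()
        # Branch for ViT features (3200-dim in the description above).
        self.vit_mlp = nn.Sequential(
            nn.Linear(vit_dim, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        # Branch for query features (mm_hidden_size-dim; 768 is assumed here).
        self.query_mlp = nn.Sequential(
            nn.Linear(query_dim, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, vit_feats, query_feats):
        # Project each stream, then concatenate along the token axis (dim=1).
        return torch.cat(
            [self.vit_mlp(vit_feats), self.query_mlp(query_feats)], dim=1
        )


projector = TwoBranchProjector(vit_dim=3200, query_dim=768, hidden_size=64)
vit_feats = torch.randn(1, 576, 3200)   # 576 ViT tokens
query_feats = torch.randn(1, 96, 768)   # 96 query tokens
tokens = projector(vit_feats, query_feats)
print(tokens.shape)  # torch.Size([1, 672, 64])
```

Concatenating on the token axis leaves the language model with one sequence of 672 visual tokens, all of the same width.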

The MLP projector with GELU (typically mlp2x_gelu) is the most commonly used configuration in LLaVA models, providing a two-layer nonlinear projection from the vision hidden space to the language hidden space.
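As a rough illustration of how an "mlpNx_gelu" type string could turn into a module, the sketch below parses N and stacks the layers. build_mlp_gelu is a hypothetical helper name, the regular expression is an assumption about the string format, and the optional LayerNorm prefix is left out; the small sizes are chosen only so the example runs quickly.

```python
import re

import torch
import torch.nn as nn


def build_mlp_gelu(projector_type, mm_hidden_size, hidden_size):
    """Build an N-layer GELU MLP from a type string such as "mlp2x_gelu"."""
    match = re.search(r"mlp(\d+)x_gelu", projector_type)
    if match is None:
        raise ValueError(f"Not an MLP projector type: {projector_type}")
    depth = int(match.group(1))
    # The first layer maps vision width to LLM width; later layers keep LLM width.
    layers = [nn.Linear(mm_hidden_size, hidden_size)]
    for _ in range(1, depth):
        layers.append(nn.GELU())
        layers.append(nn.Linear(hidden_size, hidden_size))
    return nn.Sequential(*layers)


mlp = build_mlp_gelu("mlp2x_gelu", mm_hidden_size=32, hidden_size=64)
features = torch.randn(2, 10, 32)  # (batch, tokens, mm_hidden_size)
projected = mlp(features)
print(len(mlp), projected.shape)  # 3 torch.Size([2, 10, 64])
```

With N = 2 the result is Linear, GELU, Linear, which is the two-layer nonlinear projection described above.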

Usage

Use this factory function when initializing the multimodal projector in any LLaVA model variant. It is called by LlavaMetaModel during model construction and vision module initialization.

Code Reference

Source Location

Signature

class IdentityMap(nn.Module):
    def forward(self, x, *args, **kwargs): ...

class SimpleResBlock(nn.Module):
    def __init__(self, channels): ...
    def forward(self, x): ...

class TwoMLP(nn.Module):
    def __init__(self, config): ...
    def forward(self, inputs): ...

def build_vision_projector(config, delay_load=False, **kwargs): ...
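The signatures above elide the method bodies. A minimal sketch of what the two simplest modules might look like, based only on the descriptions on this page, follows. The class names IdentitySketch and ResBlockSketch are hypothetical, and the placement of the LayerNorm relative to the skip connection is an assumption.

```python
import torch
import torch.nn as nn


class IdentitySketch(nn.Module):
    """Passthrough: returns the input unchanged and ignores extra arguments."""

    def forward(self, x, *args, **kwargs):
        return x


class ResBlockSketch(nn.Module):
    """LayerNorm followed by a two-layer GELU MLP with a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.pre_norm = nn.LayerNorm(channels)
        self.proj = nn.Sequential(
            nn.Linear(channels, channels),
            nn.GELU(),
            nn.Linear(channels, channels),
        )

    def forward(self, x):
        x = self.pre_norm(x)
        # Residual sum: input and output widths must match, hence one `channels`.
        return x + self.proj(x)


x = torch.randn(2, 5, 16)
same = IdentitySketch()(x)
out = ResBlockSketch(channels=16)(x)
print(same is x, out.shape)  # True torch.Size([2, 5, 16])
```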

Import

from llava.model.multimodal_projector.builder import build_vision_projector

I/O Contract

Inputs (build_vision_projector)

Name | Type | Required | Description
config | PretrainedConfig | Yes | Model config with mm_projector_type, mm_hidden_size, and hidden_size attributes
delay_load | bool | No | Whether to delay loading (default: False)
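Because the contract expects three attributes on config, a small pre-flight check can catch an incomplete config before the projector is built. The helper below is hypothetical (it is not part of the builder module) and uses SimpleNamespace as a stand-in for a PretrainedConfig.

```python
from types import SimpleNamespace

# Attributes named in the input contract above.
REQUIRED_FIELDS = ("mm_projector_type", "mm_hidden_size", "hidden_size")


def missing_projector_fields(config):
    """Return the contract attributes that `config` does not define."""
    return [name for name in REQUIRED_FIELDS if not hasattr(config, name)]


complete = SimpleNamespace(
    mm_projector_type="mlp2x_gelu", mm_hidden_size=1024, hidden_size=4096
)
partial = SimpleNamespace(mm_hidden_size=1024)

print(missing_projector_fields(complete))  # []
print(missing_projector_fields(partial))   # ['mm_projector_type', 'hidden_size']
```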

Outputs

Name | Type | Description
projector | nn.Module | The constructed projection module (Linear, Sequential MLP, IdentityMap, TwoMLP, or SimpleResBlock)

Usage Examples

Basic Usage

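from types import SimpleNamespace

import torch
from llava.model.multimodal_projector.builder import build_vision_projector

# Build a 2-layer MLP projector with GELU.
# SimpleNamespace stands in for the model's PretrainedConfig in this example.
config = SimpleNamespace(
    mm_projector_type="mlp2x_gelu",
    mm_hidden_size=1024,  # vision encoder hidden size
    hidden_size=4096,     # LLM hidden size
)
projector = build_vision_projector(config)

# Forward: project vision features to language space
vision_features = torch.randn(1, 576, 1024)  # (batch, tokens, mm_hidden_size); illustrative shape
language_features = projector(vision_features)  # shape (1, 576, 4096)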
