Implementation:OpenGVLab InternVL Build Vision Projector

Knowledge Sources	OpenGVLab_InternVL
Domains	Multimodal Models, Vision-Language, LLaVA
Last Updated	2026-02-07 14:00 GMT

Overview

Factory function and supporting modules for building the vision-to-language projection layer (mm_projector) that bridges vision encoder outputs to the language model embedding space in LLaVA.

Description

The build_vision_projector() factory function creates the projection module selected by the mm_projector_type configuration string. Supported projector types are:

"linear" -- A single nn.Linear layer mapping from mm_hidden_size to hidden_size
"mlpNx_gelu" -- An N-layer MLP with GELU activations, where N is parsed from the type string (e.g., N = 2 for "mlp2x_gelu"); optionally prefixed with LayerNorm when "ln" appears in the type string
"identity" -- An IdentityMap passthrough module that returns its input unchanged
"two_mlp" -- A TwoMLP module designed for InternVL-14B that uses separate MLP branches for ViT features (hardcoded 3200-dim input) and query features (mm_hidden_size input) and concatenates their outputs, giving 576 + 96 = 672 tokens

The same file also defines SimpleResBlock, a residual block with LayerNorm, two linear layers, and a GELU activation. It is a module class, not an mm_projector_type string.
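
The two-branch layout described for "two_mlp" can be pictured with the sketch below. It is not the repository's TwoMLP: the class name, the two-layer depth of each branch, the tuple input, the assignment of 576 tokens to the ViT branch and 96 to the query branch, and concatenation along the token dimension are all assumptions made for illustration. Only the 3200-dim ViT input and the 576 + 96 = 672 total come from the description above.

```python
import torch
import torch.nn as nn


class TwoBranchProjectorSketch(nn.Module):
    """Hypothetical stand-in for a two-branch projector (not the real TwoMLP)."""

    def __init__(self, mm_hidden_size, hidden_size, vit_hidden_size=3200):
        super().__init__()
        # Branch for ViT features (3200-dim input, per the description)
        self.vit_mlp = nn.Sequential(
            nn.Linear(vit_hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        # Branch for query features (mm_hidden_size input)
        self.query_mlp = nn.Sequential(
            nn.Linear(mm_hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, inputs):
        # Assumption: inputs is a (vit_features, query_features) pair
        vit_features, query_features = inputs
        projected = [self.vit_mlp(vit_features), self.query_mlp(query_features)]
        # Concatenate along the token dimension (dim=1)
        return torch.cat(projected, dim=1)


# Small hidden sizes keep the demo light; token counts follow the description
sketch = TwoBranchProjectorSketch(mm_hidden_size=48, hidden_size=64)
tokens = sketch((torch.randn(1, 576, 3200), torch.randn(1, 96, 48)))
```

In this sketch both branches project into the same hidden_size, which is what allows the two token sequences to be concatenated into one.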

The MLP projector with GELU (typically mlp2x_gelu) is the most commonly used configuration in LLaVA models, providing a two-layer nonlinear projection from the vision hidden space to the language hidden space.
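
As a rough illustration of how a type string such as "mlp2x_gelu" could translate into layers, the sketch below parses N with a regular expression and stacks Linear and GELU modules. It is a simplified stand-in, not the repository's code: the function name build_mlp_projector_sketch and the regular expression are assumptions, nn.Identity substitutes for IdentityMap, and the "ln" and "two_mlp" variants are omitted.

```python
import re
from types import SimpleNamespace

import torch
import torch.nn as nn


def build_mlp_projector_sketch(config):
    """Hypothetical helper: build a linear, identity, or mlpNx_gelu projector."""
    projector_type = config.mm_projector_type

    if projector_type == "linear":
        return nn.Linear(config.mm_hidden_size, config.hidden_size)

    if projector_type == "identity":
        return nn.Identity()  # stand-in for IdentityMap

    match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
    if match:
        depth = int(match.group(1))  # N in "mlpNx_gelu"
        layers = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
        for _ in range(1, depth):
            # Each extra layer adds a GELU followed by a hidden-to-hidden Linear
            layers.append(nn.GELU())
            layers.append(nn.Linear(config.hidden_size, config.hidden_size))
        return nn.Sequential(*layers)

    raise ValueError(f"Unknown projector type: {projector_type}")


# Small sizes keep the demo fast; SimpleNamespace stands in for a model config
demo_config = SimpleNamespace(
    mm_projector_type="mlp2x_gelu", mm_hidden_size=32, hidden_size=64
)
demo_projector = build_mlp_projector_sketch(demo_config)
demo_output = demo_projector(torch.randn(2, 5, 32))
```

With N = 2 the resulting stack is Linear, GELU, Linear, so only the last dimension of the input changes, from mm_hidden_size to hidden_size.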

Usage

Use this factory function when initializing the multimodal projector in any LLaVA model variant. It is called by LlavaMetaModel during model construction and vision module initialization.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/model/multimodal_projector/builder.py
Lines: 1-84

Signature

class IdentityMap(nn.Module):
    def forward(self, x, *args, **kwargs): ...

class SimpleResBlock(nn.Module):
    def __init__(self, channels): ...
    def forward(self, x): ...

class TwoMLP(nn.Module):
    def __init__(self, config): ...
    def forward(self, inputs): ...

def build_vision_projector(config, delay_load=False, **kwargs): ...
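
To make the signatures above more concrete, here is a minimal sketch of what the two simplest modules could look like, based only on their descriptions (a passthrough, and a residual block with LayerNorm, two linear layers, and GELU). The Sketch suffix marks the classes as illustrative, and the ordering of the normalization and the residual addition is an assumption.

```python
import torch
import torch.nn as nn


class IdentityMapSketch(nn.Module):
    """Passthrough: returns the input unchanged and ignores extra arguments."""

    def forward(self, x, *args, **kwargs):
        return x


class SimpleResBlockSketch(nn.Module):
    """Residual block: LayerNorm, then Linear -> GELU -> Linear added to the normed input."""

    def __init__(self, channels):
        super().__init__()
        self.pre_norm = nn.LayerNorm(channels)
        self.proj = nn.Sequential(
            nn.Linear(channels, channels),
            nn.GELU(),
            nn.Linear(channels, channels),
        )

    def forward(self, x):
        x = self.pre_norm(x)
        return x + self.proj(x)


features = torch.randn(2, 4, 16)  # (batch, tokens, channels), sizes chosen arbitrarily
identity_out = IdentityMapSketch()(features)
res_out = SimpleResBlockSketch(16)(features)
```

Both modules preserve the input shape, so neither can change the feature dimension on its own.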

Import

from llava.model.multimodal_projector.builder import build_vision_projector

I/O Contract

Inputs (build_vision_projector)

Name	Type	Required	Description
config	PretrainedConfig	Yes	Model config with mm_projector_type, mm_hidden_size, and hidden_size attributes
delay_load	bool	No	Whether to delay loading (default: False)
**kwargs	dict	No	Additional keyword arguments accepted by the signature

Outputs

Name	Type	Description
projector	nn.Module	The constructed projection module (Linear, Sequential MLP, IdentityMap, TwoMLP, or SimpleResBlock)

Usage Examples

Basic Usage

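import torch
from types import SimpleNamespace

from llava.model.multimodal_projector.builder import build_vision_projector

# Build a 2-layer MLP projector with GELU.
# SimpleNamespace stands in for the model's PretrainedConfig here,
# assuming the builder only reads the three attributes below.
config = SimpleNamespace(
    mm_projector_type="mlp2x_gelu",
    mm_hidden_size=1024,  # vision encoder hidden size
    hidden_size=4096,     # LLM hidden size
)
projector = build_vision_projector(config)

# Forward: project vision features to language space
# Illustrative input of shape (batch, tokens, mm_hidden_size)
vision_features = torch.randn(1, 576, 1024)
language_features = projector(vision_features)  # last dim becomes hidden_size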

Related Pages

Principle:OpenGVLab_InternVL_Vision_Language_Projection
