Implementation:OpenGVLab InternVL Build Vision Projector
| Knowledge Sources | |
|---|---|
| Domains | Multimodal Models, Vision-Language, LLaVA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Factory function and supporting modules for building the vision-to-language projection layer (mm_projector) that bridges vision encoder outputs to the language model embedding space in LLaVA.
Description
The build_vision_projector() factory function creates the projection module selected by the mm_projector_type configuration string. The supported projector types and supporting modules are:
- "linear" -- A single nn.Linear layer mapping from mm_hidden_size to hidden_size
- "mlpNx_gelu" -- An N-layer MLP with GELU activations, where N is parsed from the type string (e.g., "mlp2x_gelu"); optionally prefixed with LayerNorm when "ln" appears in the type string
- "identity" -- An IdentityMap passthrough module that returns input unchanged
- "two_mlp" -- A TwoMLP module designed for InternVL-14B. It uses separate MLP branches for ViT features (hardcoded 3200-dim input) and query features (mm_hidden_size input), then concatenates the two outputs to produce 576 + 96 = 672 tokens
- SimpleResBlock -- A residual block with LayerNorm, two linear layers, and GELU activation
The MLP projector with GELU (typically mlp2x_gelu) is the most commonly used configuration in LLaVA models, providing a two-layer nonlinear projection from the vision hidden space to the language hidden space.
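The type-string dispatch described above can be sketched in plain Python. This is a minimal illustration and not the repository's code: `plan_projector` is a hypothetical helper that returns layer descriptions instead of building modules, the regular expression is an assumed pattern for "mlpNx_gelu", and both the position of the LayerNorm and the "ln_mlp2x_gelu" spelling are assumptions based only on the description. The "two_mlp" type is left out.

```python
import re


def plan_projector(projector_type, mm_hidden_size, hidden_size):
    """Hypothetical helper: describe the layers a projector type implies."""
    if projector_type == "linear":
        return [("linear", mm_hidden_size, hidden_size)]
    if projector_type == "identity":
        return [("identity",)]

    # Assumed pattern: "mlp<N>x_gelu" appears somewhere in the type string.
    match = re.search(r"mlp(\d+)x_gelu", projector_type)
    if match:
        depth = int(match.group(1))  # N, the number of linear layers
        layers = []
        if "ln" in projector_type:
            # Assumption: LayerNorm is applied to the vision features first.
            layers.append(("layernorm", mm_hidden_size))
        layers.append(("linear", mm_hidden_size, hidden_size))
        for _ in range(1, depth):
            layers.append(("gelu",))
            layers.append(("linear", hidden_size, hidden_size))
        return layers

    raise ValueError(f"Unknown projector type: {projector_type}")


plan = plan_projector("mlp2x_gelu", 1024, 4096)
```

For "mlp2x_gelu" the plan is a linear layer from 1024 to 4096, a GELU, and a second linear layer from 4096 to 4096, which matches the two-layer projection described above.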
Usage
Use this factory function when initializing the multimodal projector in any LLaVA model variant. It is called by LlavaMetaModel during model construction and vision module initialization.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/multimodal_projector/builder.py
- Lines: 1-84
Signature
class IdentityMap(nn.Module):
    def forward(self, x, *args, **kwargs): ...

class SimpleResBlock(nn.Module):
    def __init__(self, channels): ...
    def forward(self, x): ...

class TwoMLP(nn.Module):
    def __init__(self, config): ...
    def forward(self, inputs): ...

def build_vision_projector(config, delay_load=False, **kwargs): ...
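To make the two simplest signatures concrete, here is a minimal PyTorch sketch of what their bodies might look like, written from the descriptions on this page rather than copied from the repository. The attribute names `pre_norm` and `proj`, and the choice to normalize before the residual addition, are assumptions.

```python
import torch
import torch.nn as nn


class IdentityMap(nn.Module):
    """Passthrough module: returns its input unchanged."""

    def forward(self, x, *args, **kwargs):
        return x


class SimpleResBlock(nn.Module):
    """Residual block: LayerNorm, then Linear -> GELU -> Linear with a skip."""

    def __init__(self, channels):
        super().__init__()
        self.pre_norm = nn.LayerNorm(channels)  # assumed attribute name
        self.proj = nn.Sequential(
            nn.Linear(channels, channels),
            nn.GELU(),
            nn.Linear(channels, channels),
        )

    def forward(self, x):
        # Assumption: the skip connection uses the normalized features.
        x = self.pre_norm(x)
        return x + self.proj(x)


x = torch.randn(2, 5, 8)  # (batch, tokens, channels)
identity_out = IdentityMap()(x)
res_out = SimpleResBlock(8)(x)
```

Both modules preserve the input shape, so neither can change the feature width on its own. Mapping mm_hidden_size to hidden_size still requires a linear layer.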
Import
from llava.model.multimodal_projector.builder import build_vision_projector
I/O Contract
Inputs (build_vision_projector)
| Name | Type | Required | Description |
|---|---|---|---|
| config | PretrainedConfig | Yes | Model config with mm_projector_type, mm_hidden_size, and hidden_size attributes |
| delay_load | bool | No | Whether to delay loading (default: False) |
| **kwargs | Any | No | Additional keyword arguments accepted by the signature |
Outputs
| Name | Type | Description |
|---|---|---|
| projector | nn.Module | The constructed projection module (Linear, Sequential MLP, IdentityMap, TwoMLP, or SimpleResBlock) |
Usage Examples
Basic Usage
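import torch
from types import SimpleNamespace
from llava.model.multimodal_projector.builder import build_vision_projector

# Build a 2-layer MLP projector with GELU.
# SimpleNamespace stands in for the model's PretrainedConfig here.
config = SimpleNamespace(
    mm_projector_type="mlp2x_gelu",
    mm_hidden_size=1024,  # vision encoder hidden size
    hidden_size=4096,     # LLM hidden size
)
projector = build_vision_projector(config)

# Forward: project vision features to language space
vision_features = torch.randn(1, 576, 1024)  # example input: (batch, tokens, mm_hidden_size)
language_features = projector(vision_features)  # shape: (1, 576, 4096)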