
Implementation:Mit han lab Llm awq Build vision projector

From Leeroopedia
Revision as of 13:15, 16 February 2026 by Admin (auto-imported from implementations/Mit_han_lab_Llm_awq_Build_vision_projector.md)
Knowledge Sources
Domains: Vision, Model_Architecture
Last Updated: 2026-02-15 00:00 GMT

Overview

Factory function and helper modules for building vision-to-language projection layers that bridge CLIP vision features to the language model embedding space in LLaVA.

Description

This module provides the build_vision_projector factory function that constructs the appropriate projection module based on the mm_projector_type in the model configuration. Four projector types are supported:

  • "linear": A single nn.Linear layer mapping mm_hidden_size to hidden_size. This is the simplest option: a purely affine map between the two feature spaces.
  • "mlp{N}x_gelu": A multi-layer perceptron with N linear layers and a GELU activation between each consecutive pair. The first layer maps mm_hidden_size to hidden_size, and the remaining N − 1 layers keep the width at hidden_size. The depth N is extracted from the type string with a regex (e.g., "mlp2x_gelu" yields Linear, GELU, Linear).
  • "identity": Uses the IdentityMap class, a pass-through module that returns its input unchanged and ignores any extra positional or keyword arguments. Its config property reports the projector type for serialization.
  • "linearclip": Combines a linear projection with a RangeClip module that clamps output values to a pre-loaded min/max range. The range is loaded from the file given by config.min_max_range_path and registered as buffers, so it moves with the module when the module is transferred to another device (e.g., via .to() or .cuda()).
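The dispatch logic described above can be sketched in plain PyTorch. This is a minimal, hypothetical re-creation, not the tinychat code: the names build_projector_sketch, IdentityMapSketch, RangeClipSketch and the clip_min/clip_max arguments are illustrative. Two further assumptions are labeled in the comments: the clip range is passed in directly instead of being read from min_max_range_path (whose file format is not documented here), and the clip is applied after the linear layer.

```python
import re
from types import SimpleNamespace

import torch
import torch.nn as nn


class IdentityMapSketch(nn.Module):
    """Pass-through projector mirroring the described IdentityMap behavior."""

    def forward(self, x, *args, **kwargs):
        return x  # extra arguments are accepted and ignored

    @property
    def config(self):
        # Reports the projector type so it can be serialized with the model config.
        return {"mm_projector_type": "identity"}


class RangeClipSketch(nn.Module):
    """Hypothetical stand-in for RangeClip: clamps values to a fixed [min, max]."""

    def __init__(self, min_val, max_val):
        super().__init__()
        if min_val is None or max_val is None:
            raise ValueError("linearclip needs both a min and a max value")
        # Buffers follow the module across devices but are not trainable.
        self.register_buffer("min_val", torch.as_tensor(min_val, dtype=torch.float32))
        self.register_buffer("max_val", torch.as_tensor(max_val, dtype=torch.float32))

    def forward(self, x):
        return torch.clamp(x, min=self.min_val.to(x.dtype), max=self.max_val.to(x.dtype))


def build_projector_sketch(config, clip_min=None, clip_max=None):
    projector_type = config.mm_projector_type

    if projector_type == "linear":
        return nn.Linear(config.mm_hidden_size, config.hidden_size)

    # "mlp2x_gelu" -> depth 2, "mlp3x_gelu" -> depth 3, ...
    match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
    if match:
        depth = int(match.group(1))
        layers = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
        for _ in range(1, depth):
            layers.append(nn.GELU())
            layers.append(nn.Linear(config.hidden_size, config.hidden_size))
        return nn.Sequential(*layers)

    if projector_type == "identity":
        return IdentityMapSketch()

    if projector_type == "linearclip":
        # Assumptions: range passed in directly, and clipping follows the linear layer.
        return nn.Sequential(
            nn.Linear(config.mm_hidden_size, config.hidden_size),
            RangeClipSketch(clip_min, clip_max),
        )

    raise ValueError(f"Unknown projector type: {projector_type}")


if __name__ == "__main__":
    cfg = SimpleNamespace(mm_projector_type="mlp2x_gelu", mm_hidden_size=1024, hidden_size=4096)
    print(build_projector_sketch(cfg))
```

Returning a plain nn.Linear or nn.Sequential keeps the state-dict keys in this sketch simple (for the MLP, 0.weight, 0.bias, 2.weight, 2.bias), since GELU holds no parameters.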

SimpleResBlock is a residual block that applies LayerNorm pre-normalization followed by a two-layer MLP with a GELU activation. It is available as a building block but is not used by build_vision_projector itself.
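A minimal sketch of a block matching that description, in plain PyTorch. The class name SimpleResBlockSketch is hypothetical. Adding the residual to the normalized input (rather than the raw input) follows the upstream LLaVA version of this block; treat it as an assumption to verify against the tinychat source.

```python
import torch
import torch.nn as nn


class SimpleResBlockSketch(nn.Module):
    """Pre-norm residual block: LayerNorm, then Linear -> GELU -> Linear, plus a skip path."""

    def __init__(self, channels: int):
        super().__init__()
        self.pre_norm = nn.LayerNorm(channels)
        self.proj = nn.Sequential(
            nn.Linear(channels, channels),
            nn.GELU(),
            nn.Linear(channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumption: the skip connection carries the normalized input.
        x = self.pre_norm(x)
        return x + self.proj(x)
```

Because the block preserves the channel count, it could be stacked after a projector that already outputs hidden_size features without changing any shapes.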

Usage

Import build_vision_projector when initializing a LLaVA model to construct the mm_projector module. The config object must define mm_projector_type, mm_hidden_size, and hidden_size, plus min_max_range_path when the projector type is "linearclip".
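Since nn.Linear acts only on the last dimension, the projector keeps the number of visual tokens and changes only their width. The sketch below builds the layer stack that the "mlp2x_gelu" description yields directly from torch.nn (so it runs without tinychat installed). It assumes a LLaVA-1.5-style CLIP ViT-L/14 encoder at 336 px, which produces 24 × 24 = 576 patch tokens of width 1024; these sizes are illustrative assumptions, not values read from this module.

```python
import torch
import torch.nn as nn

mm_hidden_size, hidden_size = 1024, 4096  # assumed CLIP width and LLM embedding width

# Same layer stack as the "mlp2x_gelu" projector described above.
projector = nn.Sequential(
    nn.Linear(mm_hidden_size, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, hidden_size),
)

# Assumed CLIP output: a batch of 2 images, 576 patch tokens each.
image_features = torch.randn(2, 576, mm_hidden_size)
with torch.no_grad():
    image_embeds = projector(image_features)

print(image_embeds.shape)  # torch.Size([2, 576, 4096])
```

The projected tokens now have the language model's embedding width and can be spliced into the text embedding sequence.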

Code Reference

Source Location

Signature

import torch
import torch.nn as nn

class IdentityMap(nn.Module):
    def forward(self, x, *args, **kwargs) -> torch.Tensor: ...
    @property
    def config(self) -> dict: ...

class SimpleResBlock(nn.Module):
    def __init__(self, channels: int): ...
    def forward(self, x: torch.Tensor) -> torch.Tensor: ...

def build_vision_projector(config, delay_load=False, **kwargs) -> nn.Module: ...

Import

from tinychat.models.llava_base.multimodal_projector.builder import build_vision_projector

I/O Contract

Inputs

Name | Type | Required | Description
config | object | Yes | Configuration with mm_projector_type, mm_hidden_size, hidden_size, and optionally min_max_range_path (needed for "linearclip")
delay_load | bool | No | Reserved for future use; currently unused

Outputs

Name | Type | Description
projector | nn.Module | A PyTorch module that maps vision features (last dimension mm_hidden_size) to the language embedding space (last dimension hidden_size); for "identity" the width is unchanged

Usage Examples

Building a linear projector

from tinychat.models.llava_base.multimodal_projector.builder import build_vision_projector

class Config:
    mm_projector_type = "linear"
    mm_hidden_size = 1024
    hidden_size = 4096

projector = build_vision_projector(Config())
# projector is nn.Linear(1024, 4096)

Building a 2-layer MLP projector

from tinychat.models.llava_base.multimodal_projector.builder import build_vision_projector

class Config:
    mm_projector_type = "mlp2x_gelu"
    mm_hidden_size = 1024
    hidden_size = 4096

projector = build_vision_projector(Config())
# projector is nn.Sequential(Linear(1024, 4096), GELU(), Linear(4096, 4096))
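The parameter cost of the two examples above follows directly from the layer shapes: an nn.Linear(in, out) with bias holds in × out + out parameters, and GELU holds none. The helper below is a hypothetical, pure-Python sketch (no tinychat or torch dependency) that applies this formula to the layer stacks the "linear" and "mlp{N}x_gelu" types are described as producing.

```python
import re


def projector_param_count(projector_type: str, mm_hidden_size: int, hidden_size: int) -> int:
    """Count parameters of a 'linear' or 'mlp{N}x_gelu' projector, biases included."""

    def linear_params(n_in: int, n_out: int) -> int:
        return n_in * n_out + n_out  # weight matrix + bias vector

    if projector_type == "linear":
        return linear_params(mm_hidden_size, hidden_size)

    match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
    if match is None:
        raise ValueError(f"unsupported projector type: {projector_type}")
    depth = int(match.group(1))
    # First layer changes the width; the remaining depth - 1 layers keep hidden_size.
    return linear_params(mm_hidden_size, hidden_size) + (depth - 1) * linear_params(hidden_size, hidden_size)


print(projector_param_count("linear", 1024, 4096))      # 4198400
print(projector_param_count("mlp2x_gelu", 1024, 4096))  # 20979712
```

The second 4096 × 4096 layer dominates the cost, so moving from "linear" to "mlp2x_gelu" takes the projector from about 4.2M to about 21.0M parameters.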
