Implementation:Mit han lab Llm awq Build vision projector

Knowledge Sources	Mit_han_lab_Llm_awq
Domains	Vision, Model_Architecture
Last Updated	2026-02-15 00:00 GMT

Overview

Factory function and helper modules for building vision-to-language projection layers that bridge CLIP vision features to the language model embedding space in LLaVA.

Description

This module provides the build_vision_projector factory function that constructs the appropriate projection module based on the mm_projector_type in the model configuration. Four projector types are supported:

"linear": A single nn.Linear layer mapping from mm_hidden_size to hidden_size, providing the simplest feature space transformation.
"mlp{N}x_gelu": A multi-layer perceptron with N linear layers and a GELU activation between each consecutive pair. The first layer maps from mm_hidden_size to hidden_size, and every subsequent layer maps hidden_size to hidden_size. The depth N is extracted from the type string via regex matching (e.g., "mlp2x_gelu" produces a 2-layer MLP).
"identity": Uses the IdentityMap class, a pass-through module that returns input unchanged. Its config property reports the projector type for serialization.
"linearclip": Combines a linear projection with a RangeClip module that clamps output values to a pre-loaded min/max range. The range is loaded from a file specified by config.min_max_range_path and registered as buffers for proper device handling.
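The way a type string turns into a layer layout can be sketched without PyTorch. The following is a minimal illustration, not the repository's code: the helper name plan_projector_layers and the exact regex are assumptions, and the function only describes layers as tuples rather than building modules. It omits "linearclip" because that type depends on a range file whose format is not described here.

```python
import re

# Assumed pattern for the "mlp{N}x_gelu" naming scheme described above.
MLP_GELU_PATTERN = re.compile(r"^mlp(\d+)x_gelu$")


def plan_projector_layers(projector_type, mm_hidden_size, hidden_size):
    """Return a list of layer descriptions for a projector type string.

    Hypothetical helper for illustration only: it describes layers as
    tuples instead of constructing torch modules.
    """
    if projector_type == "linear":
        return [("linear", mm_hidden_size, hidden_size)]
    if projector_type == "identity":
        return []  # pass-through: no layers at all

    match = MLP_GELU_PATTERN.match(projector_type)
    if match:
        depth = int(match.group(1))
        # The first layer changes the width; the rest keep hidden_size.
        layers = [("linear", mm_hidden_size, hidden_size)]
        for _ in range(1, depth):
            layers.append(("gelu",))
            layers.append(("linear", hidden_size, hidden_size))
        return layers

    raise ValueError(f"Unknown projector type: {projector_type}")


print(plan_projector_layers("mlp2x_gelu", 1024, 4096))
```

Anchoring the pattern with ^ and $ keeps a malformed string such as "mlp2x_gelu_v2" from silently matching, so unsupported types fall through to the error.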

SimpleResBlock is a residual block with LayerNorm pre-normalization and a two-layer MLP with GELU activation, available as a building block though not directly used by the factory function.
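A residual block of this shape might look like the sketch below. It is an illustration under stated assumptions rather than the repository's implementation: the class name ResBlockSketch and the attribute names pre_norm and proj are hypothetical, and it assumes the skip connection is taken from the normalized input.

```python
import torch
import torch.nn as nn


class ResBlockSketch(nn.Module):
    """Illustrative stand-in for SimpleResBlock; not the repository's code."""

    def __init__(self, channels: int):
        super().__init__()
        # LayerNorm applied before the MLP (pre-normalization).
        self.pre_norm = nn.LayerNorm(channels)
        # Two-layer MLP with a GELU in between; width stays at `channels`.
        self.proj = nn.Sequential(
            nn.Linear(channels, channels),
            nn.GELU(),
            nn.Linear(channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pre_norm(x)
        # Assumption: the residual is added to the normalized input.
        return x + self.proj(x)


block = ResBlockSketch(channels=64)
features = torch.randn(2, 10, 64)
out = block(features)  # same shape as the input
```

Because input and output widths match, blocks like this can be stacked after a width-changing linear layer without further reshaping.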

Usage

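Import build_vision_projector when initializing a LLaVA model to construct the mm_projector module. The config object must have mm_hidden_size and hidden_size attributes, along with the mm_projector_type that selects the projector; the "linearclip" type additionally reads min_max_range_path.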

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/models/llava_base/multimodal_projector/builder.py
Lines: 1-72

Signature

class IdentityMap(nn.Module):
    def forward(self, x, *args, **kwargs) -> torch.Tensor: ...
    @property
    def config(self) -> dict: ...

class SimpleResBlock(nn.Module):
    def __init__(self, channels: int): ...
    def forward(self, x: torch.Tensor) -> torch.Tensor: ...

def build_vision_projector(config, delay_load=False, **kwargs) -> nn.Module: ...

Import

from tinychat.models.llava_base.multimodal_projector.builder import build_vision_projector

I/O Contract

Inputs

Name	Type	Required	Description
config	object	Yes	Configuration with mm_projector_type, mm_hidden_size, hidden_size, and optionally min_max_range_path
delay_load	bool	No	Reserved for future use; currently unused

Outputs

Name	Type	Description
projector	nn.Module	A PyTorch module that maps vision features (mm_hidden_size) to language embedding space (hidden_size); for the "identity" type, the input is returned unchanged
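The shape contract can be checked with a stand-in module. This sketch assumes the vision features arrive as a (batch, num_patches, mm_hidden_size) tensor and uses a plain nn.Linear in place of the projector that build_vision_projector would return for the "linear" type; the sizes are deliberately small and purely illustrative.

```python
import torch
import torch.nn as nn

mm_hidden_size, hidden_size = 32, 64  # small illustrative sizes

# Stand-in for the module returned for mm_projector_type = "linear".
projector = nn.Linear(mm_hidden_size, hidden_size)

# Assumed layout: (batch, num_patches, mm_hidden_size).
vision_features = torch.randn(2, 16, mm_hidden_size)

with torch.no_grad():
    language_features = projector(vision_features)

# Only the last dimension changes: mm_hidden_size -> hidden_size.
print(tuple(language_features.shape))
```

nn.Linear acts on the last dimension only, so the batch and patch dimensions pass through untouched.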

Usage Examples

Building a linear projector

from tinychat.models.llava_base.multimodal_projector.builder import build_vision_projector

class Config:
    mm_projector_type = "linear"
    mm_hidden_size = 1024
    hidden_size = 4096

projector = build_vision_projector(Config())
# projector is nn.Linear(1024, 4096)

Building a 2-layer MLP projector

from tinychat.models.llava_base.multimodal_projector.builder import build_vision_projector

class Config:
    mm_projector_type = "mlp2x_gelu"
    mm_hidden_size = 1024
    hidden_size = 4096

projector = build_vision_projector(Config())
# projector is nn.Sequential(Linear(1024, 4096), GELU(), Linear(4096, 4096))

Related Pages

Principle:Mit_han_lab_Llm_awq_Vision_Language_Projection
