Implementation:Mit han lab Llm awq Build vision projector
| Knowledge Sources | |
|---|---|
| Domains | Vision, Model_Architecture |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Factory function and helper modules for building vision-to-language projection layers that bridge CLIP vision features to the language model embedding space in LLaVA.
Description
This module provides the build_vision_projector factory function that constructs the appropriate projection module based on the mm_projector_type in the model configuration. Four projector types are supported:
- "linear": A single nn.Linear layer mapping from mm_hidden_size to hidden_size, providing the simplest feature space transformation.
- "mlp{N}x_gelu": A multi-layer perceptron with N linear layers separated by GELU activations. The first layer maps mm_hidden_size to hidden_size, and each subsequent layer maps hidden_size to hidden_size. The depth N is parsed from the type string with a regular expression (e.g., "mlp2x_gelu" produces a 2-layer MLP).
- "identity": Uses the IdentityMap class, a pass-through module that returns input unchanged. Its config property reports the projector type for serialization.
- "linearclip": Combines a linear projection with a RangeClip module that clamps output values to a pre-loaded min/max range. The range is loaded from a file specified by config.min_max_range_path and registered as buffers for proper device handling.
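The dispatch described in the list above can be sketched without PyTorch as a function that returns a layer plan instead of real modules. This is an illustrative sketch, not the repository's code: `plan_projector_layers` and its tuple layout are hypothetical, the regex pattern is an assumption about how the depth is parsed, and raising `ValueError` for an unknown type is likewise assumed.

```python
import re


# Hypothetical helper: describes the layers instead of building nn.Module objects.
def plan_projector_layers(projector_type, mm_hidden_size, hidden_size):
    if projector_type == "linear":
        return [("linear", mm_hidden_size, hidden_size)]

    # Assumed pattern: "mlp" + depth digits + "x_gelu", e.g. "mlp2x_gelu".
    match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
    if match:
        depth = int(match.group(1))
        # First layer changes the width; later layers keep hidden_size.
        plan = [("linear", mm_hidden_size, hidden_size)]
        for _ in range(1, depth):
            plan.append(("gelu",))
            plan.append(("linear", hidden_size, hidden_size))
        return plan

    if projector_type == "identity":
        return [("identity",)]

    if projector_type == "linearclip":
        # Linear projection followed by clamping to a pre-loaded range.
        return [("linear", mm_hidden_size, hidden_size), ("range_clip",)]

    # Assumption: unknown types are rejected rather than silently ignored.
    raise ValueError(f"Unknown projector type: {projector_type}")


print(plan_projector_layers("mlp2x_gelu", 1024, 4096))
```

With this layout, a depth of N yields N linear entries and N-1 GELU entries, matching the "mlp2x_gelu" example in the list.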
SimpleResBlock is a residual block with LayerNorm pre-normalization and a two-layer MLP with GELU activation, available as a building block though not directly used by the factory function.
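A residual block of this shape might look like the sketch below. It assumes the skip connection adds the normalized input to the MLP output; the class name `ResBlockSketch` and its attribute names are illustrative and not taken from the source file.

```python
import torch
import torch.nn as nn


class ResBlockSketch(nn.Module):
    """Illustrative stand-in for SimpleResBlock; details are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        # Pre-normalization applied before the MLP.
        self.pre_norm = nn.LayerNorm(channels)
        # Two linear layers with a GELU in between; width stays at `channels`.
        self.proj = nn.Sequential(
            nn.Linear(channels, channels),
            nn.GELU(),
            nn.Linear(channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pre_norm(x)
        # Assumed residual form: normalized input plus MLP output.
        return x + self.proj(x)


block = ResBlockSketch(channels=64)
features = torch.randn(2, 10, 64)
out = block(features)
print(out.shape)
```

Because every layer keeps the channel width, the block preserves the input shape and can be stacked freely.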
Usage
Import build_vision_projector when initializing a LLaVA model to construct the mm_projector module. The config object must have mm_hidden_size and hidden_size attributes.
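Any object exposing these attributes should work, so a `types.SimpleNamespace` can stand in for a full model config during experiments. The sketch below is a hypothetical pre-flight check (`missing_projector_fields` is not part of the library) that reports which of the attributes described on this page are absent from a config.

```python
from types import SimpleNamespace


def missing_projector_fields(config):
    """Return the names of attributes the projector would need but config lacks."""
    required = ["mm_hidden_size", "hidden_size"]
    # The "linearclip" type additionally reads its clamp range from a file path.
    if getattr(config, "mm_projector_type", None) == "linearclip":
        required.append("min_max_range_path")
    return [name for name in required if not hasattr(config, name)]


config = SimpleNamespace(
    mm_projector_type="linearclip",
    mm_hidden_size=1024,
    hidden_size=4096,
)
print(missing_projector_fields(config))  # ['min_max_range_path']
```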
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/models/llava_base/multimodal_projector/builder.py
- Lines: 1-72
Signature
class IdentityMap(nn.Module):
    def forward(self, x, *args, **kwargs) -> torch.Tensor: ...
    @property
    def config(self) -> dict: ...

class SimpleResBlock(nn.Module):
    def __init__(self, channels: int): ...
    def forward(self, x: torch.Tensor) -> torch.Tensor: ...

def build_vision_projector(config, delay_load=False, **kwargs) -> nn.Module: ...
Import
from tinychat.models.llava_base.multimodal_projector.builder import build_vision_projector
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | object | Yes | Configuration with mm_projector_type, mm_hidden_size, hidden_size, and optionally min_max_range_path |
| delay_load | bool | No | Reserved for future use; currently unused |
Outputs
| Name | Type | Description |
|---|---|---|
| projector | nn.Module | A PyTorch module that maps vision features (mm_hidden_size) to the language embedding space (hidden_size); with the "identity" type, features pass through unchanged |
Usage Examples
Building a linear projector
from tinychat.models.llava_base.multimodal_projector.builder import build_vision_projector

class Config:
    mm_projector_type = "linear"
    mm_hidden_size = 1024
    hidden_size = 4096

projector = build_vision_projector(Config())
# projector is nn.Linear(1024, 4096)
Building a 2-layer MLP projector
class Config:
    mm_projector_type = "mlp2x_gelu"
    mm_hidden_size = 1024
    hidden_size = 4096

projector = build_vision_projector(Config())
# projector is nn.Sequential(Linear(1024, 4096), GELU(), Linear(4096, 4096))
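To see what such a module does to a batch of vision features, the sketch below builds the same layer stack by hand with plain PyTorch, so it runs without tinychat installed, and pushes a random tensor through it. The batch size and token count are arbitrary illustrative values, not figures from the source.

```python
import torch
import torch.nn as nn

# Hand-built equivalent of the "mlp2x_gelu" projector from the example above.
projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

# Shape is (batch, num_image_tokens, mm_hidden_size); 2 and 16 are arbitrary.
vision_features = torch.randn(2, 16, 1024)

with torch.no_grad():
    projected = projector(vision_features)

print(projected.shape)  # torch.Size([2, 16, 4096])
```

Only the last dimension changes, since nn.Linear acts on the trailing feature axis and leaves the batch and token dimensions untouched.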