Principle:Mit han lab Llm awq Vision Language Projection
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Model_Architecture |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Vision-language projection maps vision encoder features into a language model's embedding space through a learnable projector module.
Description
Vision-language projection bridges the dimensional mismatch between a vision encoder's output space and a language model's input embedding space. Several projector architectures are supported: a single linear projection, a multi-layer MLP with GELU activations, an identity mapping (usable only when the two dimensions already match), and a clamped linear projection that restricts outputs to a bounded range. When the vision encoder and the language model are both kept frozen, the projector is the only trainable component.
Usage
Apply this principle when connecting a frozen vision encoder to a frozen language model in a multimodal architecture.
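To give a sense of how small the projector is next to the frozen components, the sketch below counts its trainable parameters. The dimensions (d_v = 1024, d_l = 4096) and the choice of an MLP hidden width equal to d_l are illustrative assumptions, not values taken from this page.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Parameters in one affine layer: a (d_out x d_in) weight plus a d_out bias."""
    return d_in * d_out + d_out


def mlp_params(d_v: int, d_h: int, d_l: int) -> int:
    """Parameters in a two-layer MLP projector (GELU itself has no parameters)."""
    return linear_params(d_v, d_h) + linear_params(d_h, d_l)


# Illustrative sizes only (assumed, not from the source).
d_v, d_l = 1024, 4096

n_linear = linear_params(d_v, d_l)
n_mlp = mlp_params(d_v, d_h=d_l, d_l=d_l)  # hidden width d_h = d_l is an assumption

print(f"linear projector:        {n_linear:,} parameters")  # 4,198,400
print(f"two-layer MLP projector: {n_mlp:,} parameters")     # 20,979,712
```

Under these assumed sizes the projector holds millions of parameters, which is small compared with a language model that holds billions.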
Theoretical Basis
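Given a vision feature vector v of dimension d_v and a language embedding dimension d_l, each projector produces an output h of dimension d_l:
- Linear: h = W v + b, where W is (d_l x d_v) and b has dimension d_l
- MLP: h = W_2 GELU(W_1 v + b_1) + b_2, where W_1 is (d_h x d_v), W_2 is (d_l x d_h), and d_h is the hidden width
- Identity: h = v (valid only when d_v == d_l)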
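The formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the repository's implementation: the function names, the toy dimensions, the hidden width `d_h`, the tanh approximation of GELU, and the clamp bounds `lo` and `hi` are all assumptions made for the example.

```python
import math

import numpy as np


def gelu(x):
    """Tanh approximation of GELU (assumed here; the erf form works equally well)."""
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


def linear_project(v, W, b):
    """h = W v + b for every row of v, where v has shape (n_tokens, d_in)."""
    return v @ W.T + b


def mlp_project(v, W1, b1, W2, b2):
    """h = W_2 GELU(W_1 v + b_1) + b_2."""
    return linear_project(gelu(linear_project(v, W1, b1)), W2, b2)


def identity_project(v):
    """h = v; only meaningful when d_v == d_l."""
    return v


def clamped_linear_project(v, W, b, lo=-1.0, hi=1.0):
    """Linear projection followed by clamping; lo and hi are hypothetical bounds."""
    return np.clip(linear_project(v, W, b), lo, hi)


# Toy sizes chosen for readability, not taken from any real model.
rng = np.random.default_rng(0)
n_tokens, d_v, d_h, d_l = 4, 8, 16, 16

v = rng.standard_normal((n_tokens, d_v))       # one row per visual token
W = rng.standard_normal((d_l, d_v)) * 0.1      # (d_l x d_v)
b = np.zeros(d_l)
W1 = rng.standard_normal((d_h, d_v)) * 0.1     # (d_h x d_v)
b1 = np.zeros(d_h)
W2 = rng.standard_normal((d_l, d_h)) * 0.1     # (d_l x d_h)
b2 = np.zeros(d_l)

h_linear = linear_project(v, W, b)
h_mlp = mlp_project(v, W1, b1, W2, b2)
h_clamped = clamped_linear_project(v, W, b)
h_identity = identity_project(v)

print(h_linear.shape, h_mlp.shape, h_clamped.shape, h_identity.shape)
```

Every learnable variant maps a `(n_tokens, d_v)` array to `(n_tokens, d_l)`, so the projected visual tokens can be concatenated with text token embeddings along the sequence axis. The identity mapping leaves the shape at `(n_tokens, d_v)`, which is why it applies only when the two dimensions already agree.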