Principle:Mit han lab Llm awq Vision Language Projection

Knowledge Sources	LLaVA
Domains	Multimodal, Model_Architecture
Last Updated	2026-02-15 00:00 GMT

Overview

Vision-language projection maps vision encoder features into the language model's embedding space through a projector module, which is usually learnable.

Description

Vision-language projection bridges the dimensional mismatch between a vision encoder's output space and a language model's input embedding space. Multiple projector architectures are supported: simple linear projection, multi-layer MLP with GELU activations, identity mapping (when dimensions already match), and clamped linear projection for constrained output ranges. In LLaVA-style training, the projector is the only trainable component during the initial feature-alignment stage; the subsequent visual instruction tuning stage also updates the language model while the vision encoder stays frozen.
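The four projector variants listed above can be sketched as a small factory function. This is a minimal NumPy illustration, not the llm-awq implementation: the function name `build_projector`, the `kind` strings, the default `clamp_range`, the random initialization, and the choice of hidden width equal to d_l in the MLP are all assumptions made for the example. GELU is written with its common tanh approximation.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU (the exact form uses the Gaussian CDF).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def _linear_params(rng, d_in, d_out):
    # Hypothetical init: small random weights of shape (d_out, d_in), zero bias.
    W = rng.normal(0.0, 0.02, size=(d_out, d_in))
    b = np.zeros(d_out)
    return W, b


def build_projector(kind, d_v, d_l, clamp_range=(-1.0, 1.0), seed=0):
    """Return a function mapping (num_tokens, d_v) features to (num_tokens, d_l)."""
    rng = np.random.default_rng(seed)

    if kind == "identity":
        # Only valid when the two spaces already have the same width.
        if d_v != d_l:
            raise ValueError("identity projection requires d_v == d_l")
        return lambda v: v

    if kind == "linear":
        W, b = _linear_params(rng, d_v, d_l)
        return lambda v: v @ W.T + b

    if kind == "mlp":
        # Two layers; hidden width set to d_l (an assumption of this sketch).
        W1, b1 = _linear_params(rng, d_v, d_l)
        W2, b2 = _linear_params(rng, d_l, d_l)
        return lambda v: gelu(v @ W1.T + b1) @ W2.T + b2

    if kind == "clamped_linear":
        # Linear projection followed by clipping; the bounds are illustrative.
        W, b = _linear_params(rng, d_v, d_l)
        lo, hi = clamp_range
        return lambda v: np.clip(v @ W.T + b, lo, hi)

    raise ValueError(f"unknown projector kind: {kind!r}")


# Toy sizes: 4 visual tokens of width 8 projected to width 16.
features = np.random.default_rng(1).normal(size=(4, 8))
projected = build_projector("mlp", d_v=8, d_l=16)(features)
print(projected.shape)  # (4, 16)
```

Each visual token is projected independently, so the number of tokens is unchanged and only the feature width moves from d_v to d_l.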

Usage

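Apply this principle when connecting a pretrained vision encoder to a pretrained language model in a multimodal architecture, particularly when one or both backbones are kept frozen and the projector carries the alignment.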

Theoretical Basis

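Given a vision feature vector v of dimension d_v and a language embedding dimension d_l, the projector produces h of dimension d_l:

Linear: h = W v + b, where W is (d_l x d_v) and b has dimension d_l
MLP (two-layer case): h = W_2 GELU(W_1 v + b_1) + b_2, where W_1 is (d_h x d_v) and W_2 is (d_l x d_h) for a hidden width d_h
Identity: h = v (valid only when d_v == d_l)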
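As a worked example of these shapes, the number of trainable parameters in each projector follows directly from the matrix dimensions. The sketch below assumes illustrative sizes d_v = 1024 and d_l = 4096 (not taken from this page) and a two-layer MLP whose hidden width equals d_l; the helper names are hypothetical.

```python
def linear_params(d_in, d_out):
    # Weight matrix (d_out x d_in) plus bias vector (d_out).
    return d_out * d_in + d_out


def projector_params(kind, d_v, d_l):
    if kind == "identity":
        return 0  # h = v has nothing to learn
    if kind == "linear":
        return linear_params(d_v, d_l)
    if kind == "mlp":
        # Two layers, hidden width assumed equal to d_l.
        return linear_params(d_v, d_l) + linear_params(d_l, d_l)
    raise ValueError(f"unknown projector kind: {kind!r}")


d_v, d_l = 1024, 4096  # illustrative sizes only
print(projector_params("linear", d_v, d_l))  # 4198400
print(projector_params("mlp", d_v, d_l))     # 20979712
```

Under these assumptions the MLP projector is about five times larger than the linear one, with most of the difference coming from the (d_l x d_l) second layer.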

Related Pages

Implementation:Mit_han_lab_Llm_awq_Build_vision_projector
