Principle:Mit han lab Llm awq Vision Language Projection

From Leeroopedia
Knowledge Sources
Domains: Multimodal, Model_Architecture
Last Updated: 2026-02-15 00:00 GMT

Overview

Vision-language projection maps vision encoder features into the language model's embedding space using projector modules, most of which are learnable.

Description

Vision-language projection bridges the dimensional mismatch between a vision encoder's output space and a language model's input embedding space. Several projector architectures are supported: a single linear projection, a multi-layer MLP with GELU activations, an identity mapping (when the dimensions already match), and a clamped linear projection whose outputs are restricted to a bounded range. In LLaVA-style visual instruction tuning pipelines, the projector is typically the only trainable component during the initial feature-alignment stage, while the vision encoder and the language model stay frozen.
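The clamped variant is the least self-explanatory of the four, so here is a minimal sketch of one way it could work: a linear projection followed by element-wise clamping. The function name `clamped_linear_project` and the bounds `lo` and `hi` are hypothetical; the source does not state the actual range or where the clamp is applied.

```python
def clamped_linear_project(v, W, b, lo=-1.0, hi=1.0):
    """Linear projection h = W v + b, then clamp each entry to [lo, hi].

    lo and hi are illustrative defaults, not values taken from the source.
    """
    # W is a list of d_l rows, each of length d_v; b has length d_l.
    h = [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]
    return [min(max(x, lo), hi) for x in h]


# Toy example with d_v = 3 and d_l = 2.
v = [1.0, 2.0, 3.0]
W = [[0.25, 0.0, 0.0],
     [0.0, -1.0, 0.0]]
b = [0.0, 0.0]

h = clamped_linear_project(v, W, b)  # unclamped: [0.25, -2.0] -> clamped: [0.25, -1.0]
```

Only the second entry falls outside the assumed range, so only it is changed by the clamp.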

Usage

Apply this principle when connecting a frozen vision encoder to a frozen language model in a multimodal architecture.

Theoretical Basis

Given vision features v of dimension d_v and language embedding dimension d_l:

  • Linear: h = W v + b, where W is (d_l × d_v)
  • MLP (two-layer case): h = W_2 GELU(W_1 v + b_1) + b_2
  • Identity: h = v (when d_v == d_l)
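The three formulas above can be sketched in plain Python to make the arithmetic and shapes explicit. This is an illustrative sketch, not the repository's implementation: the function names are hypothetical, the exact (erf-based) GELU is assumed, and the MLP hidden width is whatever the shape of `W1` implies.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear_project(v, W, b):
    """h = W v + b, with W given as d_l rows of length d_v."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def mlp_project(v, W1, b1, W2, b2):
    """h = W2 GELU(W1 v + b1) + b2 (two-layer case)."""
    hidden = [gelu(x) for x in linear_project(v, W1, b1)]
    return linear_project(hidden, W2, b2)


def identity_project(v, d_l):
    """h = v, valid only when d_v == d_l."""
    if len(v) != d_l:
        raise ValueError(f"identity projector needs d_v == d_l, got {len(v)} and {d_l}")
    return list(v)


# Toy example with d_v = 3 and d_l = 2.
v = [1.0, 2.0, 3.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
b = [0.5, -0.5]

h_linear = linear_project(v, W, b)  # [1.5, 4.5]

# Reuse W and b as the first layer; the second layer is a 2 x 2 identity with zero bias,
# so the MLP output is just GELU applied to the linear output.
W2 = [[1.0, 0.0],
      [0.0, 1.0]]
b2 = [0.0, 0.0]
h_mlp = mlp_project(v, W, b, W2, b2)

h_identity = identity_project(h_linear, d_l=2)
```

In each case the output has length d_l, which is what lets the projected features be concatenated with the language model's token embeddings.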
