Principle:Mit han lab Llm awq LLaVA Multimodal Architecture

From Leeroopedia
Revision as of 17:33, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Mit_han_lab_Llm_awq_LLaVA_Multimodal_Architecture.md)
Knowledge Sources
Domains Multimodal, Model_Architecture
Last Updated 2026-02-15 00:00 GMT

Overview

This principle describes how a CLIP vision encoder is combined with a language model through a multimodal projector to create a visual instruction-following model.

Description

The LLaVA (Large Language and Vision Assistant) architecture has three components: a pre-trained vision encoder (typically CLIP ViT) that extracts image features, a multimodal projector (a linear layer or an MLP) that maps those features into the language model's embedding space, and a pre-trained language model (for example LLaMA) that generates text. The mixin pattern (LlavaMetaModel + LlavaMetaForCausalLM) keeps the multimodal logic separate from the language backbone, so the same code can be reused across different LLM backends (LLaMA, Qwen2). Special image tokens in the text sequence mark where the visual features are injected.
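The mixin idea can be illustrated with a minimal, dependency-free sketch. Everything here is hypothetical: MultimodalMixin, ToyBackboneA, ToyBackboneB, and the toy encoder and projector are stand-ins chosen for illustration, not the classes from the llm-awq or LLaVA code. The sketch only shows how one mixin can supply image encoding to backbones with different hidden sizes.

```python
from typing import List

Vector = List[float]


class MultimodalMixin:
    """Adds image encoding to any backbone that defines `hidden_size`.

    Hypothetical stand-in for the role the text assigns to LlavaMetaModel.
    """

    vision_dim = 2  # toy width of the vision encoder output (assumption)

    def vision_tower(self, image: List[Vector]) -> List[Vector]:
        # Toy encoder: treat each row of the "image" as one patch feature.
        return [row[: self.vision_dim] for row in image]

    def mm_projector(self, feature: Vector) -> Vector:
        # Toy linear map: tile the feature and truncate it to hidden_size.
        repeats = -(-self.hidden_size // len(feature))  # ceiling division
        return (feature * repeats)[: self.hidden_size]

    def encode_image(self, image: List[Vector]) -> List[Vector]:
        # Encode, then project every patch into the backbone's embedding width.
        return [self.mm_projector(f) for f in self.vision_tower(image)]


class ToyBackboneA:
    hidden_size = 4  # stands in for one LLM backend


class ToyBackboneB:
    hidden_size = 6  # stands in for a different LLM backend


class ToyLlavaA(MultimodalMixin, ToyBackboneA):
    pass


class ToyLlavaB(MultimodalMixin, ToyBackboneB):
    pass


image = [[0.1, 0.2], [0.3, 0.4]]  # two "patches"
features_a = ToyLlavaA().encode_image(image)
features_b = ToyLlavaB().encode_image(image)
print(len(features_a[0]), len(features_b[0]))  # 4 6
```

The mixin never names a concrete backbone; it only relies on the `hidden_size` attribute, which is what lets the same multimodal code sit on top of either toy backend.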

Usage

Apply this principle when building visual instruction-following models using the LLaVA architecture pattern with a CLIP encoder and a language model.

Theoretical Basis

Pseudo-code:

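# Abstract algorithm
# 1. Extract patch features from the image with the vision encoder (CLIP ViT)
image_features = vision_tower(image)
# 2. Map the image features into the language model's embedding space
projected_features = mm_projector(image_features)
# 3. Embed the text tokens
text_embeddings = llm.embed(input_ids)
# 4. Replace IMAGE_TOKEN positions with projected_features
combined = interleave(text_embeddings, projected_features)
# 5. Run the language model on the combined sequence
output = llm(combined)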
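The replacement step can be made concrete with a small runnable sketch. It assumes a single image per sequence, plain Python lists in place of tensors, and a hypothetical placeholder id IMAGE_TOKEN_ID; the function name splice_image_features and the toy embedding are illustrative, not taken from the source code.

```python
from typing import Callable, List

Vector = List[float]

IMAGE_TOKEN_ID = -200  # hypothetical placeholder id used only in this sketch


def splice_image_features(
    input_ids: List[int],
    embed: Callable[[int], Vector],
    projected_features: List[Vector],
) -> List[Vector]:
    """Embed text tokens and expand each image placeholder into patch vectors."""
    combined: List[Vector] = []
    for token_id in input_ids:
        if token_id == IMAGE_TOKEN_ID:
            # One placeholder expands into every projected patch vector.
            combined.extend(projected_features)
        else:
            combined.append(embed(token_id))
    return combined


def toy_embed(token_id: int) -> Vector:
    # Toy 2-dimensional embedding standing in for the LLM embedding table.
    return [float(token_id), 0.0]


# One image with three projected patches, placed between two text tokens.
patches = [[9.0, 9.0], [8.0, 8.0], [7.0, 7.0]]
sequence = splice_image_features([1, IMAGE_TOKEN_ID, 2], toy_embed, patches)
print(len(sequence))  # 5
```

Note that the combined sequence is longer than input_ids: a single placeholder token becomes as many positions as there are image patches, so anything aligned to token positions (attention mask, labels) would need the same expansion.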
