Principle:Mit han lab Llm awq LLaVA Multimodal Architecture

Knowledge Sources	LLaVA
Domains	Multimodal, Model_Architecture
Last Updated	2026-02-15 00:00 GMT

Overview

LLaVA combines a CLIP vision encoder with a language model through a multimodal projector, producing a model that follows instructions about images.

Description

The LLaVA (Large Language and Vision Assistant) architecture has three components: a pre-trained vision encoder (typically CLIP ViT) that extracts image features, a multimodal projector (a linear layer or an MLP) that maps those features into the language model's embedding space, and a pre-trained language model (for example LLaMA) that generates text. A mixin pattern (LlavaMetaModel + LlavaMetaForCausalLM) keeps the vision-specific logic separate from the language backbone, so the same code can be reused across LLM backends such as LLaMA and Qwen2. Special image tokens in the input mark the positions where the projected visual features are injected into the text sequence.
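
The mixin idea can be illustrated with a toy, dependency-free sketch. Everything here is hypothetical and chosen for illustration: VisionMixin stands in for the role the description assigns to LlavaMetaModel, the two backbone classes are placeholders rather than real LLaMA or Qwen2 models, and the "encoder" and "projector" are plain-list arithmetic rather than CLIP or a trained layer.

```python
class VisionMixin:
    """Hypothetical stand-in for the vision-side mixin (not the real LlavaMetaModel)."""

    def init_vision(self, feature_dim, hidden_dim):
        # Toy projector: a fixed feature_dim x hidden_dim weight matrix.
        self.mm_projector = [
            [0.1 * (i + j) for j in range(hidden_dim)] for i in range(feature_dim)
        ]

    def vision_tower(self, image):
        # Toy encoder: treat each row of the "image" as one patch feature vector.
        return [list(row) for row in image]

    def encode_images(self, image):
        # Encode, then project every patch into the backbone's hidden size.
        weights = self.mm_projector
        out_dim = len(weights[0])
        return [
            [sum(f[i] * weights[i][j] for i in range(len(f))) for j in range(out_dim)]
            for f in self.vision_tower(image)
        ]


class ToyLlamaBackbone:
    """Placeholder backbone with a small hidden size (assumption)."""
    hidden_dim = 4


class ToyQwen2Backbone:
    """Placeholder backbone with a different hidden size (assumption)."""
    hidden_dim = 6


class ToyLlavaLlama(VisionMixin, ToyLlamaBackbone):
    def __init__(self, feature_dim):
        self.init_vision(feature_dim, self.hidden_dim)


class ToyLlavaQwen2(VisionMixin, ToyQwen2Backbone):
    def __init__(self, feature_dim):
        self.init_vision(feature_dim, self.hidden_dim)


# Three "patches", each with two features.
image = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
llama_feats = ToyLlavaLlama(feature_dim=2).encode_images(image)
qwen_feats = ToyLlavaQwen2(feature_dim=2).encode_images(image)
```

The vision logic is written once in the mixin; each combined class only supplies the hidden size of its backbone, so the projected features match whichever language model they are paired with.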

Usage

Apply this principle when building visual instruction-following models using the LLaVA architecture pattern with a CLIP encoder and a language model.

Theoretical Basis

Pseudo-code:

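# Abstract algorithm (pseudo-code, not runnable as-is)
image_features = vision_tower(image)                 # patch features from the vision encoder
projected_features = mm_projector(image_features)    # map features into the LLM embedding space
text_embeddings = llm.embed(input_ids)               # token embeddings for the text prompt
# Replace each IMAGE_TOKEN position with the projected image features
combined = interleave(text_embeddings, projected_features)
output = llm(combined)                               # generate from the combined sequence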
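
The replacement step can be made concrete with a minimal runnable sketch in plain Python. It is an illustration under stated assumptions, not the project's implementation: the sentinel value IMAGE_TOKEN, the helper splice_image_features, and the tiny embedding table are all hypothetical names and data introduced here.

```python
IMAGE_TOKEN = -200  # sentinel id marking the image position (value assumed for this sketch)


def splice_image_features(input_ids, embedding_table, image_features,
                          image_token=IMAGE_TOKEN):
    """Build the combined sequence: text embeddings with image features
    inserted wherever the image token appears."""
    combined = []
    for token_id in input_ids:
        if token_id == image_token:
            # One placeholder token expands into all projected patch vectors.
            combined.extend(image_features)
        else:
            combined.append(embedding_table[token_id])
    return combined


# Toy vocabulary of three tokens with 2-dimensional embeddings.
embedding_table = {0: [0.0, 0.0], 1: [1.0, 1.0], 2: [2.0, 2.0]}
# Two already-projected patch vectors of the same width as the embeddings.
projected_features = [[9.0, 9.0], [8.0, 8.0]]

input_ids = [1, IMAGE_TOKEN, 2]
combined = splice_image_features(input_ids, embedding_table, projected_features)
```

A single image token expands into as many vectors as there are image patches, so the combined sequence is longer than input_ids; any attention mask or labels would need the same expansion.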
