Principle:Mit han lab Llm awq LLaVA Multimodal Architecture
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Model_Architecture |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
LLaVA combines a CLIP vision encoder with a language model through a multimodal projector to create a visual instruction-following model.
Description
The LLaVA (Large Language and Vision Assistant) architecture has three components: a pre-trained vision encoder (typically CLIP ViT) that extracts image features, a multimodal projector (linear or MLP) that maps those features into the language model's embedding space, and a pre-trained language model (for example LLaMA) that generates text. A mixin pattern (LlavaMetaModel + LlavaMetaForCausalLM) keeps the vision-specific logic separate from the language model, so it can be reused across different LLM backends (LLaMA, Qwen2). Special image tokens in the input mark where the projected visual features are injected into the text sequence.
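The mixin idea can be illustrated with a minimal, dependency-free sketch. Every class name below (VisionMixin, ToyLlamaBackbone, ToyQwen2Backbone, ToyLlavaLlama, ToyLlavaQwen2) is hypothetical and chosen for illustration only; this is not the repository's LlavaMetaModel or LlavaMetaForCausalLM code, and the "projector" is a toy averaging matrix rather than a trained layer.

```python
from typing import List

Vector = List[float]


class VisionMixin:
    """Hypothetical mixin: adds image encoding to any backbone exposing `hidden_size`."""

    hidden_size: int  # supplied by the backbone class this mixin is combined with

    def init_vision(self, vision_dim: int) -> None:
        # Toy linear projector with shape (hidden_size, vision_dim).
        # Each row averages the vision feature; a real projector would be learned.
        self.projector = [[1.0 / vision_dim] * vision_dim
                          for _ in range(self.hidden_size)]

    def encode_images(self, patch_features: List[Vector]) -> List[Vector]:
        # Map every patch feature from vision_dim to the backbone's hidden_size.
        return [[sum(x * w for x, w in zip(patch, row)) for row in self.projector]
                for patch in patch_features]


class ToyLlamaBackbone:
    hidden_size = 4  # assumed width, for illustration only


class ToyQwen2Backbone:
    hidden_size = 6  # assumed width, for illustration only


# The same vision logic is reused unchanged across two different backbones.
class ToyLlavaLlama(VisionMixin, ToyLlamaBackbone):
    pass


class ToyLlavaQwen2(VisionMixin, ToyQwen2Backbone):
    pass


llama_model = ToyLlavaLlama()
llama_model.init_vision(vision_dim=2)
qwen_model = ToyLlavaQwen2()
qwen_model.init_vision(vision_dim=2)

llama_out = llama_model.encode_images([[1.0, 3.0]])
qwen_out = qwen_model.encode_images([[1.0, 3.0]])
```

The point of the design is that the vision code only depends on an attribute the backbone provides (here, hidden_size), so adding a new backend requires one new subclass rather than a copy of the image-handling logic.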
Usage
Apply this principle when building visual instruction-following models using the LLaVA architecture pattern with a CLIP encoder and a language model.
Theoretical Basis
Pseudo-code:
# Abstract algorithm
image_features = vision_tower(image)
projected_features = mm_projector(image_features)
text_embeddings = llm.embed(input_ids)
# Replace each IMAGE_TOKEN position with the projected image features
combined = splice(text_embeddings, projected_features, at=IMAGE_TOKEN)
output = llm(inputs_embeds=combined)
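The steps above can be made concrete with a small runnable sketch in plain Python. It is an illustration under stated assumptions, not the repository's implementation: the helper names (project, splice_image_features), the tiny embedding table, the identity-like projector weights, and the placeholder id IMAGE_TOKEN_ID = -200 are all assumed for this example. The language model call itself is omitted, since only the embedding-level splice is being shown.

```python
from typing import Dict, List

Vector = List[float]

IMAGE_TOKEN_ID = -200  # assumed placeholder id marking where the image goes


def project(features: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Toy linear projector: weight has shape (text_dim, vision_dim)."""
    return [[sum(f * w for f, w in zip(feat, row)) for row in weight]
            for feat in features]


def splice_image_features(input_ids: List[int],
                          embed_table: Dict[int, Vector],
                          image_features: List[Vector]) -> List[Vector]:
    """Build the embedding sequence, expanding each image placeholder
    into the full list of projected patch features."""
    combined: List[Vector] = []
    for token_id in input_ids:
        if token_id == IMAGE_TOKEN_ID:
            combined.extend(image_features)  # one placeholder -> many patches
        else:
            combined.append(embed_table[token_id])
    return combined


# Hypothetical toy data: text width 2, vision width 3, two image patches.
embed_table = {1: [1.0, 1.0], 2: [2.0, 2.0]}
vision_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
projector_weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

projected = project(vision_features, projector_weight)
combined = splice_image_features([1, IMAGE_TOKEN_ID, 2], embed_table, projected)
```

Note that the combined sequence is longer than the token sequence: a single placeholder expands into one embedding per image patch, so three input tokens and two patches yield four embeddings here.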