Principle:OpenGVLab InternVL LLaVA Multimodal Architecture

Knowledge Sources	OpenGVLab_InternVL
Domains	Multimodal Models, Vision-Language, LLaVA
Last Updated	2026-02-07 14:00 GMT

Overview

The LLaVA multimodal architecture principle defines a mixin-based approach for injecting vision capabilities into pretrained language models, enabling any causal LM to process interleaved image and text inputs.

Description

The LLaVA (Large Language-and-Vision Assistant) architecture uses mixin classes to add multimodal capabilities to existing language models without modifying their core architecture. Two mixin classes form the pattern:

Model mixin (LlavaMetaModel): Adds a vision tower (image encoder) and mm_projector (vision-to-language projection) to the base model. Provides initialization and weight-loading methods.

CausalLM mixin (LlavaMetaForCausalLM): Adds three things: image encoding, which runs the vision tower and then the projector; the input preparation method, which replaces image placeholder tokens with the encoded image features in the embedding space; and tokenizer initialization for the special image tokens.

The core multimodal fusion operates in five steps: (1) encoding images through a frozen vision tower, (2) projecting the resulting features to the language model's hidden dimension, (3) embedding the text tokens and splicing the projected image features in at each IMAGE_TOKEN_INDEX placeholder position, (4) setting the training labels at image positions to IGNORE_INDEX so that no loss is computed on image tokens, and (5) re-padding the batch, since sequences grow by different amounts once placeholders are expanded into image features.
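Steps (3) to (5) can be sketched in plain Python. This is a minimal illustration using lists rather than tensors, not the repository's implementation: the helper names (embed_token, splice_image_features, pad_batch), the toy embedding, and the numeric values chosen for IMAGE_TOKEN_INDEX and IGNORE_INDEX are all assumptions made for the example.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id in input_ids (value assumed for this sketch)
IGNORE_INDEX = -100       # label value the loss skips (value assumed for this sketch)


def embed_token(token_id, dim):
    """Toy stand-in for the language model's embedding table (hypothetical)."""
    return [float(token_id)] * dim


def splice_image_features(input_ids, labels, image_features, dim):
    """Expand each image placeholder into that image's projected patch vectors."""
    embeds, new_labels = [], []
    images = iter(image_features)
    for token_id, label in zip(input_ids, labels):
        if token_id == IMAGE_TOKEN_INDEX:
            feats = next(images)  # list of patch vectors, each of length dim
            embeds.extend(feats)
            # Image positions never contribute to the loss.
            new_labels.extend([IGNORE_INDEX] * len(feats))
        else:
            embeds.append(embed_token(token_id, dim))
            new_labels.append(label)
    return embeds, new_labels


def pad_batch(batch_embeds, batch_labels, dim):
    """Right-pad every sequence to the longest one and build an attention mask."""
    max_len = max(len(e) for e in batch_embeds)
    padded_embeds, padded_labels, attention_mask = [], [], []
    for embeds, labels in zip(batch_embeds, batch_labels):
        pad = max_len - len(embeds)
        padded_embeds.append(embeds + [[0.0] * dim for _ in range(pad)])
        padded_labels.append(labels + [IGNORE_INDEX] * pad)
        attention_mask.append([1] * len(embeds) + [0] * pad)
    return padded_embeds, padded_labels, attention_mask


# Usage: one sample with an image (3 projected patches), one text-only sample.
dim = 2
image = [[0.5, 0.5], [0.6, 0.6], [0.7, 0.7]]
ids_with_image = [1, IMAGE_TOKEN_INDEX, 7]
e1, l1 = splice_image_features(ids_with_image, ids_with_image, [image], dim)
e2, l2 = splice_image_features([1, 7], [1, 7], [], dim)
embeds, labels, mask = pad_batch([e1, e2], [l1, l2], dim)
```

In this sketch the first sample grows from 3 tokens to 5 positions while the second stays at 2, which is why padding has to happen after the splice rather than before it.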

This mixin pattern allows the same multimodal logic to be composed with any causal language model (LLaMA, MPT, etc.) through multiple inheritance.
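The composition can be illustrated with a toy example. The sketch below collapses the two mixins into a single hypothetical VisionMixin for brevity, and every class and function in it (ToyLlama, ToyMpt, toy_vision_tower, make_projector) is invented for illustration. It shows only how multiple inheritance lets one body of vision logic sit on top of different backbones.

```python
class VisionMixin:
    """Hypothetical mixin: owns a vision tower and a projector, nothing else."""

    def initialize_vision_modules(self, vision_tower, mm_projector):
        self.vision_tower = vision_tower
        self.mm_projector = mm_projector

    def encode_images(self, images):
        # Vision tower first, then projection into the language model dimension.
        return [self.mm_projector(self.vision_tower(img)) for img in images]


class ToyLlama:
    """Toy text-only backbone (stand-in for a LLaMA-style model)."""

    def __init__(self, hidden_size):
        self.hidden_size = hidden_size


class ToyMpt:
    """Second toy backbone (stand-in for an MPT-style model)."""

    def __init__(self, hidden_size):
        self.hidden_size = hidden_size


# The same mixin is composed with either backbone; neither backbone is modified.
class ToyLlavaLlama(VisionMixin, ToyLlama):
    pass


class ToyLlavaMpt(VisionMixin, ToyMpt):
    pass


def toy_vision_tower(image):
    """Toy encoder: one 1-dimensional feature per input value."""
    return [[float(p)] for p in image]


def make_projector(hidden_size):
    """Toy projector: broadcasts each 1-dimensional feature to hidden_size."""
    return lambda patches: [[p[0]] * hidden_size for p in patches]


model = ToyLlavaLlama(hidden_size=4)
model.initialize_vision_modules(toy_vision_tower, make_projector(model.hidden_size))
features = model.encode_images([[1, 2, 3]])
```

Because the mixin defines no __init__ of its own, construction falls through to the backbone class, so each composed model keeps its backbone's constructor signature.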

Usage

Apply this principle when building multimodal language models that need to process both text and image inputs, particularly when supporting multiple language model backends.

Theoretical Basis

The LLaVA architecture is based on the paper "Visual Instruction Tuning" (Liu et al., 2023), which proposes a simple yet effective approach to multimodal instruction following by connecting a vision encoder to a language model through a projection layer. The mixin design pattern enables model-agnostic multimodal capabilities.
