Implementation:OpenGVLab InternVL LlavaMetaModel LlavaMetaForCausalLM

Knowledge Sources	OpenGVLab_InternVL
Domains	Multimodal Models, Vision-Language, LLaVA
Last Updated	2026-02-07 14:00 GMT

Overview

Core multimodal architecture mixin classes that inject vision capabilities into language models, defining how visual information is fused with text tokens in the LLaVA architecture.

Description

This module provides two mixin classes that form the architectural heart of LLaVA.

LlavaMetaModel is a mixin for the base model that initializes the vision_tower (via build_vision_tower) and mm_projector (via build_vision_projector) when the config contains an mm_vision_tower attribute. Its initialize_vision_modules() method builds the vision tower and projector from model arguments, optionally loads pretrained projector weights (including vision tower position embeddings), and stores vision configuration in the model config.
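
The conditional construction described above can be sketched roughly as follows. This is a minimal, framework-free illustration, not the repository code: ToyBase, ToyMetaModel, ToyModel, and the two *_stub builders are hypothetical stand-ins (the real build_vision_tower and build_vision_projector return neural network modules, not strings).

```python
from types import SimpleNamespace


def build_vision_tower_stub(config):
    # Hypothetical stand-in for build_vision_tower; the real one returns a module.
    return f"tower({config.mm_vision_tower})"


def build_vision_projector_stub(config):
    # Hypothetical stand-in for build_vision_projector.
    return "projector"


class ToyBase:
    # Stand-in for a language-model base class such as LlamaModel.
    def __init__(self, config):
        self.config = config


class ToyMetaModel:
    def __init__(self, config):
        super().__init__(config)
        # Vision modules are only created when the config names a vision tower.
        if hasattr(config, "mm_vision_tower"):
            self.vision_tower = build_vision_tower_stub(config)
            self.mm_projector = build_vision_projector_stub(config)

    def get_vision_tower(self):
        # Returns None for a text-only config.
        return getattr(self, "vision_tower", None)


class ToyModel(ToyMetaModel, ToyBase):
    pass


text_only = ToyModel(SimpleNamespace())
multimodal = ToyModel(SimpleNamespace(mm_vision_tower="some-vision-encoder"))
```

Listing the mixin before the base class lets its __init__ run first and then hand the config to the base class through super(), so a text-only checkpoint loads without any vision modules attached.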

LlavaMetaForCausalLM is an abstract mixin for causal LM models. Its encode_images() method passes images through the vision tower and projector to produce language-compatible features. The critical prepare_inputs_labels_for_multimodal() method performs the core multimodal fusion: it locates IMAGE_TOKEN_INDEX placeholders in input_ids, replaces them with encoded image features, masks image positions in labels with IGNORE_INDEX, handles variable-length sequences through padding and alignment, and supports both single and batched multi-image inputs. It also handles the mm_use_im_start_end mode where special start/end tokens surround image features. The initialize_vision_tokenizer() method adds special image tokens to the tokenizer and initializes their embeddings.
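
The placeholder replacement and label masking can be illustrated for a single sequence with plain Python lists in place of tensors. This is a simplified sketch under stated assumptions, not the method's actual implementation: splice_image_features and toy_embed are hypothetical names, the constant values -200 and -100 are assumed, and batching, padding, and the mm_use_im_start_end mode are left out.

```python
IMAGE_TOKEN_INDEX = -200  # assumed placeholder id
IGNORE_INDEX = -100       # assumed loss-masking label


def splice_image_features(input_ids, labels, image_features, embed):
    """Replace each image placeholder with that image's feature rows.

    input_ids, labels: lists of ints for one sequence.
    image_features: list of per-image feature lists, consumed in order.
    embed: callable mapping a token id to its embedding vector.
    """
    new_embeds, new_labels = [], []
    image_iter = iter(image_features)
    for token_id, label in zip(input_ids, labels):
        if token_id == IMAGE_TOKEN_INDEX:
            feats = next(image_iter)
            new_embeds.extend(feats)
            # Image positions never contribute to the loss.
            new_labels.extend([IGNORE_INDEX] * len(feats))
        else:
            new_embeds.append(embed(token_id))
            new_labels.append(label)
    return new_embeds, new_labels


def toy_embed(token_id):
    # Hypothetical 2-dimensional embedding used only for this demo.
    return [float(token_id), 0.0]


ids = [1, IMAGE_TOKEN_INDEX, 5, 6]
labels = [IGNORE_INDEX, IGNORE_INDEX, 5, 6]
patches = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]]  # one image, three patch features
embeds, out_labels = splice_image_features(ids, labels, patches, toy_embed)
```

One placeholder token expands into as many positions as the image has features, which is why the output sequence (6 positions here) is longer than the input (4 tokens) and why the attention mask and labels must be rebuilt alongside the embeddings.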

Usage

All LLaVA model variants (LLaMA-based, MPT-based) inherit these mixin classes to gain multimodal capabilities. The mixins are not instantiated on their own; they supply the multimodal logic shared across variants.
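
The inheritance contract can be shown with a toy abstract mixin. The class names below (ToyMetaForCausalLM, ToyBaseModel, ToyLlamaVariant) are hypothetical and only mirror the shape of the real classes: the mixin declares get_model() as abstract, and each concrete variant supplies it.

```python
from abc import ABC, abstractmethod


class ToyMetaForCausalLM(ABC):
    @abstractmethod
    def get_model(self):
        """Concrete classes return the wrapped base model."""

    def get_vision_tower(self):
        # Shared logic delegates to whatever get_model() returns.
        return self.get_model().get_vision_tower()


class ToyBaseModel:
    def get_vision_tower(self):
        return "vision-tower"


class ToyLlamaVariant(ToyMetaForCausalLM):
    def __init__(self):
        self.model = ToyBaseModel()

    def get_model(self):
        return self.model


variant = ToyLlamaVariant()

# The abstract mixin cannot be instantiated by itself.
try:
    ToyMetaForCausalLM()
    direct_instantiation_failed = False
except TypeError:
    direct_instantiation_failed = True
```

Because the shared methods reach the base model only through get_model(), the same mixin code can serve variants whose underlying language models differ.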

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/model/llava_arch.py
Lines: 1-265

Signature

class LlavaMetaModel:
    def __init__(self, config): ...
    def get_vision_tower(self): ...
    def initialize_vision_modules(self, model_args, fsdp=None): ...

class LlavaMetaForCausalLM(ABC):
    @abstractmethod
    def get_model(self): ...
    def get_vision_tower(self): ...
    def encode_images(self, images): ...
    def prepare_inputs_labels_for_multimodal(
        self, input_ids, attention_mask, past_key_values, labels, images): ...
    def initialize_vision_tokenizer(self, model_args, tokenizer): ...

Import

from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM

I/O Contract

Inputs (prepare_inputs_labels_for_multimodal)

Name	Type	Required	Description
input_ids	torch.LongTensor	Yes	Token IDs containing IMAGE_TOKEN_INDEX placeholders
attention_mask	torch.Tensor	Yes	Attention mask for the input sequence
past_key_values	Tuple	No	Cached key/value states for generation
labels	torch.Tensor	No	Target labels for training
images	torch.Tensor or List[torch.Tensor]	No	Image tensors or list of image tensors

Outputs (prepare_inputs_labels_for_multimodal)

Name	Type	Description
input_ids	None	None when image features are fused (replaced by inputs_embeds)
attention_mask	torch.Tensor	Updated attention mask accounting for image tokens
past_key_values	Tuple	Passed through unchanged
inputs_embeds	torch.Tensor	Combined text and image embeddings
labels	torch.Tensor	Updated labels with IGNORE_INDEX at image positions
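
Since each image placeholder expands into many positions, fused sequences in a batch usually end up with different lengths. The sketch below shows one way the outputs in the table could be aligned. It assumes right padding and an IGNORE_INDEX of -100; pad_batch is a hypothetical helper, plain lists stand in for tensors, and the padding side in the actual code may differ.

```python
IGNORE_INDEX = -100  # assumed loss-masking label


def pad_batch(embeds_batch, labels_batch, pad_vector):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in embeds_batch)
    padded_embeds, padded_labels, attention_mask = [], [], []
    for embeds, labels in zip(embeds_batch, labels_batch):
        pad = max_len - len(embeds)
        padded_embeds.append(embeds + [pad_vector] * pad)
        # Padded positions are excluded from the loss and from attention.
        padded_labels.append(labels + [IGNORE_INDEX] * pad)
        attention_mask.append([1] * len(embeds) + [0] * pad)
    return padded_embeds, padded_labels, attention_mask


embeds_batch = [[[1.0], [2.0], [3.0]], [[4.0]]]
labels_batch = [[IGNORE_INDEX, 7, 8], [9]]
inputs_embeds, labels, attention_mask = pad_batch(embeds_batch, labels_batch, [0.0])
```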

Usage Examples

Basic Usage

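# Typically used via inheritance, not directly.
# LlavaConfig is assumed to be defined alongside these classes.
from transformers import LlamaForCausalLM, LlamaModel
from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM

class LlavaLlamaModel(LlavaMetaModel, LlamaModel):
    config_class = LlavaConfig

class LlavaLlamaForCausalLM(LlamaForCausalLM, LlavaMetaForCausalLM):
    def get_model(self):
        # Required by the abstract mixin: return the wrapped base model.
        return self.model

    def forward(self, input_ids=None, attention_mask=None, past_key_values=None,
                labels=None, images=None, **kwargs):
        # Fuse image features into the token embeddings before the LM forward pass.
        input_ids, attention_mask, past_key_values, inputs_embeds, labels = \
            self.prepare_inputs_labels_for_multimodal(
                input_ids, attention_mask, past_key_values, labels, images)
        return super().forward(
            input_ids=input_ids, attention_mask=attention_mask,
            past_key_values=past_key_values, inputs_embeds=inputs_embeds,
            labels=labels, **kwargs)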

Related Pages

Principle:OpenGVLab_InternVL_LLaVA_Multimodal_Architecture
