Implementation:OpenGVLab InternVL LlavaMetaModel LlavaMetaForCausalLM
| Knowledge Sources | |
|---|---|
| Domains | Multimodal Models, Vision-Language, LLaVA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Core multimodal architecture mixin classes that inject vision capabilities into language models, defining how visual information is fused with text tokens in the LLaVA architecture.
Description
This module provides two mixin classes that form the architectural heart of LLaVA.
LlavaMetaModel is a mixin for the base model that initializes the vision_tower (via build_vision_tower) and mm_projector (via build_vision_projector) when the config contains an mm_vision_tower attribute. Its initialize_vision_modules() method builds the vision tower and projector from model arguments, optionally loads pretrained projector weights (including vision tower position embeddings), and stores vision configuration in the model config.
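The conditional construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: `build_vision_tower_stub` and `build_vision_projector_stub` are hypothetical stand-ins that return placeholder strings instead of modules, and `BaseModel`, `MetaModelSketch`, and `SketchModel` are invented names standing in for the language model, the mixin, and the combined class.

```python
from types import SimpleNamespace


def build_vision_tower_stub(config):
    # Hypothetical stand-in for build_vision_tower; returns a label, not a module.
    return f"tower:{config.mm_vision_tower}"


def build_vision_projector_stub(config):
    # Hypothetical stand-in for build_vision_projector.
    return "projector"


class BaseModel:
    """Placeholder for the underlying language model class."""

    def __init__(self, config):
        self.config = config


class MetaModelSketch:
    """Mixin that attaches vision modules only when the config asks for them."""

    def __init__(self, config):
        super().__init__(config)  # cooperates with the language model's __init__
        if hasattr(config, "mm_vision_tower"):
            self.vision_tower = build_vision_tower_stub(config)
            self.mm_projector = build_vision_projector_stub(config)

    def get_vision_tower(self):
        # Text-only models never set the attribute, so fall back to None.
        return getattr(self, "vision_tower", None)


class SketchModel(MetaModelSketch, BaseModel):
    pass


text_only = SketchModel(SimpleNamespace(hidden_size=8))
multimodal = SketchModel(SimpleNamespace(hidden_size=8, mm_vision_tower="clip-stub"))
```

Listing the mixin first in the base-class list lets its `__init__` run before delegating to the language model through `super()`, so the same mixin can wrap different base models.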
LlavaMetaForCausalLM is an abstract mixin for causal LM models. Its encode_images() method passes images through the vision tower and then the projector to produce features in the language model's embedding space. The critical prepare_inputs_labels_for_multimodal() method performs the core multimodal fusion: it locates IMAGE_TOKEN_INDEX placeholders in input_ids, replaces each with the encoded image features, sets the labels at image positions to IGNORE_INDEX so they are excluded from the loss, pads the resulting variable-length sequences to a common length while keeping attention_mask and labels aligned, and supports both single-image and batched multi-image inputs. It also handles the mm_use_im_start_end mode, in which special start/end tokens surround the image features. The initialize_vision_tokenizer() method adds the special image tokens to the tokenizer and initializes their embeddings.
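The placeholder replacement for a single sequence can be sketched in plain Python. This is an illustration of the idea only, not the module's tensor-based implementation: the constant values, `embed_token`, and `splice_image_features` are assumptions introduced for this example, and embeddings are modeled as lists of floats rather than tensors.

```python
IMAGE_TOKEN_INDEX = -200  # assumed placeholder id for this sketch
IGNORE_INDEX = -100       # assumed label value that the loss skips


def embed_token(token_id, dim=2):
    # Hypothetical text embedding: a constant vector per token id.
    return [float(token_id)] * dim


def splice_image_features(input_ids, labels, image_features):
    """Replace each placeholder in one sequence with the next image's features.

    image_features holds one entry per image; each entry is a list of
    feature vectors, one per visual token.
    """
    embeds, new_labels = [], []
    image_iter = iter(image_features)
    for token_id, label in zip(input_ids, labels):
        if token_id == IMAGE_TOKEN_INDEX:
            features = next(image_iter)
            embeds.extend(features)
            # Image positions never contribute to the language-modeling loss.
            new_labels.extend([IGNORE_INDEX] * len(features))
        else:
            embeds.append(embed_token(token_id))
            new_labels.append(label)
    return embeds, new_labels


input_ids = [1, IMAGE_TOKEN_INDEX, 7, 8]
labels = [IGNORE_INDEX, IGNORE_INDEX, 7, 8]
image = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three visual tokens, dim 2

embeds, new_labels = splice_image_features(input_ids, labels, [image])
```

One placeholder expands into as many positions as the image has visual tokens, which is why the fused sequence is longer than input_ids and why labels must be rebuilt alongside the embeddings.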
Usage
These mixin classes are inherited by all LLaVA model variants (LLaMA-based, MPT-based) to add multimodal capabilities. They are not instantiated on their own; they supply the multimodal logic that the variants share.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/llava_arch.py
- Lines: 1-265
Signature
class LlavaMetaModel:
    def __init__(self, config): ...
    def get_vision_tower(self): ...
    def initialize_vision_modules(self, model_args, fsdp=None): ...

class LlavaMetaForCausalLM(ABC):
    @abstractmethod
    def get_model(self): ...
    def get_vision_tower(self): ...
    def encode_images(self, images): ...
    def prepare_inputs_labels_for_multimodal(
        self, input_ids, attention_mask, past_key_values, labels, images): ...
    def initialize_vision_tokenizer(self, model_args, tokenizer): ...
Import
from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
I/O Contract
Inputs (prepare_inputs_labels_for_multimodal)
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.LongTensor | Yes | Token IDs containing IMAGE_TOKEN_INDEX placeholders |
| attention_mask | torch.Tensor | Yes | Attention mask for the input sequence |
| past_key_values | Tuple | No | Cached key/value states for generation |
| labels | torch.Tensor | No | Target labels for training |
| images | torch.Tensor or List[torch.Tensor] | No | Image tensors or list of image tensors |
Outputs (prepare_inputs_labels_for_multimodal)
| Name | Type | Description |
|---|---|---|
| input_ids | None | Always None (replaced by inputs_embeds) |
| attention_mask | torch.Tensor | Updated attention mask accounting for image tokens |
| past_key_values | Tuple | Passed through unchanged |
| inputs_embeds | torch.Tensor | Combined text and image embeddings |
| labels | torch.Tensor | Updated labels with IGNORE_INDEX at image positions |
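How the returned attention_mask and labels stay aligned with inputs_embeds after fusion can be sketched as a padding step over a batch. This is a simplified illustration, not the module's code: `pad_batch` is a hypothetical helper, right-padding with zero vectors is an assumption of this sketch, and sequences are plain Python lists rather than tensors.

```python
IGNORE_INDEX = -100  # assumed label value that the loss skips


def pad_batch(embeds_list, labels_list, dim):
    """Right-pad fused sequences to the longest length in the batch."""
    max_len = max(len(e) for e in embeds_list)
    batch_embeds, batch_labels, batch_mask = [], [], []
    for embeds, labels in zip(embeds_list, labels_list):
        pad = max_len - len(embeds)
        # Padding positions get zero vectors, ignored labels, and mask 0.
        batch_embeds.append(embeds + [[0.0] * dim for _ in range(pad)])
        batch_labels.append(labels + [IGNORE_INDEX] * pad)
        batch_mask.append([1] * len(embeds) + [0] * pad)
    return batch_embeds, batch_labels, batch_mask


# Two fused sequences of different lengths (3 and 5 positions, dim 2).
short_embeds = [[1.0, 1.0] for _ in range(3)]
short_labels = [IGNORE_INDEX, 5, 6]
long_embeds = [[2.0, 2.0] for _ in range(5)]
long_labels = [IGNORE_INDEX] * 3 + [9, 10]

batch_embeds, batch_labels, batch_mask = pad_batch(
    [short_embeds, long_embeds], [short_labels, long_labels], dim=2)
```

Every row of the three outputs has the same length, so they can be stacked into rectangular tensors and passed to the language model together.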
Usage Examples
Basic Usage
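# Typically used via inheritance, not directly (simplified sketch):
class LlavaLlamaModel(LlavaMetaModel, LlamaModel):
    config_class = LlavaConfig

class LlavaLlamaForCausalLM(LlamaForCausalLM, LlavaMetaForCausalLM):
    def forward(self, input_ids=None, attention_mask=None, past_key_values=None,
                labels=None, images=None, **kwargs):
        # Fuse image features into the sequence before the language model runs.
        input_ids, attention_mask, past_key_values, inputs_embeds, labels = \
            self.prepare_inputs_labels_for_multimodal(
                input_ids, attention_mask, past_key_values, labels, images)
        # Forward the updated mask, cache, and labels along with the embeddings.
        return super().forward(
            input_ids=input_ids, attention_mask=attention_mask,
            past_key_values=past_key_values, inputs_embeds=inputs_embeds,
            labels=labels, **kwargs)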