

Implementation:OpenGVLab InternVL LlavaMetaModel LlavaMetaForCausalLM

From Leeroopedia


Knowledge Sources
Domains: Multimodal Models, Vision-Language, LLaVA
Last Updated: 2026-02-07 14:00 GMT

Overview

LlavaMetaModel and LlavaMetaForCausalLM are the core mixin classes that add vision capabilities to a language model. They define how visual features are fused with text tokens in the LLaVA architecture.

Description

This module provides two mixin classes that form the architectural heart of LLaVA.

LlavaMetaModel is a mixin for the base model that initializes the vision_tower (via build_vision_tower) and mm_projector (via build_vision_projector) when the config contains an mm_vision_tower attribute. Its initialize_vision_modules() method builds the vision tower and projector from model arguments, optionally loads pretrained projector weights (including vision tower position embeddings), and stores vision configuration in the model config.
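The conditional construction described above can be illustrated with a minimal sketch. It is a toy stand-in, not the real class: strings take the place of the modules that build_vision_tower and build_vision_projector would return, and the config field names mm_hidden_size and hidden_size are assumptions made for illustration.

```python
from types import SimpleNamespace


class ToyMetaModel:
    """Toy stand-in for a LlavaMetaModel-style mixin (not the real class)."""

    def __init__(self, config):
        self.vision_tower = None
        self.mm_projector = None
        # Build the vision modules only when the config names a vision tower.
        if hasattr(config, "mm_vision_tower"):
            # Strings stand in for the real tower and projector modules.
            self.vision_tower = f"tower<{config.mm_vision_tower}>"
            self.mm_projector = (
                f"projector<{config.mm_hidden_size}->{config.hidden_size}>"
            )

    def get_vision_tower(self):
        return self.vision_tower


# A text-only config leaves both vision attributes unset.
text_only = ToyMetaModel(SimpleNamespace(hidden_size=4096))

# "hypothetical/vit" is a made-up tower name used only for this example.
multimodal = ToyMetaModel(
    SimpleNamespace(
        hidden_size=4096, mm_hidden_size=1024, mm_vision_tower="hypothetical/vit"
    )
)

print(text_only.get_vision_tower())   # None
print(multimodal.get_vision_tower())  # tower<hypothetical/vit>
```

With this pattern a checkpoint saved without vision settings still loads as a plain language model, and the vision modules can be attached later through a separate initialization call.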

LlavaMetaForCausalLM is an abstract mixin for causal LM models. Its encode_images() method passes images through the vision tower and projector to produce language-compatible features. The critical prepare_inputs_labels_for_multimodal() method performs the core multimodal fusion: it locates IMAGE_TOKEN_INDEX placeholders in input_ids, replaces them with encoded image features, masks image positions in labels with IGNORE_INDEX, handles variable-length sequences through padding and alignment, and supports both single and batched multi-image inputs. It also handles the mm_use_im_start_end mode where special start/end tokens surround image features. The initialize_vision_tokenizer() method adds special image tokens to the tokenizer and initializes their embeddings.
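The placeholder replacement at the core of prepare_inputs_labels_for_multimodal() can be sketched for a single sequence with plain Python lists. This is a simplified illustration rather than the module's code: the constant values, the helper name splice_image_features, and the toy_embed function are assumptions, and real inputs would be tensors.

```python
IMAGE_TOKEN_INDEX = -200  # assumed placeholder id; the real value lives in the project's constants
IGNORE_INDEX = -100       # assumed label value that the loss skips


def splice_image_features(input_ids, labels, image_features, embed):
    """Swap each image placeholder for that image's feature rows.

    input_ids / labels: lists of ints for ONE sequence.
    image_features: one list of feature rows per placeholder, in order.
    embed: hypothetical callable mapping a token id to an embedding row.
    """
    new_embeds, new_labels = [], []
    remaining = iter(image_features)
    for token, label in zip(input_ids, labels):
        if token == IMAGE_TOKEN_INDEX:
            rows = next(remaining)
            new_embeds.extend(rows)
            # Image positions never contribute to the loss.
            new_labels.extend([IGNORE_INDEX] * len(rows))
        else:
            new_embeds.append(embed(token))
            new_labels.append(label)
    return new_embeds, new_labels


def toy_embed(token):
    # Stand-in for the language model's embedding table.
    return [float(token), 0.0]


patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three projected patch rows
embeds, labels = splice_image_features(
    input_ids=[1, IMAGE_TOKEN_INDEX, 5, 6],
    labels=[IGNORE_INDEX, IGNORE_INDEX, 5, 6],
    image_features=[patches],
    embed=toy_embed,
)
print(len(embeds))  # 6: one placeholder expanded into three rows
print(labels)       # [-100, -100, -100, -100, 5, 6]
```

Because one placeholder expands into several rows, the output sequence is longer than the input, which is why the attention mask and labels have to be rebuilt alongside the embeddings.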

Usage

These mixin classes are inherited by all LLaVA model variants (LLaMA-based, MPT-based) to add multimodal capabilities. They are not used directly but provide the shared multimodal logic.
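The inheritance pattern can be shown with a toy example. All class names here are hypothetical, and the sketch assumes the mixin reaches the vision tower through the abstract get_model() hook that each concrete variant supplies.

```python
from abc import ABC, abstractmethod


class ToyMetaForCausalLM(ABC):
    """Toy mixin: shared logic that relies on the host class for get_model()."""

    @abstractmethod
    def get_model(self):
        ...

    def get_vision_tower(self):
        # Delegates to whatever base model the concrete class exposes.
        return self.get_model().get_vision_tower()


class ToyBaseModel:
    def get_vision_tower(self):
        return "toy-vision-tower"


class ToyBackboneLM:
    """Stand-in for a text-only causal LM (for example a LLaMA or MPT variant)."""

    def __init__(self):
        self.model = ToyBaseModel()


class ToyLlavaLM(ToyBackboneLM, ToyMetaForCausalLM):
    def get_model(self):
        return self.model


lm = ToyLlavaLM()
print(lm.get_vision_tower())  # toy-vision-tower

try:
    ToyMetaForCausalLM()  # the mixin alone is abstract
except TypeError:
    print("mixin cannot be instantiated directly")
```

Keeping the backbone-specific part down to get_model() is what lets the same multimodal logic be reused across different language-model families.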

Code Reference

Source Location

Signature

class LlavaMetaModel:
    def __init__(self, config): ...
    def get_vision_tower(self): ...
    def initialize_vision_modules(self, model_args, fsdp=None): ...

class LlavaMetaForCausalLM(ABC):
    @abstractmethod
    def get_model(self): ...
    def get_vision_tower(self): ...
    def encode_images(self, images): ...
    def prepare_inputs_labels_for_multimodal(
        self, input_ids, attention_mask, past_key_values, labels, images): ...
    def initialize_vision_tokenizer(self, model_args, tokenizer): ...
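As a rough illustration of what encode_images() does, the sketch below chains a toy vision tower with a toy linear projector using plain lists. Both helper functions and the weight matrix are hypothetical; the real method works on tensors and uses the modules built by the model.

```python
def toy_vision_tower(image):
    """Hypothetical tower: summarize each pixel row as one 2-wide patch feature."""
    return [[sum(row) / len(row), max(row)] for row in image]


def toy_projector(features, weight):
    """Linear map (no bias) from the vision width to the language-model width."""
    return [
        [sum(f * w for f, w in zip(row, column)) for column in zip(*weight)]
        for row in features
    ]


def encode_images(image, weight):
    # Same order as described for the mixin: vision tower first, projector second.
    return toy_projector(toy_vision_tower(image), weight)


image = [[0.0, 1.0], [2.0, 4.0]]             # two "patches" of raw pixels
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # vision width 2 -> hidden width 3
features = encode_images(image, weight)
print(features)  # [[0.5, 1.0, 1.5], [3.0, 4.0, 7.0]]
```

The projector's output width has to match the language model's embedding width, since these rows are later placed directly among the text token embeddings.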

Import

from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM

I/O Contract

Inputs (prepare_inputs_labels_for_multimodal)

| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.LongTensor | Yes | Token IDs containing IMAGE_TOKEN_INDEX placeholders |
| attention_mask | torch.Tensor | Yes | Attention mask for the input sequence |
| past_key_values | Tuple | No | Cached key/value states for generation |
| labels | torch.Tensor | No | Target labels for training |
| images | torch.Tensor or List[torch.Tensor] | No | Image tensors or list of image tensors |

Outputs (prepare_inputs_labels_for_multimodal)

| Name | Type | Description |
|---|---|---|
| input_ids | None | Always None (replaced by inputs_embeds) |
| attention_mask | torch.Tensor | Updated attention mask accounting for image tokens |
| past_key_values | Tuple | Passed through unchanged |
| inputs_embeds | torch.Tensor | Combined text and image embeddings |
| labels | torch.Tensor | Updated labels with IGNORE_INDEX at image positions |
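Sequences end up with different lengths after image features are spliced in, so the batch must be padded and the attention mask and labels extended to match. The sketch below shows one way to do that with plain lists. The right-padding choice, the helper name pad_batch, and the IGNORE_INDEX value are assumptions for illustration, not details confirmed by this page.

```python
IGNORE_INDEX = -100  # assumed label value that the loss skips


def pad_batch(batch_embeds, batch_labels, pad_row):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in batch_embeds)
    embeds, labels, mask = [], [], []
    for seq, lab in zip(batch_embeds, batch_labels):
        pad = max_len - len(seq)
        embeds.append(seq + [pad_row] * pad)
        # Padding positions are excluded from the loss and from attention.
        labels.append(lab + [IGNORE_INDEX] * pad)
        mask.append([1] * len(seq) + [0] * pad)
    return embeds, labels, mask


embeds, labels, mask = pad_batch(
    batch_embeds=[[[1.0], [2.0], [3.0]], [[4.0]]],
    batch_labels=[[IGNORE_INDEX, 7, 8], [9]],
    pad_row=[0.0],
)
print(mask)       # [[1, 1, 1], [1, 0, 0]]
print(labels[1])  # [9, -100, -100]
print(embeds[1])  # [[4.0], [0.0], [0.0]]
```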

Usage Examples

Basic Usage

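from transformers import LlamaForCausalLM, LlamaModel

# Typically used via inheritance, not directly (LlavaConfig is assumed to be defined elsewhere):
class LlavaLlamaModel(LlavaMetaModel, LlamaModel):
    config_class = LlavaConfig

class LlavaLlamaForCausalLM(LlamaForCausalLM, LlavaMetaForCausalLM):
    def get_model(self):
        return self.model  # required by the abstract mixin

    def forward(self, input_ids=None, attention_mask=None, past_key_values=None,
                labels=None, images=None, **kwargs):
        # Merge image features into the embeddings before the LM forward pass.
        input_ids, attention_mask, past_key_values, inputs_embeds, labels = \
            self.prepare_inputs_labels_for_multimodal(
                input_ids, attention_mask, past_key_values, labels, images)
        return super().forward(
            input_ids=input_ids, attention_mask=attention_mask,
            past_key_values=past_key_values, inputs_embeds=inputs_embeds,
            labels=labels, **kwargs)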
