
Principle:Haotian Liu LLaVA Multimodal Response Generation

From Leeroopedia

Overview

An inference technique that fuses visual embeddings with text token embeddings and generates text responses through autoregressive decoding.

Description

Multimodal response generation is the core inference step in LLaVA. It bridges the gap between visual perception (CLIP encoder) and language generation (LLM decoder) through an embedding fusion process followed by standard autoregressive text generation.

The process has two stages:

Stage 1: Visual-Text Embedding Fusion

prepare_inputs_labels_for_multimodal() replaces IMAGE_TOKEN_INDEX placeholders in the tokenized input with actual visual embeddings:

  1. Visual encoding -- Images are passed through the CLIP vision tower to produce patch-level features.
  2. Projection -- The CLIP features are projected into the LLM's embedding space via the MLP projector (mm_projector).
  3. Fusion -- The function splits input_ids at IMAGE_TOKEN_INDEX positions, converts text token IDs to embeddings, and interleaves the text embeddings with the projected visual embeddings.
  4. Output -- A fused input_embeds tensor where each image token has been expanded into N visual tokens (typically 576 for a 24x24 patch grid).
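The splitting-and-interleaving logic above can be sketched in pure Python. This is a toy stand-in, not LLaVA's actual implementation (which operates on torch tensors); the sentinel value -200 does match LLaVA's real IMAGE_TOKEN_INDEX constant, but the embedding lookup and visual features here are placeholders:

```python
IMAGE_TOKEN_INDEX = -200  # LLaVA's sentinel token ID for <image>

def fuse(input_ids, text_embed, visual_embeds):
    """Replace each IMAGE_TOKEN_INDEX with the full list of visual embeddings."""
    fused = []
    for tok in input_ids:
        if tok == IMAGE_TOKEN_INDEX:
            fused.extend(visual_embeds)    # one placeholder -> N visual tokens
        else:
            fused.append(text_embed(tok))  # ordinary embedding-table lookup
    return fused

# Toy example: 3 text tokens surrounding one image placeholder.
ids = [1, 2, IMAGE_TOKEN_INDEX, 3]
visual = [[0.5]] * 576                     # 576 projected patch embeddings
fused = fuse(ids, lambda t: [float(t)], visual)
print(len(fused))                          # 3 text + 576 visual = 579
```

The single image placeholder expands into 576 positions, which is why the fused sequence is longer than the tokenized prompt.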

Stage 2: Autoregressive Decoding

model.generate() performs autoregressive decoding on the fused embedding sequence:

  • Temperature sampling (temperature > 0) -- Samples from the softmax distribution with temperature scaling
  • Greedy decoding (temperature = 0) -- Selects the highest-probability token at each step
  • Beam search (num_beams > 1) -- Maintains multiple hypothesis beams
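The first two modes can be illustrated with a single-step toy decoder (a sketch of the sampling logic only, not HuggingFace's generate(); beam search is omitted for brevity):

```python
import math
import random

def decode_step(logits, temperature=0.0, rng=random.Random(0)):
    """One decoding step: greedy argmax when temperature == 0, else sample
    from the temperature-scaled softmax distribution."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                # subtract max for stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    r, acc = rng.random(), 0.0                     # multinomial sample
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

logits = [0.1, 2.0, 0.5]
print(decode_step(logits))  # temperature=0: greedy picks index 1
```

Lower temperatures sharpen the distribution toward the argmax; higher temperatures flatten it toward uniform sampling.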

Usage

Use as the final step in any LLaVA inference pipeline, after:

  1. Model loading (via load_pretrained_model())
  2. Image preprocessing (via process_images())
  3. Prompt construction and tokenization (via Conversation + tokenizer_image_token())
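The pipeline order above can be shown as a runnable skeleton. The function bodies here are stubs standing in for LLaVA's real load_pretrained_model(), process_images(), tokenizer_image_token(), and model.generate(); only the control flow is meaningful:

```python
def load_model():             # stands in for load_pretrained_model()
    return "model", "tokenizer", "image_processor"

def preprocess(image):        # stands in for process_images()
    return f"tensor({image})"

def build_prompt(text):       # stands in for Conversation + tokenizer_image_token()
    return ["<image>"] + text.split()

def generate(model, image_tensor, input_ids):
    # stands in for model.generate() on the fused embedding sequence
    return f"response to {len(input_ids)} tokens with {image_tensor}"

model, tok, proc = load_model()       # 1. model loading
img = preprocess("cat.jpg")           # 2. image preprocessing
ids = build_prompt("What is this?")   # 3. prompt construction + tokenization
out = generate(model, img, ids)       # final step: response generation
print(out)
```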

Theoretical Basis

prepare_inputs_labels_for_multimodal() works by:

  1. Encoding images through vision_tower (CLIP ViT-L/14) to produce feature vectors for each image patch
  2. Projecting via mm_projector (a 2-layer MLP) to align visual features with the LLM's embedding space
  3. Splitting input_ids at IMAGE_TOKEN_INDEX positions into text segments
  4. Interleaving text embeddings (from the LLM's embedding layer) with projected visual embeddings
  5. Creating a fused input_embeds tensor with corresponding attention mask and position IDs

The fusion expands the effective sequence length: each <image> token is replaced by 576 visual tokens (for 336x336 input with 14x14 patch size = 24x24 = 576 patches). A prompt with one image therefore gains 575 tokens net, since the single placeholder becomes 576 positions.
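The patch arithmetic works out as follows (the 40-token prompt length is a hypothetical example, not a LLaVA default):

```python
# 336x336 input, 14x14 patches -> 24x24 grid of visual tokens.
image_size, patch_size = 336, 14
grid = image_size // patch_size                  # 24 patches per side
num_visual_tokens = grid * grid                  # 576 visual tokens

prompt_text_tokens = 40                          # hypothetical prompt length
# The one <image> placeholder is removed and 576 visual tokens take its place.
effective_len = prompt_text_tokens - 1 + num_visual_tokens
print(grid, num_visual_tokens, effective_len)    # 24 576 615
```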

Generation uses the standard HuggingFace generate() method. When temperature > 0, the model uses multinomial sampling with top_p nucleus sampling. When temperature == 0, it falls back to greedy decoding (do_sample=False).
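The top_p nucleus filter mentioned above keeps the smallest set of highest-probability tokens whose cumulative mass reaches top_p, then renormalizes before sampling. A minimal sketch (toy probabilities, not the HuggingFace implementation):

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest high-probability nucleus reaching top_p mass,
    then renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, top_p=0.9)
print(sorted(nucleus))  # indices 0, 1, 2 survive; the 0.05 tail is cut
```

The surviving probabilities always sum to 1 after renormalization, so sampling proceeds over the truncated distribution as usual.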

Metadata

Field Value
Knowledge Sources Paper - Visual Instruction Tuning - https://arxiv.org/abs/2304.08485
Domains Multimodal_Inference, Text_Generation
Last Updated 2026-02-13 14:00 GMT
