Principle: LLaVA Multimodal Response Generation (Haotian Liu)
Overview
Inference technique that fuses visual embeddings with text token embeddings and generates text responses through autoregressive decoding.
Description
Multimodal response generation is the core inference step in LLaVA. It bridges the gap between visual perception (CLIP encoder) and language generation (LLM decoder) through an embedding fusion process followed by standard autoregressive text generation.
The process has two stages:
Stage 1: Visual-Text Embedding Fusion
prepare_inputs_labels_for_multimodal() replaces IMAGE_TOKEN_INDEX placeholders in the tokenized input with actual visual embeddings:
- Visual encoding -- Images are passed through the CLIP vision tower to produce patch-level features.
- Projection -- The CLIP features are projected into the LLM's embedding space via the MLP projector (mm_projector).
- Fusion -- The function splits input_ids at IMAGE_TOKEN_INDEX positions, converts text token IDs to embeddings, and interleaves the text embeddings with the projected visual embeddings.
- Output -- A fused input_embeds tensor where each image token has been expanded into N visual tokens (typically 576 for a 24x24 patch grid).
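The fusion step above can be sketched with plain Python lists standing in for tensors. The sentinel value, the embedding table, and the shapes here are illustrative assumptions, not the actual prepare_inputs_labels_for_multimodal() code:

```python
# Hypothetical sketch of Stage 1 (embedding fusion). Plain lists stand in
# for tensors; the sentinel value and toy embeddings are assumptions.
IMAGE_TOKEN_INDEX = -200  # placeholder id inserted during tokenization

def fuse_embeddings(input_ids, text_embed, image_features):
    """Split input_ids at IMAGE_TOKEN_INDEX and interleave text embeddings
    with projected visual embeddings, mirroring the fusion described above."""
    fused = []
    img_iter = iter(image_features)  # one feature list per <image> token
    for tok in input_ids:
        if tok == IMAGE_TOKEN_INDEX:
            fused.extend(next(img_iter))   # expand into N visual tokens
        else:
            fused.append(text_embed[tok])  # ordinary text-token embedding
    return fused

# Toy example: 3 text tokens around one image expanded to 4 visual tokens.
text_embed = {0: [0.0], 1: [1.0], 2: [2.0]}
visual = [[[9.0], [9.1], [9.2], [9.3]]]  # "projected CLIP features"
seq = fuse_embeddings([0, IMAGE_TOKEN_INDEX, 1, 2], text_embed, visual)
print(len(seq))  # 3 text embeddings + 4 visual embeddings = 7
```

In the real model the visual expansion is 576 tokens per image rather than 4, but the splitting-and-interleaving logic is the same.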
Stage 2: Autoregressive Decoding
model.generate() performs autoregressive decoding on the fused embedding sequence:
- Temperature sampling (temperature > 0) -- Samples from the softmax distribution with temperature scaling
- Greedy decoding (temperature = 0) -- Selects the highest-probability token at each step
- Beam search (num_beams > 1) -- Maintains multiple hypothesis beams
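A minimal sketch of the greedy vs. temperature-sampling choice, operating on a raw logit vector in pure Python. The real decoding loop lives inside HuggingFace generate() and is considerably more involved:

```python
import math
import random

def next_token(logits, temperature=0.0, rng=random):
    """Pick the next token id from raw logits: greedy argmax when
    temperature == 0, temperature-scaled multinomial sampling otherwise."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

print(next_token([1.0, 3.0, 2.0], temperature=0.0))  # greedy -> 1
```

Lower temperatures sharpen the distribution toward the argmax; higher temperatures flatten it, trading determinism for diversity.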
Usage
Use as the final step in any LLaVA inference pipeline, after:
- Model loading (via load_pretrained_model())
- Image preprocessing (via process_images())
- Prompt construction and tokenization (via Conversation + tokenizer_image_token())
Theoretical Basis
prepare_inputs_labels_for_multimodal() works by:
- Encoding images through vision_tower (CLIP ViT-L/14) to produce feature vectors for each image patch
- Projecting via mm_projector (a 2-layer MLP) to align visual features with the LLM's embedding space
- Splitting input_ids at IMAGE_TOKEN_INDEX positions into text segments
- Interleaving text embeddings (from the LLM's embedding layer) with projected visual embeddings
- Creating a fused input_embeds tensor with the corresponding attention mask and position IDs
The fusion expands the effective sequence length: each <image> token is replaced by 576 visual tokens (a 336x336 input with 14x14 patches yields a 24x24 = 576 patch grid). A prompt with one image therefore carries 576 additional tokens.
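The token-count arithmetic can be checked directly:

```python
# Visual-token count for a 336x336 input with 14x14 patches, as stated above.
image_size, patch_size = 336, 14
grid = image_size // patch_size   # 24 patches per side
num_visual_tokens = grid * grid   # 24 * 24 = 576
print(grid, num_visual_tokens)    # 24 576
```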
Generation uses the standard HuggingFace generate() method. When temperature > 0, the model uses multinomial sampling with top_p nucleus sampling. When temperature == 0, it falls back to greedy decoding (do_sample=False).
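A sketch of the nucleus (top-p) filtering step mentioned above, applied here to an already-normalized probability vector for simplicity; HuggingFace applies the equivalent filter to logits inside generate() when do_sample=True:

```python
def top_p_filter(probs, top_p=0.9):
    """Nucleus filtering sketch: keep the smallest set of tokens whose
    cumulative probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Keeps the two most likely tokens and renormalizes their probabilities.
print(top_p_filter([0.4, 0.3, 0.2, 0.1], top_p=0.6))
```

Sampling then proceeds over the renormalized nucleus, which suppresses low-probability tail tokens without forcing fully greedy decoding.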
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Paper - Visual Instruction Tuning - https://arxiv.org/abs/2304.08485 |
| Domains | Multimodal_Inference, Text_Generation |
| Last Updated | 2026-02-13 14:00 GMT |