Principle: LLaVA Multimodal Response Generation (Haotian Liu)
Overview
Inference technique that fuses visual embeddings with text token embeddings and generates text responses through autoregressive decoding.
Description
Multimodal response generation is the core inference step in LLaVA. It bridges the gap between visual perception (CLIP encoder) and language generation (LLM decoder) through an embedding fusion process followed by standard autoregressive text generation.
The process has two stages:
Stage 1: Visual-Text Embedding Fusion
prepare_inputs_labels_for_multimodal() replaces IMAGE_TOKEN_INDEX placeholders in the tokenized input with actual visual embeddings:
- Visual encoding -- Images are passed through the CLIP vision tower to produce patch-level features.
- Projection -- The CLIP features are projected into the LLM's embedding space via the MLP projector (mm_projector).
- Fusion -- The function splits input_ids at IMAGE_TOKEN_INDEX positions, converts text token IDs to embeddings, and interleaves the text embeddings with the projected visual embeddings.
- Output -- A fused input_embeds tensor where each image token has been expanded into N visual tokens (typically 576 for a 24x24 patch grid).
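The fusion step above can be sketched with plain Python lists standing in for tensors. The sentinel value, the embedding table, and the shapes here are illustrative assumptions, not the actual prepare_inputs_labels_for_multimodal() code:

```python
# Hypothetical sketch of Stage 1 (embedding fusion). Plain lists stand in
# for tensors; the sentinel value and toy embeddings are assumptions.
IMAGE_TOKEN_INDEX = -200  # placeholder id inserted during tokenization

def fuse_embeddings(input_ids, text_embed, image_features):
    """Split input_ids at IMAGE_TOKEN_INDEX and interleave text embeddings
    with projected visual embeddings, mirroring the fusion described above."""
    fused = []
    img_iter = iter(image_features)  # one feature list per <image> token
    for tok in input_ids:
        if tok == IMAGE_TOKEN_INDEX:
            fused.extend(next(img_iter))   # expand into N visual tokens
        else:
            fused.append(text_embed[tok])  # ordinary text-token embedding
    return fused

# Toy example: 3 text tokens around one image expanded to 4 visual tokens.
text_embed = {0: [0.0], 1: [1.0], 2: [2.0]}
visual = [[[9.0], [9.1], [9.2], [9.3]]]  # "projected CLIP features"
seq = fuse_embeddings([0, IMAGE_TOKEN_INDEX, 1, 2], text_embed, visual)
print(len(seq))  # 3 text embeddings + 4 visual embeddings = 7
```

In the real model the visual expansion is 576 tokens per image rather than 4, but the splitting-and-interleaving logic is the same.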
Stage 2: Autoregressive Decoding
model.generate() performs autoregressive decoding on the fused embedding sequence:
- Temperature sampling (temperature > 0) -- Samples from the softmax distribution with temperature scaling
- Greedy decoding (temperature = 0) -- Selects the highest-probability token at each step
- Beam search (num_beams > 1) -- Maintains multiple hypothesis beams
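A minimal sketch of the greedy vs. temperature-sampling choice, operating on a raw logit vector in pure Python. The real decoding loop lives inside HuggingFace generate() and is considerably more involved:

```python
import math
import random

def next_token(logits, temperature=0.0, rng=random):
    """Pick the next token id from raw logits: greedy argmax when
    temperature == 0, temperature-scaled multinomial sampling otherwise."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

print(next_token([1.0, 3.0, 2.0], temperature=0.0))  # greedy -> 1
```

Lower temperatures sharpen the distribution toward the argmax; higher temperatures flatten it, trading determinism for diversity.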
Usage
Use as the final step in any LLaVA inference pipeline, after:
- Model loading (via load_pretrained_model())
- Image preprocessing (via process_images())
- Prompt construction and tokenization (via Conversation + tokenizer_image_token())
Theoretical Basis
prepare_inputs_labels_for_multimodal() works by:
- Encoding images through vision_tower (CLIP ViT-L/14) to produce feature vectors for each image patch
- Projecting via mm_projector (a 2-layer MLP) to align visual features with the LLM's embedding space
- Splitting input_ids at IMAGE_TOKEN_INDEX positions into text segments
- Interleaving text embeddings (from the LLM's embedding layer) with projected visual embeddings
- Creating a fused input_embeds tensor with the corresponding attention mask and position IDs
The fusion expands the effective sequence length: each <image> token is replaced by 576 visual tokens (a 336x336 input with 14x14 patches yields a 24x24 = 576 patch grid). A prompt with one image therefore carries 576 additional tokens.
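The token-count arithmetic can be checked directly:

```python
# Visual-token count for a 336x336 input with 14x14 patches, as stated above.
image_size, patch_size = 336, 14
grid = image_size // patch_size   # 24 patches per side
num_visual_tokens = grid * grid   # 24 * 24 = 576
print(grid, num_visual_tokens)    # 24 576
```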
Generation uses the standard HuggingFace generate() method. When temperature > 0, the model uses multinomial sampling with top_p nucleus sampling. When temperature == 0, it falls back to greedy decoding (do_sample=False).
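A sketch of the nucleus (top-p) filtering step mentioned above, applied here to an already-normalized probability vector for simplicity; HuggingFace applies the equivalent filter to logits inside generate() when do_sample=True:

```python
def top_p_filter(probs, top_p=0.9):
    """Nucleus filtering sketch: keep the smallest set of tokens whose
    cumulative probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Keeps the two most likely tokens and renormalizes their probabilities.
print(top_p_filter([0.4, 0.3, 0.2, 0.1], top_p=0.6))
```

Sampling then proceeds over the renormalized nucleus, which suppresses low-probability tail tokens without forcing fully greedy decoding.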
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Paper - Visual Instruction Tuning - https://arxiv.org/abs/2304.08485 |
| Domains | Multimodal_Inference, Text_Generation |
| Last Updated | 2026-02-13 14:00 GMT |