Implementation: haotian-liu/LLaVA Multimodal Generation (`model.generate()`)
Overview
Concrete tool for generating text responses from multimodal inputs using LLaVA's fused vision-language model. Combines visual embedding injection with autoregressive text decoding.
Sources
- File: `llava/eval/run_llava.py`, Lines: L114-128 (generate call)
- File: `llava/model/llava_arch.py`, Lines: L145-324 (`prepare_inputs_labels_for_multimodal`)
Signature
`model.generate()`

```python
# Called on a LlavaLlamaForCausalLM instance:
output_ids = model.generate(
    input_ids: torch.Tensor,             # Tokenized prompt with IMAGE_TOKEN_INDEX
    images: torch.Tensor,                # Preprocessed image tensor
    image_sizes: List[Tuple[int, int]],  # Original image dimensions
    do_sample: bool,                     # Whether to use sampling (True if temperature > 0)
    temperature: float,                  # Sampling temperature
    top_p: float,                        # Nucleus sampling threshold
    num_beams: int,                      # Number of beams for beam search
    max_new_tokens: int,                 # Maximum tokens to generate
    use_cache: bool = True,              # Use KV cache for efficient generation
) -> torch.Tensor
```
`prepare_inputs_labels_for_multimodal()` (internal)

```python
def prepare_inputs_labels_for_multimodal(
    self,
    input_ids: torch.Tensor,
    position_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    past_key_values,
    labels: torch.Tensor,
    images: torch.Tensor,
    image_sizes: Optional[List[Tuple[int, int]]] = None,
) -> Tuple[None, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Replace IMAGE_TOKEN_INDEX in input_ids with actual visual embeddings.

    Returns:
        Tuple of (None, position_ids, attention_mask, past_key_values, input_embeds, labels),
        where input_embeds contains the fused visual-text embeddings.
    """
```
Import
```python
# Model loaded via load_pretrained_model; generate is a method on the model instance
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(...)
output_ids = model.generate(...)
```
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `input_ids` | `torch.Tensor` | Yes | Tokenized prompt containing `IMAGE_TOKEN_INDEX` (-200) at image positions |
| `images` | `torch.Tensor` | Yes | Preprocessed image tensor from `process_images()` |
| `image_sizes` | `List[Tuple[int, int]]` | For anyres | Original image dimensions (width, height) |
| `do_sample` | `bool` | Yes | `True` for temperature sampling, `False` for greedy |
| `temperature` | `float` | Yes | Sampling temperature (0.0 for greedy) |
| `top_p` | `float` | No | Nucleus sampling threshold (default: `None`) |
| `num_beams` | `int` | No | Beam search width (default: 1) |
| `max_new_tokens` | `int` | Yes | Maximum number of tokens to generate |
| `use_cache` | `bool` | No | Use KV cache (default: `True`) |
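Since `do_sample` must agree with `temperature`, the two are usually derived together. The sketch below shows one way to wire the generation kwargs consistently; the helper name `generation_kwargs` is illustrative, not part of LLaVA's API:

```python
# Sketch: derive generation kwargs so that temperature == 0 means greedy
# decoding and temperature > 0 enables stochastic sampling.
def generation_kwargs(temperature: float, top_p=None, num_beams: int = 1,
                      max_new_tokens: int = 512) -> dict:
    return {
        "do_sample": temperature > 0,
        "temperature": temperature,
        "top_p": top_p,
        "num_beams": num_beams,
        "max_new_tokens": max_new_tokens,
        "use_cache": True,
    }

print(generation_kwargs(0.0)["do_sample"])  # False -> greedy decoding
print(generation_kwargs(0.2)["do_sample"])  # True  -> temperature sampling
```

The dict can then be splatted into the call as `model.generate(input_ids, images=image_tensor, **generation_kwargs(0.2))`.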
Outputs
| Output | Type | Description |
|---|---|---|
| `output_ids` | `torch.Tensor` | Generated token IDs (shape: `[batch, seq_len]`) |
The output is decoded to text via:
```python
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```
Usage Example
```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.conversation import conv_templates
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from PIL import Image
import torch

# 1. Load model
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name="llava-v1.5-13b",
)

# 2. Preprocess image
image = Image.open("photo.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)

# 3. Construct prompt
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# 4. Tokenize
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt'
).unsqueeze(0).cuda()

# 5. Generate
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=True,
        temperature=0.2,
        top_p=None,
        num_beams=1,
        max_new_tokens=512,
        use_cache=True,
    )

# 6. Decode
output_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(output_text)
```
Description
`model.generate()` with the `images` parameter triggers the full multimodal generation pipeline:

Internal Flow
- Forward hook -- The overridden `prepare_inputs_for_generation()` method detects the `images` parameter and calls `prepare_inputs_labels_for_multimodal()`.
- Visual encoding -- Images pass through the CLIP vision tower (`self.get_model().get_vision_tower()(images)`), producing patch features.
- Projection -- Patch features are projected via `self.get_model().mm_projector(image_features)` into the LLM embedding space.
- Embedding fusion -- `input_ids` is split at `IMAGE_TOKEN_INDEX` positions. Text segments are embedded via the LLM's embedding layer, and visual embeddings are inserted between them.
- Sequence construction -- The fused `input_embeds` tensor, along with the updated `attention_mask` and `position_ids`, is passed to the LLM for autoregressive generation.
- Token generation -- The LLM generates tokens autoregressively until `max_new_tokens` is reached or an EOS token is produced.
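The embedding-fusion step above can be illustrated with a minimal pure-Python sketch. This is a toy with 1-d "embeddings" and hypothetical helper names; the real implementation operates on batched tensors and also rebuilds `attention_mask` and `position_ids`:

```python
IMAGE_TOKEN_INDEX = -200  # sentinel LLaVA uses for the image placeholder

def splice_visual_embeddings(input_ids, text_embed, visual_embeds):
    """Replace each IMAGE_TOKEN_INDEX with that image's visual embeddings.

    input_ids:     flat list of token ids (may contain IMAGE_TOKEN_INDEX)
    text_embed:    callable mapping a token id to its embedding
    visual_embeds: one embedding sequence per image, consumed in order
    """
    fused = []
    img_idx = 0
    for tok in input_ids:
        if tok == IMAGE_TOKEN_INDEX:
            fused.extend(visual_embeds[img_idx])  # one token -> N visual tokens
            img_idx += 1
        else:
            fused.append(text_embed(tok))
    return fused

# Toy usage: one image expanding to 3 visual tokens between 2 text tokens
embed = lambda t: [float(t)]
fused = splice_visual_embeddings([1, -200, 2], embed, [[[9.0], [9.1], [9.2]]])
print(len(fused))  # 5 = 2 text embeddings + 3 visual embeddings
```

Note how the fused sequence is longer than `input_ids`, which is why the attention mask and position ids must be recomputed afterward.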
Embedding Expansion
Each <image> token expands to 576 visual tokens (for 336x336 input with 14x14 patch size, yielding a 24x24 grid). For anyres mode, the number of visual tokens scales with the number of patches plus the global view.
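The 576-token figure follows directly from the patch grid; a quick check in plain arithmetic (not LLaVA code):

```python
# Visual tokens per image = (image_size / patch_size) ** 2 for a square ViT grid.
image_size, patch_size = 336, 14
grid = image_size // patch_size      # patches per side
num_visual_tokens = grid * grid      # total patches in the grid
print(grid, num_visual_tokens)  # 24 576
```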
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Paper - Visual Instruction Tuning - https://arxiv.org/abs/2304.08485 |
| Domains | Multimodal_Inference, Text_Generation |
| Last Updated | 2026-02-13 14:00 GMT |