Implementation:Haotian Liu LLaVA Model VQA Loader Eval
Overview
Primary batch inference engine for running multi-GPU evaluation of LLaVA models on VQA benchmarks. Provides a DataLoader-based pipeline that handles model loading, image preprocessing, prompt construction, and answer generation.
Description
model_vqa_loader.py provides the eval_model() function which loads a pretrained LLaVA model, creates a DataLoader from a CustomDataset, and iterates through evaluation questions generating answers. The CustomDataset handles image loading from disk, CLIP vision preprocessing via process_images(), conversation template formatting with image token injection, and tokenization via tokenizer_image_token(). Results are written as JSONL with fields: question_id, prompt, text, answer_id, model_id.
The module also provides helper functions split_list() and get_chunk() for dividing questions across multiple GPU processes, and a collate_fn for batching tensors in the DataLoader.
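The chunking behavior described above can be sketched as follows. This is a minimal reconstruction of the documented behavior (contiguous, roughly equal-sized slices), not necessarily the verbatim source:

```python
import math

def split_list(lst, n):
    """Split lst into n contiguous chunks of roughly equal size."""
    chunk_size = math.ceil(len(lst) / n)  # ceiling division
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    """Return chunk k of n, i.e. the slice of questions for GPU process k."""
    return split_list(lst, n)[k]
```

With 10 questions and 3 chunks, processes 0, 1, and 2 receive 4, 4, and 2 questions respectively, so every question is evaluated exactly once.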
Source
- llava/eval/model_vqa_loader.py:L79-126 (eval_model)
- llava/eval/model_vqa_loader.py:L31-61 (CustomDataset)
- llava/eval/model_vqa_loader.py:L72-76 (create_data_loader)
- llava/eval/model_vqa_loader.py:L19-27 (split_list, get_chunk)
API Signature
def eval_model(args) -> None:
"""
Main evaluation function. Loads model, processes questions, writes answers.
args attributes:
model_path: str # Path to pretrained model checkpoint
model_base: str = None # Base model for LoRA checkpoints
image_folder: str # Root directory containing images
question_file: str # Path to input JSONL question file
answers_file: str # Path to output JSONL answer file
num_chunks: int = 1 # Total number of GPU chunks
chunk_idx: int = 0 # Index of current chunk (0 to num_chunks-1)
conv_mode: str = 'llava_v1' # Conversation template name
temperature: float = 0.2 # Sampling temperature (0 = greedy)
top_p: float = None # Nucleus sampling parameter
num_beams: int = 1 # Beam search width
max_new_tokens: int = 128 # Maximum generated tokens
"""
class CustomDataset(Dataset):
"""PyTorch Dataset for evaluation questions with image preprocessing."""
def __init__(self, questions: list, image_folder: str, tokenizer,
image_processor, model_config) -> None:
"""
Args:
questions: List of dicts with 'image' and 'text' keys
image_folder: Root directory for image files
tokenizer: LLaVA tokenizer instance
image_processor: CLIP image processor
model_config: Model configuration object
"""
def __getitem__(self, index) -> tuple:
"""Returns (input_ids: Tensor, image_tensor: Tensor, image_size: tuple[int,int])"""
def __len__(self) -> int:
"""Returns number of questions."""
def create_data_loader(questions: list, image_folder: str, tokenizer,
image_processor, model_config,
batch_size: int = 1, num_workers: int = 4) -> DataLoader:
"""
Creates DataLoader with batch_size=1 (enforced by assertion).
Uses collate_fn to stack input_ids and image_tensors.
"""
Import
from llava.eval.model_vqa_loader import eval_model
Inputs
| Input | Format | Description |
|---|---|---|
| Model checkpoint | Directory path | Pretrained LLaVA model (e.g., liuhaotian/llava-v1.5-13b) |
| Question file | JSONL | One JSON object per line with question_id, image, text fields |
| Image folder | Directory path | Root directory containing referenced image files |
| Chunk params | int, int | num_chunks (total GPUs) and chunk_idx (this GPU's index) |
Outputs
Answer JSONL file with one JSON object per line:
{
"question_id": 12345,
"prompt": "What is shown in this image?\nAnswer the question using a single word or phrase.",
"text": "a cat sitting on a table",
"answer_id": "abc123shortuuid",
"model_id": "llava-v1.5-13b",
"metadata": {}
}
Usage Example
Multi-GPU Launch from vqav2.sh
#!/bin/bash
gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
IFS=',' read -ra GPULIST <<< "$gpu_list"
CHUNKS=${#GPULIST[@]}
CKPT="llava-v1.5-13b"
SPLIT="llava_vqav2_mscoco_test-dev2015"
# Launch parallel inference across all GPUs
for IDX in $(seq 0 $((CHUNKS-1))); do
CUDA_VISIBLE_DEVICES=${GPULIST[$IDX]} python -m llava.eval.model_vqa_loader \
--model-path liuhaotian/llava-v1.5-13b \
--question-file ./playground/data/eval/vqav2/$SPLIT.jsonl \
--image-folder ./playground/data/eval/vqav2/test2015 \
--answers-file ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl \
--num-chunks $CHUNKS \
--chunk-idx $IDX \
--temperature 0 \
--conv-mode vicuna_v1 &
done
wait
# Merge all chunk files into a single answer file
output_file=./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/merge.jsonl
> "$output_file"
for IDX in $(seq 0 $((CHUNKS-1))); do
cat ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done
Single-GPU Inference
python -m llava.eval.model_vqa_loader \
--model-path liuhaotian/llava-v1.5-13b \
--question-file ./playground/data/eval/textvqa/llava_textvqa_val_v051_ocr.jsonl \
--image-folder ./playground/data/eval/textvqa/train_val_images \
--answers-file ./playground/data/eval/textvqa/answers/llava-v1.5-13b.jsonl \
--temperature 0 \
--conv-mode vicuna_v1
Key Implementation Details
- Image token injection: The CustomDataset.__getitem__ method prepends the <image> token to the question text. If mm_use_im_start_end is enabled, it wraps the image token with <im_start> and <im_end> tokens.
- Conversation template: Uses conv_templates[args.conv_mode].copy() to construct the prompt with proper role prefixes (e.g., "USER:" and "ASSISTANT:" for vicuna_v1).
- Plain model detection: Auto-switches to mmtag conversation mode if a plain (non-finetuned) model is detected.
- Output directory creation: Automatically creates the output directory via os.makedirs(..., exist_ok=True).
Related Pages
- implements Principle:Haotian_liu_LLaVA_Batch_VQA_Inference
- Environment:Haotian_liu_LLaVA_Python_CUDA_Training_Environment
- Heuristic:Haotian_liu_LLaVA_Use_Cache_Training_Inference_Toggle
Metadata
| Property | Value |
|---|---|
| last_updated | 2026-02-13 14:00 GMT |
| page_type | Implementation (API Doc) |
| workflow | Benchmark_Evaluation |
| source_file | llava/eval/model_vqa_loader.py |