
Principle:Haotian Liu LLaVA Batch VQA Inference


Overview

Technique for running scalable visual question answering inference across large evaluation datasets using parallel GPU processing with a DataLoader-based approach.

Description

Batch VQA inference uses a DataLoader-based approach to process evaluation questions efficiently. Questions are split into chunks across multiple GPUs for parallel processing. Each GPU loads the full model and processes its assigned chunk independently, writing results to separate answer files that are later merged into a single output.

The core components of this approach are:

  • CustomDataset class - Handles image loading from disk, CLIP vision encoder preprocessing, conversation template formatting (with image token injection), and input tokenization. Each item returns a tuple of (input_ids, image_tensor, image_size).
  • create_data_loader - Wraps the dataset in a PyTorch DataLoader with batch_size=1 (required due to variable image sizes and input sequence lengths) and num_workers=4 for asynchronous data loading.
  • eval_model - The main inference loop that loads the pretrained model, creates the DataLoader, iterates through questions, generates answers via model.generate(), and writes results as JSONL.
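The three components can be sketched structurally as follows. This is a hypothetical stdlib-only simplification: the real CustomDataset loads images with PIL, preprocesses them with the CLIP image processor, and tokenizes via the conversation template, and the real create_data_loader returns a PyTorch DataLoader with num_workers=4; the tokenize and preprocess callables and the fixed image_size below are stand-ins.

```python
class CustomDataset:
    """Structural sketch of the dataset described above.

    Hypothetical simplification: real code opens the image from disk,
    runs CLIP preprocessing, injects the image token into the
    conversation template, and tokenizes the resulting prompt.
    """

    def __init__(self, questions, image_folder, tokenize, preprocess):
        self.questions = questions        # list of {"question_id", "image", "text"}
        self.image_folder = image_folder
        self.tokenize = tokenize          # stand-in for template formatting + tokenization
        self.preprocess = preprocess      # stand-in for CLIP image preprocessing

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        q = self.questions[idx]
        input_ids = self.tokenize(q["text"])        # token ids for prompt (with image token)
        image_tensor = self.preprocess(q["image"])  # preprocessed image tensor
        image_size = (336, 336)                     # placeholder; real size comes from the image
        return input_ids, image_tensor, image_size


def create_data_loader(questions, image_folder, tokenize, preprocess, batch_size=1):
    # The real function asserts batch_size == 1 and wraps the dataset
    # in torch.utils.data.DataLoader(num_workers=4); a generator stands in here.
    assert batch_size == 1, "batch_size must be 1 (variable image/sequence shapes)"
    ds = CustomDataset(questions, image_folder, tokenize, preprocess)
    return (ds[i] for i in range(len(ds)))
```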

The multi-GPU parallelism is orchestrated at the shell script level, not within the Python code. Each GPU runs a separate Python process with different --chunk-idx values, and all processes run concurrently via bash background jobs.

Usage

Use this for all large-scale benchmark evaluations including VQAv2, GQA, TextVQA, POPE, MMBench, SEED-Bench, and others. The multi-GPU chunk splitting is the standard parallelism strategy used by all V1.5 evaluation scripts under scripts/v1_5/eval/.

Typical Multi-GPU Launch Pattern

#!/bin/bash
gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
IFS=',' read -ra GPULIST <<< "$gpu_list"
CHUNKS=${#GPULIST[@]}

for IDX in $(seq 0 $((CHUNKS-1))); do
    CUDA_VISIBLE_DEVICES=${GPULIST[$IDX]} python -m llava.eval.model_vqa_loader \
        --model-path liuhaotian/llava-v1.5-13b \
        --question-file ./playground/data/eval/vqav2/$SPLIT.jsonl \
        --image-folder ./playground/data/eval/vqav2/test2015 \
        --answers-file ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl \
        --num-chunks $CHUNKS \
        --chunk-idx $IDX \
        --temperature 0 \
        --conv-mode vicuna_v1 &
done
wait

# Merge chunk answer files
output_file=./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/merge.jsonl
> "$output_file"
for IDX in $(seq 0 $((CHUNKS-1))); do
    cat ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done

Theoretical Basis

Chunk Splitting Strategy

Chunk splitting divides N questions into K chunks (one per GPU). The split_list() function computes chunk_size = ceil(N / K) and creates roughly equal-sized partitions. Each chunk runs independently with no inter-process communication required.
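A minimal sketch of this splitting logic, with a get_chunk helper (an assumed name) that maps the --num-chunks and --chunk-idx flags onto split_list:

```python
import math

def split_list(lst, n):
    """Split lst into n roughly equal chunks: chunk_size = ceil(N / n)."""
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    """Return chunk k of n, matching --num-chunks / --chunk-idx."""
    return split_list(lst, n)[k]
```

With 10 questions on 3 GPUs, the chunks hold 4, 4, and 2 questions; each process keeps only its own chunk and never communicates with the others.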

Batch Size Constraint

The batch size is fixed at 1 (enforced by an assertion in create_data_loader). This is required because:

  • Images have variable dimensions, making tensor stacking across samples impractical
  • Input token sequences have variable lengths due to different question texts
  • The collate_fn uses torch.stack which requires uniform tensor shapes
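The constraint can be sketched as a collate function (hypothetical stdlib version; the real collate_fn calls torch.stack on the image and token tensors):

```python
def collate_fn(batch):
    """Sketch of the batch-size-1 collate described above."""
    input_ids, image_tensors, image_sizes = zip(*batch)
    # torch.stack requires every tensor in the batch to share a shape;
    # with variable question lengths and image sizes that only holds
    # when the batch contains a single sample.
    assert len(batch) == 1, "stacking variable-shape samples would fail"
    return list(input_ids), list(image_tensors), list(image_sizes)
```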

Answer File Format

Each answer line is a JSON object containing:

  • question_id (int/str) - Original question identifier, used to align answers with ground truth
  • prompt (str) - The original question text
  • text (str) - The model-generated answer
  • answer_id (str) - Unique UUID for this answer (generated via shortuuid)
  • model_id (str) - Model name derived from the checkpoint path
  • metadata (dict) - Empty metadata dict (reserved for extensions)
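Building one such JSONL record might look like the sketch below (an assumed helper; the real code uses the third-party shortuuid package for answer_id, so uuid4 here is a stand-in):

```python
import json
import uuid

def format_answer_line(question_id, prompt, answer_text, model_id):
    """Serialize one answer record in the schema above."""
    return json.dumps({
        "question_id": question_id,
        "prompt": prompt,
        "text": answer_text,
        "answer_id": uuid.uuid4().hex[:22],  # stand-in for shortuuid.uuid()
        "model_id": model_id,
        "metadata": {},
    })
```

Each process appends one such line per question to its own ${CHUNKS}_${IDX}.jsonl file, which is why the plain `cat` concatenation in the merge step is sufficient.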

Greedy Decoding

LLaVA v1.5 evaluation uses greedy decoding (temperature=0) by default to ensure reproducibility. When temperature=0, do_sample is set to False, selecting the highest probability token at each step.
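The flag-to-argument mapping can be sketched as below (a hypothetical helper; the argument names mirror the Hugging Face generate() API that LLaVA builds on):

```python
def generation_kwargs(temperature):
    """Map the --temperature flag to generate() sampling arguments."""
    if temperature > 0:
        return {"do_sample": True, "temperature": temperature}
    # Greedy decoding: deterministically pick the argmax token each step.
    return {"do_sample": False}
```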

Knowledge Sources

Domains

  • Evaluation
  • Parallel_Computing

Metadata

  • last_updated: 2026-02-13 14:00 GMT
  • page_type: Principle
  • workflow: Benchmark_Evaluation
