
Principle:Haotian Liu LLaVA Batch VQA Inference


Overview

Technique for running scalable visual question answering inference across large evaluation datasets using parallel GPU processing with a DataLoader-based approach.

Description

Batch VQA inference uses a DataLoader-based approach to process evaluation questions efficiently. Questions are split into chunks across multiple GPUs for parallel processing. Each GPU loads the full model and processes its assigned chunk independently, writing results to separate answer files that are later merged into a single output.

The core components of this approach are:

  • CustomDataset class - Handles image loading from disk, CLIP vision encoder preprocessing, conversation template formatting (with image token injection), and input tokenization. Each item returns a tuple of (input_ids, image_tensor, image_size).
  • create_data_loader - Wraps the dataset in a PyTorch DataLoader with batch_size=1 (required due to variable image sizes and input sequence lengths) and num_workers=4 for asynchronous data loading.
  • eval_model - The main inference loop that loads the pretrained model, creates the DataLoader, iterates through questions, generates answers via model.generate(), and writes results as JSONL.
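The three components can be sketched structurally as follows. This is a hypothetical stdlib-only simplification: the real CustomDataset loads images with PIL, preprocesses them with the CLIP image processor, and tokenizes via the conversation template, and the real create_data_loader returns a PyTorch DataLoader with num_workers=4; the tokenize and preprocess callables and the fixed image_size below are stand-ins.

```python
class CustomDataset:
    """Structural sketch of the dataset described above.

    Hypothetical simplification: real code opens the image from disk,
    runs CLIP preprocessing, injects the image token into the
    conversation template, and tokenizes the resulting prompt.
    """

    def __init__(self, questions, image_folder, tokenize, preprocess):
        self.questions = questions        # list of {"question_id", "image", "text"}
        self.image_folder = image_folder
        self.tokenize = tokenize          # stand-in for template formatting + tokenization
        self.preprocess = preprocess      # stand-in for CLIP image preprocessing

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        q = self.questions[idx]
        input_ids = self.tokenize(q["text"])        # token ids for prompt (with image token)
        image_tensor = self.preprocess(q["image"])  # preprocessed image tensor
        image_size = (336, 336)                     # placeholder; real size comes from the image
        return input_ids, image_tensor, image_size


def create_data_loader(questions, image_folder, tokenize, preprocess, batch_size=1):
    # The real function asserts batch_size == 1 and wraps the dataset
    # in torch.utils.data.DataLoader(num_workers=4); a generator stands in here.
    assert batch_size == 1, "batch_size must be 1 (variable image/sequence shapes)"
    ds = CustomDataset(questions, image_folder, tokenize, preprocess)
    return (ds[i] for i in range(len(ds)))
```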

The multi-GPU parallelism is orchestrated at the shell script level, not within the Python code. Each GPU runs a separate Python process with different --chunk-idx values, and all processes run concurrently via bash background jobs.

Usage

Use this for all large-scale benchmark evaluations including VQAv2, GQA, TextVQA, POPE, MMBench, SEED-Bench, and others. The multi-GPU chunk splitting is the standard parallelism strategy used by all V1.5 evaluation scripts under scripts/v1_5/eval/.

Typical Multi-GPU Launch Pattern

#!/bin/bash
gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
IFS=',' read -ra GPULIST <<< "$gpu_list"
CHUNKS=${#GPULIST[@]}

for IDX in $(seq 0 $((CHUNKS-1))); do
    CUDA_VISIBLE_DEVICES=${GPULIST[$IDX]} python -m llava.eval.model_vqa_loader \
        --model-path liuhaotian/llava-v1.5-13b \
        --question-file ./playground/data/eval/vqav2/$SPLIT.jsonl \
        --image-folder ./playground/data/eval/vqav2/test2015 \
        --answers-file ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl \
        --num-chunks $CHUNKS \
        --chunk-idx $IDX \
        --temperature 0 \
        --conv-mode vicuna_v1 &
done
wait

# Merge chunk answer files
output_file=./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/merge.jsonl
> "$output_file"
for IDX in $(seq 0 $((CHUNKS-1))); do
    cat ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done

Theoretical Basis

Chunk Splitting Strategy

Chunk splitting divides N questions into K chunks (one per GPU). The split_list() function computes chunk_size = ceil(N / K) and creates roughly equal-sized partitions. Each chunk runs independently with no inter-process communication required.
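A minimal sketch of this splitting logic, with a get_chunk helper (an assumed name) that maps the --num-chunks and --chunk-idx flags onto split_list:

```python
import math

def split_list(lst, n):
    """Split lst into n roughly equal chunks: chunk_size = ceil(N / n)."""
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    """Return chunk k of n, matching --num-chunks / --chunk-idx."""
    return split_list(lst, n)[k]
```

With 10 questions on 3 GPUs, the chunks hold 4, 4, and 2 questions; each process keeps only its own chunk and never communicates with the others.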

Batch Size Constraint

The batch size is fixed at 1 (enforced by an assertion in create_data_loader). This is required because:

  • Images have variable dimensions, making tensor stacking across samples impractical
  • Input token sequences have variable lengths due to different question texts
  • The collate_fn uses torch.stack which requires uniform tensor shapes
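The constraint can be sketched as a collate function (hypothetical stdlib version; the real collate_fn calls torch.stack on the image and token tensors):

```python
def collate_fn(batch):
    """Sketch of the batch-size-1 collate described above."""
    input_ids, image_tensors, image_sizes = zip(*batch)
    # torch.stack requires every tensor in the batch to share a shape;
    # with variable question lengths and image sizes that only holds
    # when the batch contains a single sample.
    assert len(batch) == 1, "stacking variable-shape samples would fail"
    return list(input_ids), list(image_tensors), list(image_sizes)
```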

Answer File Format

Each answer line is a JSON object containing:

  • question_id (int/str) - Original question identifier, used to align answers with ground truth
  • prompt (str) - The original question text
  • text (str) - The model-generated answer
  • answer_id (str) - Unique UUID for this answer (generated via shortuuid)
  • model_id (str) - Model name derived from the checkpoint path
  • metadata (dict) - Empty metadata dict (reserved for extensions)
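Building one such JSONL record might look like the sketch below (an assumed helper; the real code uses the third-party shortuuid package for answer_id, so uuid4 here is a stand-in):

```python
import json
import uuid

def format_answer_line(question_id, prompt, answer_text, model_id):
    """Serialize one answer record in the schema above."""
    return json.dumps({
        "question_id": question_id,
        "prompt": prompt,
        "text": answer_text,
        "answer_id": uuid.uuid4().hex[:22],  # stand-in for shortuuid.uuid()
        "model_id": model_id,
        "metadata": {},
    })
```

Each process appends one such line per question to its own ${CHUNKS}_${IDX}.jsonl file, which is why the plain `cat` concatenation in the merge step is sufficient.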

Greedy Decoding

LLaVA v1.5 evaluation uses greedy decoding (temperature=0) by default to ensure reproducibility. When temperature=0, do_sample is set to False, selecting the highest probability token at each step.
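The flag-to-argument mapping can be sketched as below (a hypothetical helper; the argument names mirror the Hugging Face generate() API that LLaVA builds on):

```python
def generation_kwargs(temperature):
    """Map the --temperature flag to generate() sampling arguments."""
    if temperature > 0:
        return {"do_sample": True, "temperature": temperature}
    # Greedy decoding: deterministically pick the argmax token each step.
    return {"do_sample": False}
```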

Knowledge Sources

Domains

  • Evaluation
  • Parallel_Computing

Metadata

  • last_updated: 2026-02-13 14:00 GMT
  • page_type: Principle
  • workflow: Benchmark_Evaluation
