Implementation:OpenGVLab InternVL VQA Batch Loader Inference
| Knowledge Sources | |
|---|---|
| Domains | Inference, VQA, Data_Loading |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This script performs batch-mode VQA inference using a PyTorch DataLoader with a custom Dataset class to efficiently process large visual question answering datasets.
Description
The model_vqa_loader.py script extends the basic VQA inference pipeline with a DataLoader-based architecture for improved throughput on large evaluation datasets. Key components include:
CustomDataset class: A PyTorch Dataset that encapsulates the full preprocessing pipeline in __getitem__:
- Reads the question text and prepends appropriate image tokens
- Constructs the conversation prompt using the specified template
- Loads and preprocesses images via process_images (with aspect ratio handling)
- Tokenizes the prompt with tokenizer_image_token
- Returns (input_ids, image_tensor) tuples
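The per-item flow described above can be sketched without any deep-learning dependencies. This is a minimal illustration, not the script's actual code: `PromptDataset`, `Question`, the `"<image>"` placeholder, and the `USER: ... ASSISTANT:` template are all assumptions made for the example, and the tokenization and image preprocessing steps are only marked by a comment. In the real script the class subclasses `torch.utils.data.Dataset`.

```python
from dataclasses import dataclass

IMAGE_TOKEN = "<image>"  # assumed placeholder; the real token comes from the model's constants


@dataclass
class Question:
    question_id: int
    image: str
    text: str


class PromptDataset:
    """Map-style dataset: any object with __len__ and __getitem__ qualifies."""

    def __init__(self, questions, template="USER: {query} ASSISTANT:"):
        self.questions = questions
        self.template = template  # hypothetical stand-in for a conversation template

    def __getitem__(self, index):
        q = self.questions[index]
        # Prepend the image placeholder so the prompt marks where visual features go.
        query = IMAGE_TOKEN + "\n" + q.text
        prompt = self.template.format(query=query)
        # The real script would tokenize the prompt and preprocess the image here.
        return prompt, q.image

    def __len__(self):
        return len(self.questions)


dataset = PromptDataset([Question(1, "cat.jpg", "What animal is this?")])
prompt, image_name = dataset[0]
```

Keeping all per-item work inside `__getitem__` is what lets loader worker processes prepare the next item while the current one is being generated.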
create_data_loader function: Creates a DataLoader with configurable workers (default: 4) for parallel data loading. It asserts batch_size == 1, so each iteration yields a single question; the script's generation step is written for one prompt per call rather than padded, batched inputs.
eval_model function: Loads the model, creates the DataLoader, and iterates through it one item at a time. It auto-detects plain (pre-training) models and switches to the mmtag conversation mode. Output decoding follows the standard pattern: it strips the stop string from the decoded text, then writes one JSONL record per question with question_id, prompt, text, answer_id, and model_id.
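The output-decoding step can be illustrated with a small standard-library sketch. The helper names `strip_stop` and `write_answer` are hypothetical, and `uuid.uuid4().hex` is used here only as an assumed stand-in for whatever unique-ID generator the script relies on; the record fields follow the ones listed in this page.

```python
import io
import json
import uuid


def strip_stop(text: str, stop_str: str) -> str:
    """Trim whitespace, then drop one trailing stop string if present."""
    text = text.strip()
    if stop_str and text.endswith(stop_str):
        text = text[: -len(stop_str)]
    return text.strip()


def write_answer(fh, question_id, prompt, raw_output, stop_str, model_id):
    """Write one JSONL record and return it for inspection."""
    record = {
        "question_id": question_id,
        "prompt": prompt,
        "text": strip_stop(raw_output, stop_str),
        "answer_id": uuid.uuid4().hex,  # assumption: any unique ID works here
        "model_id": model_id,
        "metadata": {},
    }
    fh.write(json.dumps(record) + "\n")
    return record


buffer = io.StringIO()  # stands in for the answers file
rec = write_answer(buffer, 42, "What color is the car?", " Red</s> ", "</s>", "demo-model")
```

Writing one record per line means a partially completed run still leaves a readable answers file.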
Usage
Use this script for large-scale VQA benchmark evaluation (VQAv2, GQA, VizWiz), where parallel data loading workers can overlap image loading and preprocessing with generation and so reduce I/O bottlenecks relative to the sequential model_vqa.py script.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/eval/model_vqa_loader.py
- Lines: 1-144
Signature
class CustomDataset(Dataset):
    def __init__(self, questions, image_folder, tokenizer, image_processor, model_config): ...
    def __getitem__(self, index) -> tuple: ...
    def __len__(self) -> int: ...

def create_data_loader(questions, image_folder, tokenizer, image_processor, model_config,
                       batch_size=1, num_workers=4) -> DataLoader: ...

def eval_model(args: argparse.Namespace) -> None: ...
Import
from llava.eval.model_vqa_loader import CustomDataset, create_data_loader, eval_model
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model-path | str | Yes | Path to the pretrained LLaVA model |
| --model-base | str | No | Base model path for LoRA or projector-only models |
| --image-folder | str | No | Root directory for image files |
| --question-file | str | No | Path to JSONL question file (default: tables/question.jsonl) |
| --answers-file | str | No | Path for output JSONL answers file (default: answer.jsonl) |
| --conv-mode | str | No | Conversation template name (default: llava_v1) |
| --num-chunks | int | No | Number of chunks for multi-GPU splitting (default: 1) |
| --chunk-idx | int | No | Index of the chunk to process (default: 0) |
| --temperature | float | No | Sampling temperature (default: 0.2) |
| --top_p | float | No | Top-p sampling parameter (default: None) |
| --num_beams | int | No | Number of beams for beam search (default: 1) |
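To make the interaction of --num-chunks and --chunk-idx concrete, here is a hedged sketch of one way a question list can be split so that each GPU process handles its own slice. The helpers `split_list` and `get_chunk` are written for this example and are not guaranteed to match the script's implementation; only two of the flags from the table are parsed.

```python
import argparse
import math


def split_list(items, n):
    """Split items into at most n contiguous chunks of near-equal size."""
    chunk_size = max(1, math.ceil(len(items) / n))
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def get_chunk(items, n, k):
    """Return the k-th chunk (0-based) out of n."""
    return split_list(items, n)[k]


parser = argparse.ArgumentParser()
parser.add_argument("--num-chunks", type=int, default=1)
parser.add_argument("--chunk-idx", type=int, default=0)
# An explicit argument list is passed so the sketch runs outside a shell.
args = parser.parse_args(["--num-chunks", "4", "--chunk-idx", "1"])

questions = list(range(10))  # stand-in for the parsed question records
mine = get_chunk(questions, args.num_chunks, args.chunk_idx)
```

With ceiling-based chunk sizes, a short list can produce fewer than `n` chunks, so a high `--chunk-idx` may be out of range in this sketch; a launcher should account for that.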
Outputs
| Name | Type | Description |
|---|---|---|
| answers file | JSONL | Each line contains question_id, prompt, text, answer_id, model_id, and metadata |
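For downstream scoring, the answers file can be read back line by line. The following is a small sketch under the assumption that each line is a standalone JSON object with the fields listed above; `load_answers` is a hypothetical helper, and the sample records are invented for illustration.

```python
import json


def load_answers(lines):
    """Parse JSONL lines into a dict mapping question_id to answer text."""
    answers = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        answers[record["question_id"]] = record["text"]
    return answers


sample = [
    '{"question_id": 1, "prompt": "Q1", "text": "yes", "answer_id": "a1", "model_id": "m", "metadata": {}}',
    "",
    '{"question_id": 2, "prompt": "Q2", "text": "two", "answer_id": "a2", "model_id": "m", "metadata": {}}',
]
answers = load_answers(sample)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly in place of the sample list.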
Usage Examples
Basic Usage
# Command-line execution for batch VQA inference
# python internvl_chat_llava/llava/eval/model_vqa_loader.py \
# --model-path /path/to/llava-model \
# --image-folder /path/to/images \
# --question-file questions.jsonl \
# --answers-file answers.jsonl \
# --conv-mode llava_v1 \
# --temperature 0