Implementation:OpenGVLab InternVL VQA Batch Loader Inference

Knowledge Sources	OpenGVLab_InternVL
Domains	Inference, VQA, Data_Loading
Last Updated	2026-02-07 14:00 GMT

Overview

This script runs VQA inference over large question sets, using a PyTorch DataLoader with a custom Dataset class so that image loading and prompt preprocessing run in parallel worker processes ahead of generation.

Description

The model_vqa_loader.py script extends the basic VQA inference pipeline with a DataLoader-based architecture for improved throughput on large evaluation datasets. Key components include:

CustomDataset class: A PyTorch Dataset that encapsulates the full preprocessing pipeline in __getitem__:

Reads the question text and prepends appropriate image tokens
Constructs the conversation prompt using the specified template
Loads and preprocesses images via process_images (with aspect ratio handling)
Tokenizes the prompt with tokenizer_image_token
Returns (input_ids, image_tensor) tuples
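
The steps above can be sketched as a map-style dataset. This is a minimal, dependency-free illustration rather than the script's code: `SketchVQADataset`, `build_prompt`, and the `tokenize` / `load_image` callables are hypothetical stand-ins for the real class, `tokenizer_image_token`, and `process_images`; the `<image>` placeholder, the prompt template, and the `image` / `text` keys of each question record are assumptions. A real implementation would subclass `torch.utils.data.Dataset` and return tensors.

```python
import os

IMAGE_TOKEN = "<image>"  # assumed placeholder marking where image features go


def build_prompt(question, system="A chat between a user and an assistant."):
    """Prepend the image token and wrap the question in a simple two-role template."""
    user_turn = f"{IMAGE_TOKEN}\n{question}"
    # The trailing "ASSISTANT:" leaves the answer slot open for generation.
    return f"{system} USER: {user_turn} ASSISTANT:"


class SketchVQADataset:
    """Map-style dataset: all per-sample preprocessing happens in __getitem__."""

    def __init__(self, questions, image_folder, tokenize, load_image):
        self.questions = list(questions)
        self.image_folder = image_folder
        self.tokenize = tokenize      # stand-in for tokenizer_image_token
        self.load_image = load_image  # stand-in for image loading + process_images

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, index):
        record = self.questions[index]
        prompt = build_prompt(record["text"])
        image = self.load_image(os.path.join(self.image_folder, record["image"]))
        input_ids = self.tokenize(prompt)
        return input_ids, image


def toy_tokenize(prompt):
    """Toy tokenizer: one integer per whitespace-separated word."""
    return [len(word) for word in prompt.split()]


def toy_load_image(path):
    """Toy loader: records the path instead of decoding pixels."""
    return {"path": path}


dataset = SketchVQADataset(
    [{"question_id": 1, "image": "cat.jpg", "text": "What animal is this?"}],
    "images", toy_tokenize, toy_load_image)
input_ids, image = dataset[0]
```

Because the work lives in `__getitem__`, DataLoader workers can prepare upcoming samples while the model is busy generating the current answer.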

create_data_loader function: Creates a DataLoader with a configurable number of workers (default: 4) for parallel data loading. It asserts batch_size == 1, because the script generates one sample at a time rather than padding variable-length prompts into a batch.

eval_model function: Loads the model, creates the DataLoader, and iterates over the samples it yields. It auto-detects plain (pre-training) models and switches to the mmtag conversation mode. Output decoding follows the standard pattern: strip the stop string from the generated text, then write one JSONL record per question with question_id, prompt, text, answer_id, model_id, and metadata.
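
The conversation-mode switch and the output record described above can be sketched as follows. This is a hedged illustration, not the script's code: the helper names `resolve_conv_mode`, `strip_stop_string`, and `make_answer_record` are hypothetical, the name-based test for a plain model is an assumption, the `</s>` stop string is only an example, and the standard-library `uuid` module stands in for whatever ID generator the script uses.

```python
import io
import json
import uuid


def resolve_conv_mode(model_name, conv_mode):
    """Switch to the mmtag variant for a plain model (assumed name-based check)."""
    is_plain = "plain" in model_name and "finetune" not in model_name.lower()
    if is_plain and "mmtag" not in conv_mode:
        return conv_mode + "_mmtag"
    return conv_mode


def strip_stop_string(text, stop_str):
    """Remove a trailing stop string and surrounding whitespace from decoded output."""
    text = text.strip()
    if stop_str and text.endswith(stop_str):
        text = text[: -len(stop_str)]
    return text.strip()


def make_answer_record(question_id, prompt, text, model_id):
    """Assemble one output record with the fields listed in the description."""
    return {
        "question_id": question_id,
        "prompt": prompt,
        "text": text,
        "answer_id": uuid.uuid4().hex,  # stand-in ID generator
        "model_id": model_id,
        "metadata": {},
    }


# Write one record to an in-memory "file" in JSONL form (one JSON object per line).
answers = io.StringIO()
decoded = " A cat sitting on a sofa.</s>"
record = make_answer_record(
    1, "What animal is this?", strip_stop_string(decoded, "</s>"), "demo-model")
answers.write(json.dumps(record) + "\n")
```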

Usage

Use this script for large-scale VQA benchmark evaluation (VQAv2, GQA, VizWiz), where parallel data-loading workers can reduce I/O bottlenecks relative to the sequential model_vqa.py script.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/eval/model_vqa_loader.py
Lines: 1-144

Signature

class CustomDataset(Dataset):
    def __init__(self, questions, image_folder, tokenizer, image_processor, model_config): ...
    def __getitem__(self, index) -> tuple: ...
    def __len__(self) -> int: ...

def create_data_loader(questions, image_folder, tokenizer, image_processor, model_config,
                       batch_size=1, num_workers=4) -> DataLoader: ...

def eval_model(args: argparse.Namespace) -> None: ...

Import

from llava.eval.model_vqa_loader import CustomDataset, create_data_loader, eval_model

I/O Contract

Inputs

Name	Type	Required	Description
--model-path	str	Yes	Path to the pretrained LLaVA model
--model-base	str	No	Base model path for LoRA or projector-only models
--image-folder	str	No	Root directory for image files
--question-file	str	No	Path to JSONL question file (default: tables/question.jsonl)
--answers-file	str	No	Path for output JSONL answers file (default: answer.jsonl)
--conv-mode	str	No	Conversation template name (default: llava_v1)
--num-chunks	int	No	Number of chunks for multi-GPU splitting (default: 1)
--chunk-idx	int	No	Index of the chunk to process (default: 0)
--temperature	float	No	Sampling temperature (default: 0.2)
--top_p	float	No	Top-p sampling parameter (default: None)
--num_beams	int	No	Number of beams for beam search (default: 1)
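
As a sketch of how the arguments in the table could be declared, and how `--num-chunks` and `--chunk-idx` might select one slice of the question list for a given GPU: the `build_parser`, `split_list`, and `get_chunk` names are hypothetical, and the ceiling-based chunk size is an assumption about how the split is done. Defaults are copied from the table; where the table lists none (`--model-base`, `--image-folder`), the values below are assumed.

```python
import argparse
import math


def build_parser():
    """Declare the command-line arguments listed in the Inputs table."""
    parser = argparse.ArgumentParser(description="VQA loader inference (sketch)")
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--model-base", type=str, default=None)  # assumed default
    parser.add_argument("--image-folder", type=str, default="")  # assumed default
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    return parser


def split_list(items, n):
    """Split a non-empty list into at most n contiguous chunks of near-equal size."""
    chunk_size = math.ceil(len(items) / n)
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def get_chunk(items, n, k):
    """Return the k-th of n chunks (0-indexed)."""
    return split_list(items, n)[k]


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--model-path", "/path/to/llava-model", "--num-chunks", "3", "--chunk-idx", "1"])
questions = list(range(10))  # stand-in for parsed question records
my_share = get_chunk(questions, args.num_chunks, args.chunk_idx)
```

With ten questions and three chunks, each process would handle at most four questions, and each chunk would typically be written to its own answers file and merged afterwards.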

Outputs

Name	Type	Description
answers file	JSONL	Each line contains question_id, prompt, text, answer_id, model_id, and metadata
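
A downstream consumer might read the answers file as shown here. This is a sketch only: `load_answers` is a hypothetical helper, and the sample record uses made-up values purely to show the field layout.

```python
import json
import os
import tempfile


def load_answers(path):
    """Parse a JSONL answers file into a dict keyed by question_id."""
    answers = {}
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            answers[record["question_id"]] = record
    return answers


# Round-trip an illustrative record through a temporary file.
sample = {"question_id": 7, "prompt": "What color is the bus?", "text": "Red",
          "answer_id": "abc123", "model_id": "demo-model", "metadata": {}}
with tempfile.TemporaryDirectory() as tmp_dir:
    answers_path = os.path.join(tmp_dir, "answers.jsonl")
    with open(answers_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps(sample) + "\n")
    loaded = load_answers(answers_path)
```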

Usage Examples

Basic Usage

# Command-line execution for batch VQA inference
python internvl_chat_llava/llava/eval/model_vqa_loader.py \
    --model-path /path/to/llava-model \
    --image-folder /path/to/images \
    --question-file questions.jsonl \
    --answers-file answers.jsonl \
    --conv-mode llava_v1 \
    --temperature 0
