Implementation:OpenGVLab InternVL VQA Inference
| Knowledge Sources | |
|---|---|
| Domains | Inference, VQA, Multimodal |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The core VQA inference script: it loads a LLaVA model and generates answers for a visual question answering dataset, processing the questions one at a time in a sequential loop.
Description
The model_vqa.py script implements the primary inference pipeline for VQA benchmark evaluation. The eval_model function:
- Loads the model via load_pretrained_model, which returns tokenizer, model, image_processor, and context_len
- Reads questions from a JSONL file and optionally selects a chunk for multi-GPU processing via split_list/get_chunk
- For each question, constructs a conversation prompt by prepending the appropriate image token (DEFAULT_IMAGE_TOKEN, optionally wrapped in start/end markers) and applying the conversation template
- Tokenizes the prompt using tokenizer_image_token to insert IMAGE_TOKEN_INDEX placeholders
- Preprocesses the image via the model's image processor
- Runs inference with model.generate() using configurable temperature, top_p, and num_beams parameters
- Post-processes output by stripping the stop string and writing results to JSONL
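The prompt-construction and post-processing steps above can be sketched without loading a model. The constant values below are assumptions (the real ones are imported from the llava package), and build_user_turn and strip_stop_string are hypothetical helper names used only for illustration.

```python
# Assumed placeholder values; the script imports the real constants from llava.
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"


def build_user_turn(question: str, use_im_start_end: bool) -> str:
    """Prepend the image placeholder (optionally wrapped) to the question."""
    if use_im_start_end:
        image_part = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    else:
        image_part = DEFAULT_IMAGE_TOKEN
    return image_part + "\n" + question


def strip_stop_string(text: str, stop_str: str) -> str:
    """Drop a trailing stop string from generated text, then trim whitespace."""
    text = text.strip()
    if stop_str and text.endswith(stop_str):
        text = text[: -len(stop_str)]
    return text.strip()
```

In the script itself, the resulting user turn is passed through the conversation template before tokenization; that step depends on the llava package and is left out of this sketch.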
The output format includes question_id, prompt, generated text, a unique answer_id (via shortuuid), model_id, and empty metadata. The script uses KeywordsStoppingCriteria to stop generation at the conversation separator token.
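As a rough illustration of the record layout described above, the following sketch builds one answer record and appends it to a JSONL file. The helper names are hypothetical, and uuid from the standard library stands in for shortuuid so the snippet has no third-party dependency.

```python
import json
import uuid


def make_answer_record(question_id, prompt, text, model_id):
    """Build one output record with the fields listed above."""
    return {
        "question_id": question_id,
        "prompt": prompt,
        "text": text,
        # Stand-in for shortuuid.uuid(); any unique string serves the same role.
        "answer_id": uuid.uuid4().hex,
        "model_id": model_id,
        "metadata": {},
    }


def append_answer(path, record):
    """Append a record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```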
Usage
Use this script as the primary VQA inference entry point for benchmarks like VQAv2, GQA, and TextVQA. It supports multi-GPU evaluation through chunk-based question splitting.
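One way to drive the chunk-based splitting is to launch one process per chunk, each with the same --num-chunks and a different --chunk-idx. The sketch below only builds the argument lists; chunk_commands and the per-chunk answers file naming are hypothetical, and only flags listed in the I/O Contract are used.

```python
def chunk_commands(model_path, question_file, image_folder, num_chunks, answers_prefix):
    """Return one argv list per chunk; each writes to its own answers file."""
    commands = []
    for idx in range(num_chunks):
        commands.append([
            "python", "internvl_chat_llava/llava/eval/model_vqa.py",
            "--model-path", model_path,
            "--question-file", question_file,
            "--image-folder", image_folder,
            # Hypothetical naming scheme: one output file per chunk.
            "--answers-file", f"{answers_prefix}_{idx}.jsonl",
            "--num-chunks", str(num_chunks),
            "--chunk-idx", str(idx),
        ])
    return commands
```

Each list could then be passed to subprocess.Popen with a different GPU made visible to each process, and the per-chunk files concatenated once all processes finish.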
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/eval/model_vqa.py
- Lines: 1-112
Signature
def split_list(lst: list, n: int) -> list: ...
def get_chunk(lst: list, n: int, k: int) -> list: ...
def eval_model(args: argparse.Namespace) -> None: ...
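The two chunking helpers can be sketched as follows. This is a minimal version assuming ceiling-based chunk sizes; the bodies in the repository may differ in detail.

```python
import math


def split_list(lst: list, n: int) -> list:
    """Split lst into consecutive chunks of size ceil(len(lst) / n)."""
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]


def get_chunk(lst: list, n: int, k: int) -> list:
    """Return the k-th chunk (0-indexed) of lst split into n parts."""
    return split_list(lst, n)[k]
```

Under this assumption the last chunk can be shorter than the others, and for some sizes fewer than n chunks are produced (for example, 5 items with n=4 yields 3 chunks), so a high chunk index may be out of range.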
Import
from llava.eval.model_vqa import eval_model
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model-path | str | Yes | Path to the pretrained LLaVA model |
| --model-base | str | No | Base model path for LoRA or projector-only models |
| --image-folder | str | No | Root directory for image files |
| --question-file | str | No | Path to JSONL question file (default: tables/question.jsonl) |
| --answers-file | str | No | Path for output JSONL answers file (default: answer.jsonl) |
| --conv-mode | str | No | Conversation template name (default: llava_v1) |
| --num-chunks | int | No | Number of chunks for multi-GPU splitting (default: 1) |
| --chunk-idx | int | No | Index of the chunk to process (default: 0) |
| --temperature | float | No | Sampling temperature (default: 0.2) |
| --top_p | float | No | Top-p sampling parameter (default: None) |
| --num_beams | int | No | Number of beams for beam search (default: 1) |
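The table above maps onto an argparse parser roughly as follows. This is a sketch derived from the table rather than copied from the script: build_parser is a hypothetical name, and the empty-string default for --image-folder is an assumption since the table gives none.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the Inputs table; defaults are taken from that table."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-folder", type=str, default="")  # assumed default
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    return parser
```

Note that argparse exposes the hyphenated flags with underscores (args.model_path, args.conv_mode), while --top_p and --num_beams are already spelled with underscores on the command line.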
Outputs
| Name | Type | Description |
|---|---|---|
| answers file | JSONL | Each line contains question_id, prompt, text, answer_id, model_id, and metadata |
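For downstream scoring, the answers file can be read back line by line. The sketch below assumes the field names listed in the table; load_answers is a hypothetical helper, not part of the script.

```python
import json


def load_answers(path):
    """Map question_id to generated text for every record in a JSONL answers file."""
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            answers[record["question_id"]] = record["text"]
    return answers
```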
Usage Examples
Basic Usage
# Command-line execution for VQA inference
python internvl_chat_llava/llava/eval/model_vqa.py \
    --model-path /path/to/llava-model \
    --image-folder /path/to/images \
    --question-file questions.jsonl \
    --answers-file answers.jsonl \
    --conv-mode llava_v1 \
    --temperature 0