Implementation:OpenGVLab InternVL VQA Inference

Knowledge Sources	OpenGVLab_InternVL
Domains	Inference, VQA, Multimodal
Last Updated	2026-02-07 14:00 GMT

Overview

model_vqa.py is the core VQA inference script: it loads a LLaVA model and generates answers for a visual question answering dataset, processing one question at a time.

Description

The model_vqa.py script implements the primary inference pipeline for VQA benchmark evaluation. The eval_model function:

Loads the model via load_pretrained_model (tokenizer, model, image_processor, context_len)
Reads questions from a JSONL file and optionally selects a chunk for multi-GPU processing via split_list / get_chunk
For each question, constructs a conversation prompt by prepending the image token (DEFAULT_IMAGE_TOKEN on its own, or wrapped in start/end markers when the model is configured to use them) and applying the conversation template
Tokenizes the prompt using tokenizer_image_token to insert IMAGE_TOKEN_INDEX placeholders
Preprocesses the image via the model's image processor
Runs inference with model.generate() using configurable temperature, top_p, and num_beams parameters
Post-processes output by stripping the stop string and writing results to JSONL
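The prompt-prefix and post-processing steps above can be sketched in plain Python. This is a minimal illustration, not the script's code: the token strings are assumed to match the usual LLaVA constants, and prepend_image_token and strip_stop_string are hypothetical helper names.

```python
# Assumed token strings, mirroring the usual LLaVA constants.
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"


def prepend_image_token(question: str, use_im_start_end: bool) -> str:
    """Put the image placeholder on its own line before the question text."""
    if use_im_start_end:
        image_part = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    else:
        image_part = DEFAULT_IMAGE_TOKEN
    return image_part + "\n" + question


def strip_stop_string(output: str, stop_str: str) -> str:
    """Trim whitespace and drop one trailing occurrence of the stop string."""
    output = output.strip()
    if stop_str and output.endswith(stop_str):
        output = output[: -len(stop_str)]
    return output.strip()
```

The resulting question text would then be passed through the conversation template before tokenization; that step depends on the llava package and is not reproduced here.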

The output format includes question_id, prompt, generated text, a unique answer_id (via shortuuid), model_id, and empty metadata. The script uses KeywordsStoppingCriteria to stop generation at the conversation separator token.
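A record with the fields listed above could be assembled roughly as follows. This sketch uses the standard-library uuid module as a stand-in for shortuuid so that it runs without extra dependencies; make_answer_line is a hypothetical helper name, and the exact key order in the real output is an assumption.

```python
import json
import uuid


def make_answer_line(question_id, prompt, text, model_id, answer_id=None):
    """Serialize one answer as a single JSONL line."""
    if answer_id is None:
        # Stand-in for shortuuid; any unique string serves the same purpose.
        answer_id = uuid.uuid4().hex
    record = {
        "question_id": question_id,
        "prompt": prompt,
        "text": text,
        "answer_id": answer_id,
        "model_id": model_id,
        "metadata": {},  # left empty, as described above
    }
    return json.dumps(record)
```

Writing one such line per question, each followed by a newline, yields a file that downstream evaluation scripts can read line by line.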

Usage

Use this script as the primary VQA inference entry point for benchmarks like VQAv2, GQA, and TextVQA. It supports multi-GPU evaluation through chunk-based question splitting.
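The chunk-based splitting can be illustrated with a short sketch. It assumes a ceiling-based chunk size, which is the common approach in LLaVA-style evaluation scripts; the actual function bodies in model_vqa.py may differ in detail.

```python
import math


def split_list(lst, n):
    """Split lst into consecutive chunks of size ceil(len(lst) / n)."""
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]


def get_chunk(lst, n, k):
    """Return the k-th chunk (0-indexed) out of n."""
    return split_list(lst, n)[k]


questions = list(range(10))
print(split_list(questions, 3))    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(get_chunk(questions, 3, 2))  # [8, 9]
```

Under this assumption the last chunk can be shorter than the others, and when the list is short relative to n the split can produce fewer than n chunks, so a high chunk index may be out of range.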

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/eval/model_vqa.py
Lines: 1-112

Signature

def split_list(lst: list, n: int) -> list: ...

def get_chunk(lst: list, n: int, k: int) -> list: ...

def eval_model(args: argparse.Namespace) -> None: ...

Import

from llava.eval.model_vqa import eval_model

I/O Contract

Inputs

Name	Type	Required	Description
--model-path	str	Yes	Path to the pretrained LLaVA model
--model-base	str	No	Base model path for LoRA or projector-only models
--image-folder	str	No	Root directory for image files
--question-file	str	No	Path to JSONL question file (default: tables/question.jsonl)
--answers-file	str	No	Path for output JSONL answers file (default: answer.jsonl)
--conv-mode	str	No	Conversation template name (default: llava_v1)
--num-chunks	int	No	Number of chunks for multi-GPU splitting (default: 1)
--chunk-idx	int	No	Index of the chunk to process (default: 0)
--temperature	float	No	Sampling temperature (default: 0.2)
--top_p	float	No	Top-p sampling parameter (default: None)
--num_beams	int	No	Number of beams for beam search (default: 1)
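The table above maps naturally onto an argparse parser. The sketch below mirrors the listed flags and defaults; build_parser is a hypothetical helper, marking --model-path as required follows the table rather than the source file, and the empty-string default for --image-folder is an assumption because the table lists none.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the inputs table (sketch, not the script's code)."""
    parser = argparse.ArgumentParser(description="VQA inference (sketch)")
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-folder", type=str, default="")  # assumed default
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    return parser


args = build_parser().parse_args(["--model-path", "/path/to/llava-model"])
print(args.conv_mode, args.temperature)  # llava_v1 0.2
```

Note that argparse converts hyphens in flag names to underscores, so --model-path is read as args.model_path, while --top_p and --num_beams already use underscores.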

Outputs

Name	Type	Description
answers file	JSONL	Each line contains question_id, prompt, text, answer_id, model_id, and metadata
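To consume the answers file downstream, a reader along these lines may be enough. It assumes one JSON object per line with at least the question_id and text fields described above; load_answers is a hypothetical helper name.

```python
import json


def load_answers(path):
    """Map question_id to generated text from a JSONL answers file."""
    answers = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            answers[record["question_id"]] = record["text"]
    return answers
```

If the same question_id appears more than once, for example after concatenating overlapping chunk outputs, the last occurrence wins in this sketch.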

Usage Examples

Basic Usage

# Command-line execution for VQA inference
python internvl_chat_llava/llava/eval/model_vqa.py \
    --model-path /path/to/llava-model \
    --image-folder /path/to/images \
    --question-file questions.jsonl \
    --answers-file answers.jsonl \
    --conv-mode llava_v1 \
    --temperature 0
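For chunked multi-GPU runs, one command per chunk is needed, each with its own --chunk-idx. The sketch below only builds the argument lists and does not launch anything; build_chunk_commands and the per-chunk answers file naming scheme are hypothetical, and only flags listed in the inputs table are used.

```python
import sys

SCRIPT = "internvl_chat_llava/llava/eval/model_vqa.py"


def build_chunk_commands(model_path, question_file, image_folder,
                         answers_prefix, num_chunks):
    """Return one argv list per chunk, each writing its own answers file."""
    commands = []
    for idx in range(num_chunks):
        commands.append([
            sys.executable, SCRIPT,
            "--model-path", model_path,
            "--question-file", question_file,
            "--image-folder", image_folder,
            # Hypothetical naming: <prefix>_<chunk index>.jsonl
            "--answers-file", f"{answers_prefix}_{idx}.jsonl",
            "--num-chunks", str(num_chunks),
            "--chunk-idx", str(idx),
        ])
    return commands
```

Each list could be passed to subprocess.Popen with a different device assignment in the environment, and the per-chunk answers files concatenated afterwards; both of those steps are left out here because they depend on the local setup.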

Related Pages

Principle:OpenGVLab_InternVL_Model_Inference_Loading
