

Implementation:OpenGVLab InternVL VQA Inference

From Leeroopedia


Knowledge Sources
Domains Inference, VQA, Multimodal
Last Updated 2026-02-07 14:00 GMT

Overview

model_vqa.py is the core VQA inference script. It loads a LLaVA model and generates an answer for each question in a visual question answering dataset, processing the questions one at a time in a sequential loop.

Description

The model_vqa.py script implements the primary inference pipeline for VQA benchmark evaluation. The eval_model function:

  1. Loads the model via load_pretrained_model (tokenizer, model, image_processor, context_len)
  2. Reads questions from a JSONL file and optionally selects a chunk for multi-GPU processing via split_list / get_chunk
  3. For each question, constructs a conversation prompt by prepending the appropriate image token (DEFAULT_IMAGE_TOKEN alone, or wrapped in start/end marker tokens) to the question text and applying the conversation template
  4. Tokenizes the prompt using tokenizer_image_token to insert IMAGE_TOKEN_INDEX placeholders
  5. Preprocesses the image via the model's image processor
  6. Runs inference with model.generate() using configurable temperature, top_p, and num_beams parameters
  7. Post-processes output by stripping the stop string and writing results to JSONL
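
Step 3 can be sketched in isolation. The snippet below is a minimal, self-contained illustration rather than the script's own code: the token strings are assumed to match the constants in llava.constants, and build_question_text is a hypothetical helper name. In the script, the resulting text is then passed through the conversation template before tokenization.

```python
# Assumed values of the constants in llava.constants; verify against your checkout.
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"


def build_question_text(question: str, use_im_start_end: bool) -> str:
    """Prepend the image placeholder to the question text (hypothetical helper)."""
    if use_im_start_end:
        # Variant with explicit start/end markers around the image token.
        image_token = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    else:
        image_token = DEFAULT_IMAGE_TOKEN
    return image_token + "\n" + question


print(build_question_text("What color is the bus?", use_im_start_end=False))
```

The placeholder text matters because tokenizer_image_token later replaces it with IMAGE_TOKEN_INDEX, which marks where the image features are spliced into the input sequence.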

Each output record contains question_id, prompt, text (the generated answer), a unique answer_id (generated with shortuuid), model_id, and an empty metadata object. The script uses KeywordsStoppingCriteria to stop generation once the conversation separator string appears in the output.
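
The post-processing and record layout described above might look roughly like the following sketch. It is not the script's code: strip_stop_string and make_answer_record are hypothetical helper names, the stop string "</s>" and the model name are example values, and uuid4 from the standard library stands in for shortuuid so the snippet runs without extra dependencies.

```python
import json
import uuid


def strip_stop_string(text: str, stop_str: str) -> str:
    """Trim whitespace and drop a trailing stop string, if present."""
    text = text.strip()
    if stop_str and text.endswith(stop_str):
        text = text[: -len(stop_str)]
    return text.strip()


def make_answer_record(question_id, prompt: str, text: str, model_id: str) -> dict:
    """Assemble one output line with the six documented fields."""
    return {
        "question_id": question_id,
        "prompt": prompt,
        "text": text,
        # The script uses shortuuid; uuid4().hex is a stdlib stand-in here.
        "answer_id": uuid.uuid4().hex,
        "model_id": model_id,
        "metadata": {},
    }


raw_output = " A red double-decker bus.</s>"  # example decoder output
record = make_answer_record(
    question_id=1,
    prompt="What is in the image?",
    text=strip_stop_string(raw_output, "</s>"),
    model_id="llava-example",  # hypothetical model name
)
print(json.dumps(record))  # one JSONL line
```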

Usage

Use this script as the primary VQA inference entry point for benchmarks like VQAv2, GQA, and TextVQA. It supports multi-GPU evaluation through chunk-based question splitting.

Code Reference

Source Location

Signature

def split_list(lst: list, n: int) -> list: ...

def get_chunk(lst: list, n: int, k: int) -> list: ...

def eval_model(args: argparse.Namespace) -> None: ...
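
The signatures above leave the chunking behavior implicit. One plausible implementation, consistent with those signatures, uses ceiling division to form consecutive chunks; treat this as an assumption about the script's behavior, not a copy of it.

```python
import math


def split_list(lst: list, n: int) -> list:
    """Split lst into at most n consecutive chunks of near-equal size."""
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]


def get_chunk(lst: list, n: int, k: int) -> list:
    """Return the k-th (0-indexed) of the n chunks."""
    return split_list(lst, n)[k]


questions = list(range(10))  # stand-in for the parsed question records
print(split_list(questions, 3))
print(get_chunk(questions, 3, 2))
```

Under this sketch, ceiling division can yield fewer than n chunks (5 items with n=4 gives only 3 chunks), so a high chunk index may be out of range, and an empty list raises an error because the step size becomes zero.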

Import

from llava.eval.model_vqa import eval_model

I/O Contract

Inputs

Name             Type   Required  Description
--model-path     str    Yes       Path to the pretrained LLaVA model
--model-base     str    No        Base model path for LoRA or projector-only models
--image-folder   str    No        Root directory for image files
--question-file  str    No        Path to JSONL question file (default: tables/question.jsonl)
--answers-file   str    No        Path for output JSONL answers file (default: answer.jsonl)
--conv-mode      str    No        Conversation template name (default: llava_v1)
--num-chunks     int    No        Number of chunks for multi-GPU splitting (default: 1)
--chunk-idx      int    No        Index of the chunk to process (default: 0)
--temperature    float  No        Sampling temperature (default: 0.2)
--top_p          float  No        Top-p sampling parameter (default: None)
--num_beams      int    No        Number of beams for beam search (default: 1)
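
The table above maps naturally onto an argparse parser. The sketch below mirrors the documented flags and defaults; build_parser is a hypothetical name, the empty-string default for --image-folder is an assumption (the table lists none), and the script's actual parser may differ in details.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented inputs (illustrative only)."""
    parser = argparse.ArgumentParser(description="VQA inference arguments (sketch)")
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-folder", type=str, default="")  # assumed default
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    return parser


args = build_parser().parse_args(["--model-path", "/path/to/llava-model"])
print(args.question_file, args.temperature, args.num_beams)
```

Note that argparse converts hyphens in flag names to underscores, so --model-path is read back as args.model_path, while --top_p and --num_beams keep their names unchanged.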

Outputs

Name          Type   Description
answers file  JSONL  Each line contains question_id, prompt, text, answer_id, model_id, and metadata
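
A downstream consumer can check each line of the answers file against this contract. The following is a minimal sketch assuming the six field names listed above; read_answers is a hypothetical helper, and the sample record is made up for illustration.

```python
import io
import json

EXPECTED_KEYS = {"question_id", "prompt", "text", "answer_id", "model_id", "metadata"}


def read_answers(lines) -> list:
    """Parse JSONL lines and verify that every record has the expected keys."""
    answers = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        answers.append(record)
    return answers


# In practice, pass an open file handle; StringIO keeps the example self-contained.
sample_line = json.dumps({
    "question_id": 7,
    "prompt": "What is the man holding?",
    "text": "An umbrella.",
    "answer_id": "abc123",
    "model_id": "llava-example",
    "metadata": {},
})
answers = read_answers(io.StringIO(sample_line + "\n"))
print(answers[0]["question_id"], answers[0]["text"])
```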

Usage Examples

Basic Usage

# Command-line execution for VQA inference
python internvl_chat_llava/llava/eval/model_vqa.py \
    --model-path /path/to/llava-model \
    --image-folder /path/to/images \
    --question-file questions.jsonl \
    --answers-file answers.jsonl \
    --conv-mode llava_v1 \
    --temperature 0
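
For multi-GPU evaluation, the same command can be issued once per chunk with different --num-chunks and --chunk-idx values. The sketch below only builds and prints the commands; it does not launch them. Pinning one GPU per chunk via CUDA_VISIBLE_DEVICES and the per-chunk answers file naming are assumptions for illustration, and build_chunk_commands is a hypothetical helper.

```python
import shlex

SCRIPT = "internvl_chat_llava/llava/eval/model_vqa.py"


def build_chunk_commands(model_path, image_folder, question_file,
                         answers_prefix, num_chunks):
    """Return one (env, argv) pair per chunk, with chunk i assigned to GPU i."""
    commands = []
    for idx in range(num_chunks):
        argv = [
            "python", SCRIPT,
            "--model-path", model_path,
            "--image-folder", image_folder,
            "--question-file", question_file,
            # Hypothetical naming scheme: one answers file per chunk.
            "--answers-file", f"{answers_prefix}_{num_chunks}_{idx}.jsonl",
            "--num-chunks", str(num_chunks),
            "--chunk-idx", str(idx),
            "--conv-mode", "llava_v1",
            "--temperature", "0",
        ]
        env = {"CUDA_VISIBLE_DEVICES": str(idx)}
        commands.append((env, argv))
    return commands


for env, argv in build_chunk_commands(
    "/path/to/llava-model", "/path/to/images", "questions.jsonl", "answers", 2
):
    print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} {shlex.join(argv)}")
```

Once all chunks finish, the per-chunk answers files can be concatenated into a single JSONL file for scoring.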
