
Implementation:Haotian Liu LLaVA Model VQA Loader Eval

From Leeroopedia

Overview

Primary batch inference engine for running multi-GPU evaluation of LLaVA models on VQA benchmarks. Provides a DataLoader-based pipeline that handles model loading, image preprocessing, prompt construction, and answer generation.

Description

model_vqa_loader.py provides the eval_model() function which loads a pretrained LLaVA model, creates a DataLoader from a CustomDataset, and iterates through evaluation questions generating answers. The CustomDataset handles image loading from disk, CLIP vision preprocessing via process_images(), conversation template formatting with image token injection, and tokenization via tokenizer_image_token(). Results are written as JSONL with fields: question_id, prompt, text, answer_id, model_id.

The module also provides helper functions split_list() and get_chunk() for dividing questions across multiple GPU processes, and a collate_fn for batching tensors in the DataLoader.
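The chunking helpers are small; a minimal sketch of the described behavior (contiguous, ceiling-sized chunks, one per GPU process) might look like this:

```python
import math

def split_list(lst, n):
    """Split lst into n contiguous chunks of (roughly) equal size."""
    chunk_size = math.ceil(len(lst) / n)  # ceiling division
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    """Return the k-th of n chunks (the slice this GPU process handles)."""
    return split_list(lst, n)[k]
```

With num_chunks=4 and chunk_idx=2, each process sees only its own quarter of the question list, which is what lets the launch script below run one independent process per GPU.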

Source

  • llava/eval/model_vqa_loader.py:L79-126 (eval_model)
  • llava/eval/model_vqa_loader.py:L31-61 (CustomDataset)
  • llava/eval/model_vqa_loader.py:L72-76 (create_data_loader)
  • llava/eval/model_vqa_loader.py:L19-27 (split_list, get_chunk)

API Signature

def eval_model(args) -> None:
    """
    Main evaluation function. Loads model, processes questions, writes answers.

    args attributes:
        model_path: str          # Path to pretrained model checkpoint
        model_base: str = None   # Base model for LoRA checkpoints
        image_folder: str        # Root directory containing images
        question_file: str       # Path to input JSONL question file
        answers_file: str        # Path to output JSONL answer file
        num_chunks: int = 1      # Total number of GPU chunks
        chunk_idx: int = 0       # Index of current chunk (0 to num_chunks-1)
        conv_mode: str = 'llava_v1'  # Conversation template name
        temperature: float = 0.2 # Sampling temperature (0 = greedy)
        top_p: float = None      # Nucleus sampling parameter
        num_beams: int = 1       # Beam search width
        max_new_tokens: int = 128  # Maximum generated tokens
    """


class CustomDataset(Dataset):
    """PyTorch Dataset for evaluation questions with image preprocessing."""

    def __init__(self, questions: list, image_folder: str, tokenizer,
                 image_processor, model_config) -> None:
        """
        Args:
            questions: List of dicts with 'image' and 'text' keys
            image_folder: Root directory for image files
            tokenizer: LLaVA tokenizer instance
            image_processor: CLIP image processor
            model_config: Model configuration object
        """

    def __getitem__(self, index) -> tuple:
        """Returns (input_ids: Tensor, image_tensor: Tensor, image_size: tuple[int,int])"""

    def __len__(self) -> int:
        """Returns number of questions."""


def create_data_loader(questions: list, image_folder: str, tokenizer,
                       image_processor, model_config,
                       batch_size: int = 1, num_workers: int = 4) -> DataLoader:
    """
    Creates DataLoader with batch_size=1 (enforced by assertion).
    Uses collate_fn to stack input_ids and image_tensors.
    """

Import

from llava.eval.model_vqa_loader import eval_model

Inputs

Input             Format           Description
Model checkpoint  Directory path   Pretrained LLaVA model (e.g., liuhaotian/llava-v1.5-13b)
Question file     JSONL            One JSON object per line with question_id, image, and text fields
Image folder      Directory path   Root directory containing referenced image files
Chunk params      int, int         num_chunks (total GPUs) and chunk_idx (this GPU's index)

Outputs

Answer JSONL file with one JSON object per line:

{
    "question_id": 12345,
    "prompt": "What is shown in this image?\nAnswer the question using a single word or phrase.",
    "text": "a cat sitting on a table",
    "answer_id": "abc123shortuuid",
    "model_id": "llava-v1.5-13b",
    "metadata": {}
}
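
Each output line is an independent JSON object, so downstream scoring can parse the file line by line. A small helper (hypothetical, not part of the module) to index answers by question_id:

```python
import json

def load_answers(path):
    """Read an answers JSONL file into a question_id -> answer text mapping."""
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                record = json.loads(line)
                answers[record["question_id"]] = record["text"]
    return answers
```

This line-at-a-time shape is also what makes the chunk-merge step in the launch script below a simple cat of per-GPU files.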

Usage Example

Multi-GPU Launch from vqav2.sh

#!/bin/bash
gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
IFS=',' read -ra GPULIST <<< "$gpu_list"
CHUNKS=${#GPULIST[@]}

CKPT="llava-v1.5-13b"
SPLIT="llava_vqav2_mscoco_test-dev2015"

# Launch parallel inference across all GPUs
for IDX in $(seq 0 $((CHUNKS-1))); do
    CUDA_VISIBLE_DEVICES=${GPULIST[$IDX]} python -m llava.eval.model_vqa_loader \
        --model-path liuhaotian/llava-v1.5-13b \
        --question-file ./playground/data/eval/vqav2/$SPLIT.jsonl \
        --image-folder ./playground/data/eval/vqav2/test2015 \
        --answers-file ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl \
        --num-chunks $CHUNKS \
        --chunk-idx $IDX \
        --temperature 0 \
        --conv-mode vicuna_v1 &
done
wait

# Merge all chunk files into a single answer file
output_file=./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/merge.jsonl
> "$output_file"
for IDX in $(seq 0 $((CHUNKS-1))); do
    cat ./playground/data/eval/vqav2/answers/$SPLIT/$CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done

Single-GPU Inference

python -m llava.eval.model_vqa_loader \
    --model-path liuhaotian/llava-v1.5-13b \
    --question-file ./playground/data/eval/textvqa/llava_textvqa_val_v051_ocr.jsonl \
    --image-folder ./playground/data/eval/textvqa/train_val_images \
    --answers-file ./playground/data/eval/textvqa/answers/llava-v1.5-13b.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

Key Implementation Details

  • Image token injection: The CustomDataset.__getitem__ method prepends <image> token to the question text. If mm_use_im_start_end is enabled, it wraps the image token with <im_start> and <im_end> tokens.
  • Conversation template: Uses conv_templates[args.conv_mode].copy() to construct the prompt with proper role prefixes (e.g., "USER:" and "ASSISTANT:" for vicuna_v1).
  • Plain model detection: Auto-switches to mmtag conversation mode if a plain (non-finetuned) model is detected.
  • Output directory creation: Automatically creates the output directory via os.makedirs(..., exist_ok=True).
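
The image-token injection described above can be sketched as follows; the token strings mirror the special tokens named in the first bullet, but their canonical values live in llava.constants, so treat these literals as illustrative:

```python
# Illustrative values; the real constants are defined in llava.constants.
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"

def inject_image_token(question: str, mm_use_im_start_end: bool = False) -> str:
    """Prepend the image placeholder (optionally wrapped) to the question text."""
    if mm_use_im_start_end:
        tok = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    else:
        tok = DEFAULT_IMAGE_TOKEN
    return tok + "\n" + question
```

The resulting string is then placed into the conversation template as the user turn before tokenization with tokenizer_image_token().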

Metadata

Property      Value
last_updated  2026-02-13 14:00 GMT
page_type     Implementation (API Doc)
workflow      Benchmark_Evaluation
source_file   llava/eval/model_vqa_loader.py
