Workflow:Open compass VLMEvalKit API Model Evaluation

Knowledge Sources	VLMEvalKit Quickstart Guide Config System
Domains	VLM_Evaluation, API_Integration, Benchmarking
Last Updated	2026-02-14 00:00 GMT

Overview

End-to-end process for evaluating commercial API-based Vision-Language Models (GPT-4o, Claude, Gemini, etc.) on benchmarks using VLMEvalKit's parallel API calling infrastructure.

Description

This workflow covers evaluating VLMs accessed through provider APIs rather than locally loaded models. API model evaluation differs from local model evaluation in several ways: it requires API key configuration, uses parallel HTTP request dispatching instead of GPU-based inference, employs retry logic with exponential backoff for reliability, and encodes images as base64 for transmission. VLMEvalKit supports 30+ API providers including OpenAI, Anthropic, Google, Alibaba Qwen, Tencent Hunyuan, and many others.

Usage

Execute this workflow when you need to evaluate a commercial or hosted VLM that is accessed via HTTP API. You should have valid API credentials for the target provider and sufficient API quota/budget for the evaluation. No GPU is required for the evaluation client itself, though a judge LLM API key is still needed for open-ended evaluation.

Execution Steps

Step 1: Configure API Credentials

Set up API keys for both the target model provider and any judge LLMs. Create a .env file at the VLMEvalKit root with provider-specific keys (e.g., OPENAI_API_KEY, GOOGLE_API_KEY, DASHSCOPE_API_KEY). Each API wrapper class reads its corresponding environment variable.

Key considerations:

Each provider has its own environment variable naming convention
Some providers require multiple credentials (e.g., Hunyuan needs both SECRET_KEY and SECRET_ID)
Set EVAL_PROXY for routing API calls through a proxy during evaluation
API keys can also be set directly as environment variables instead of the .env file

Step 2: Select API Model

Choose an API model from the registry. API models are defined in vlmeval/config.py under provider-specific dictionaries and accessed via the supported_VLM registry. Each API model is backed by a wrapper class in vlmeval/api/ that extends BaseAPI.

Key considerations:

Use vlmutil mlist api to list available API models
API models are identified by name strings (e.g., GPT4o, Claude3.5-Sonnet, GeminiPro1-5)
Custom API model configurations can be defined in the JSON config system
The FWD_API environment variable can route all API models through the OpenAI-compatible interface

Step 3: Configure Parallel Calling Parameters

Set the number of parallel API request threads and retry behavior. API evaluation uses concurrent HTTP requests with progress tracking via track_progress_rich().

What happens:

The --api-nproc argument controls parallelism (default: 4 threads)
BaseAPI provides automatic retry with exponential backoff on failures
--retry sets the maximum number of retry attempts per request
--verbose enables detailed logging of API interactions
Failed API calls are marked with a sentinel message and can be retried

Step 4: Run API Inference

Launch the evaluation using python run.py (not torchrun, since no GPU parallelism is needed). The API inference engine (infer_data_api) builds prompts, dispatches parallel API requests, tracks progress, and handles retries.

What happens:

Prompts are built for each dataset sample (text + base64-encoded images)
Requests are dispatched in parallel via thread pool
Progress is tracked with a rich progress bar
Results are periodically checkpointed to pickle files
On completion, results are merged into the final prediction file
Already-completed indices are skipped on resume

Step 5: Run Evaluation with Judge LLM

After inference, evaluation proceeds identically to local model evaluation. The judge LLM (typically GPT-based) is configured automatically based on the benchmark type, or can be overridden with --judge.

Key considerations:

API evaluation results may contain API failure markers that need attention
Use vlmutil scan to detect and report API failures in results
Budget for both the target model API costs and the judge model API costs
A local judge LLM can be deployed via LMDeploy to avoid judge API costs

Step 6: Review and Compare Results

Examine evaluation results and compare API model performance against local models or other API models. Results follow the same file format as local model evaluation.

Key considerations:

API models may have rate limits that affect throughput
Re-running with --reuse skips already-completed predictions
Use scripts/summarize.py to create comparison tables across models
The OpenVLM Leaderboard provides reference scores for many API models

Execution Diagram

GitHub URL

Workflow Repository