Workflow:Open compass VLMEvalKit API Model Evaluation
| Knowledge Sources | |
|---|---|
| Domains | VLM_Evaluation, API_Integration, Benchmarking |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
End-to-end process for evaluating commercial API-based Vision-Language Models (GPT-4o, Claude, Gemini, etc.) on benchmarks using VLMEvalKit's parallel API calling infrastructure.
Description
This workflow covers evaluating VLMs accessed through provider APIs rather than locally loaded models. API model evaluation differs from local model evaluation in several ways: it requires API key configuration, uses parallel HTTP request dispatching instead of GPU-based inference, employs retry logic with exponential backoff for reliability, and encodes images as base64 for transmission. VLMEvalKit supports 30+ API providers including OpenAI, Anthropic, Google, Alibaba Qwen, Tencent Hunyuan, and many others.
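To make the base64 step concrete, here is a minimal sketch of packing an image into a chat message. The helper names `encode_image_bytes` and `build_message` are hypothetical, and the payload shape is an assumption modeled on the OpenAI chat format; other providers' wrappers may structure the request differently.

```python
import base64


def encode_image_bytes(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Return a data URL that carries the image as base64 text."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"


def build_message(question: str, image_bytes: bytes) -> list:
    """Build OpenAI-style message content: one text part, one image part."""
    return [
        {"type": "text", "text": question},
        {"type": "image_url",
         "image_url": {"url": encode_image_bytes(image_bytes)}},
    ]
```

Because base64 inflates the payload by roughly a third, large images are usually resized before encoding to stay within request size limits.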
Usage
Execute this workflow when you need to evaluate a commercial or hosted VLM that is accessed via an HTTP API. You should have valid API credentials for the target provider and sufficient API quota and budget for the evaluation. No GPU is required for the evaluation client itself, though open-ended benchmarks still need access to a judge LLM, normally through an API key.
Execution Steps
Step 1: Configure API Credentials
Set up API keys for both the target model provider and any judge LLMs. Create a .env file at the VLMEvalKit root with provider-specific keys (e.g., OPENAI_API_KEY, GOOGLE_API_KEY, DASHSCOPE_API_KEY). Each API wrapper class reads its corresponding environment variable.
Key considerations:
- Each provider has its own environment variable naming convention
- Some providers require multiple credentials (e.g., Hunyuan needs both HUNYUAN_SECRET_KEY and HUNYUAN_SECRET_ID)
- Set EVAL_PROXY to route API calls through a proxy during evaluation
- API keys can also be exported directly as environment variables instead of being placed in the .env file
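The credential setup above can be sanity-checked before a long run. The following is a minimal sketch of a .env loader and a missing-key check, written with the standard library only; `load_env_file` and `missing_keys` are hypothetical helpers, not VLMEvalKit functions, and the toolkit's own loader may behave differently.

```python
import os


def load_env_file(path: str, override: bool = False) -> dict:
    """Parse KEY=VALUE lines and export them to os.environ.

    Blank lines, comments, and lines without '=' are skipped. Existing
    environment variables win unless override=True.
    """
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key, value = key.strip(), value.strip().strip("'\"")
            if override or key not in os.environ:
                os.environ[key] = value
            loaded[key] = value
    return loaded


def missing_keys(required) -> list:
    """Return the required variable names that are unset or empty."""
    return [k for k in required if not os.environ.get(k)]
```

Calling `missing_keys(["OPENAI_API_KEY"])` before launching an evaluation surfaces a forgotten key immediately rather than after the first failed request.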
Step 2: Select API Model
Choose an API model from the registry. API models are defined in vlmeval/config.py under provider-specific dictionaries and accessed via the supported_VLM registry. Each API model is backed by a wrapper class in vlmeval/api/ that extends BaseAPI.
Key considerations:
- Use vlmutil mlist api to list available API models
- API models are identified by name strings (e.g., GPT4o, Claude3.5-Sonnet, GeminiPro1-5)
- Custom API model configurations can be defined in the JSON config system
- The FWD_API environment variable can route all API models through the OpenAI-compatible interface
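The registry pattern described in this step can be illustrated with a small sketch: a dictionary that maps model names to pre-configured constructors. Everything here is a stand-in (`BaseAPIStub`, the two wrapper classes, `registry`, `build_model`, and the provider model strings are all hypothetical); the real definitions live in vlmeval/config.py and vlmeval/api/.

```python
from functools import partial


class BaseAPIStub:
    """Hypothetical stand-in for a provider wrapper base class."""

    def __init__(self, model: str, retry: int = 5, verbose: bool = False):
        self.model, self.retry, self.verbose = model, retry, verbose


class OpenAIStyleWrapper(BaseAPIStub):
    """Placeholder for an OpenAI-compatible wrapper."""


class ClaudeStyleWrapper(BaseAPIStub):
    """Placeholder for an Anthropic-style wrapper."""


# Name string -> constructor with provider settings already bound.
registry = {
    "GPT4o": partial(OpenAIStyleWrapper, model="gpt-4o"),
    "Claude3.5-Sonnet": partial(ClaudeStyleWrapper, model="claude-3-5-sonnet"),
}


def build_model(name: str, **overrides):
    """Instantiate a registered model, applying per-run overrides."""
    if name not in registry:
        raise KeyError(f"Unknown model {name!r}; known: {sorted(registry)}")
    return registry[name](**overrides)
```

Binding settings with `partial` keeps the lookup cheap: nothing is constructed, and no credentials are read, until a model is actually requested.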
Step 3: Configure Parallel Calling Parameters
Set the number of parallel API request threads and retry behavior. API evaluation uses concurrent HTTP requests with progress tracking via track_progress_rich().
What happens:
- The --api-nproc argument controls parallelism (default: 4 threads)
- BaseAPI provides automatic retry with exponential backoff on failures
- --retry sets the maximum number of retry attempts per request
- --verbose enables detailed logging of API interactions
- API calls that still fail after all retries are recorded with a sentinel failure message, so they can be detected and re-run later
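As an illustration of the retry behavior listed above, the sketch below wraps a request in a capped exponential backoff loop with jitter and falls back to a sentinel string. It is not BaseAPI's actual code: `call_with_retry`, its parameters, and the exact `FAIL_MSG` text are assumptions made for the example.

```python
import random
import time

# Assumed sentinel text; check the toolkit source for the real marker.
FAIL_MSG = "Failed to obtain answer via API."


def call_with_retry(request_fn, retry=5, base_wait=1.0, max_wait=30.0,
                    sleep=time.sleep):
    """Call request_fn up to `retry` times, backing off between attempts.

    The wait doubles after each failure (capped at max_wait) and is
    jittered so parallel threads do not retry in lockstep.
    """
    for attempt in range(retry):
        try:
            return request_fn()
        except Exception:
            if attempt == retry - 1:
                break  # out of attempts; fall through to the sentinel
            wait = min(max_wait, base_wait * 2 ** attempt)
            sleep(wait * (0.5 + random.random() / 2))
    return FAIL_MSG
```

Returning a sentinel instead of raising lets one bad sample fail without aborting the other requests in flight.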
Step 4: Run API Inference
Launch the evaluation using python run.py (not torchrun, since no GPU parallelism is needed). The API inference engine (infer_data_api) builds prompts, dispatches parallel API requests, tracks progress, and handles retries.
What happens:
- Prompts are built for each dataset sample (text + base64-encoded images)
- Requests are dispatched in parallel via a thread pool
- Progress is tracked with a rich progress bar
- Results are periodically checkpointed to pickle files
- On completion, results are merged into the final prediction file
- Already-completed indices are skipped on resume
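The dispatch, checkpoint, and resume steps above can be sketched with the standard library. This is a simplified illustration rather than the body of infer_data_api: `run_parallel`, `infer_fn`, `ckpt_path`, and `save_every` are hypothetical names, and the real engine also handles prompt building and progress display.

```python
import os
import pickle
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(samples, infer_fn, ckpt_path, nproc=4, save_every=10):
    """Run infer_fn over {index: prompt}, checkpointing to a pickle file."""
    results = {}
    if os.path.exists(ckpt_path):
        # Resume: load whatever a previous run already finished.
        with open(ckpt_path, "rb") as fh:
            results = pickle.load(fh)
    todo = {i: p for i, p in samples.items() if i not in results}

    def save():
        with open(ckpt_path, "wb") as fh:
            pickle.dump(results, fh)

    with ThreadPoolExecutor(max_workers=nproc) as pool:
        futures = {pool.submit(infer_fn, p): i for i, p in todo.items()}
        for done, fut in enumerate(as_completed(futures), start=1):
            results[futures[fut]] = fut.result()
            if done % save_every == 0:
                save()  # periodic checkpoint
    save()  # final write, covering any remainder
    return results
```

Threads suit this workload because each worker spends nearly all its time waiting on the network, so the interpreter lock is not a bottleneck.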
Step 5: Run Evaluation with Judge LLM
After inference, evaluation proceeds identically to local model evaluation. The judge LLM (typically GPT-based) is configured automatically based on the benchmark type, or can be overridden with --judge.
Key considerations:
- Prediction files may contain API failure markers; find and re-run those samples before trusting the scores
- Use vlmutil scan to detect and report API failures in results
- Budget for both the target model API costs and the judge model API costs
- A local judge LLM can be deployed via LMDeploy to avoid judge API costs
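For a sense of what a failure scan involves, here is a minimal sketch that counts predictions containing a failure marker. It does not reproduce vlmutil scan; `find_failed`, `failure_report`, and the `FAIL_MSG` text are assumptions for illustration, and predictions are modeled as a plain dict of index to answer.

```python
# Assumed marker text; check the toolkit source for the real sentinel.
FAIL_MSG = "Failed to obtain answer via API."


def find_failed(predictions: dict, fail_msg: str = FAIL_MSG) -> list:
    """Return sorted indices whose prediction contains the failure marker."""
    return sorted(i for i, p in predictions.items() if fail_msg in str(p))


def failure_report(predictions: dict) -> dict:
    """Summarize how many predictions failed and which ones."""
    failed = find_failed(predictions)
    rate = len(failed) / len(predictions) if predictions else 0.0
    return {
        "total": len(predictions),
        "failed": len(failed),
        "rate": rate,
        "indices": failed,
    }
```

A non-zero failure rate is a signal to re-run inference for the listed indices before spending judge budget on scoring.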
Step 6: Review and Compare Results
Examine evaluation results and compare API model performance against local models or other API models. Results follow the same file format as local model evaluation.
Key considerations:
- Provider rate limits cap throughput; if requests are being throttled, lowering --api-nproc may help
- Re-running with --reuse skips already-completed predictions
- Use scripts/summarize.py to create comparison tables across models
- The OpenVLM Leaderboard provides reference scores for many API models
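When re-running or comparing several models, it can help to assemble the run.py command line programmatically. The sketch below uses only flags named in this workflow plus --data and --model; `build_run_command` is a hypothetical helper, and the dataset name is just an example to replace with your own benchmark.

```python
import shlex


def build_run_command(model, datasets, api_nproc=4, retry=None,
                      judge=None, reuse=False, verbose=False):
    """Assemble the argv list for a run.py API evaluation."""
    cmd = ["python", "run.py", "--data", *datasets, "--model", model,
           "--api-nproc", str(api_nproc)]
    if retry is not None:
        cmd += ["--retry", str(retry)]
    if judge is not None:
        cmd += ["--judge", judge]
    if reuse:
        cmd.append("--reuse")
    if verbose:
        cmd.append("--verbose")
    return cmd


# Example: resume a GPT4o run, reusing predictions that already exist.
# "MMBench_DEV_EN" is an example dataset name; substitute your own.
command = build_run_command("GPT4o", ["MMBench_DEV_EN"],
                            api_nproc=8, reuse=True)
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run` from the VLMEvalKit root, which avoids shell quoting issues with model names that contain dots or dashes.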