Implementation:Intel Ipex llm Vllm Online Benchmark Multimodal
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Multimodal, vLLM |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for benchmarking multimodal vLLM serving endpoints with image and text inputs provided by the IPEX-LLM Docker utilities.
Description
This benchmark tool extends the standard vLLM online benchmark to support vision-language models by including image URLs in the API request payloads. It sends concurrent streaming requests with both text prompts and image references, measuring first-token latency, next-token latency, and throughput for multimodal inference scenarios.
Usage
Use this tool when evaluating the performance of multimodal vLLM endpoints that accept both text and image inputs, such as LLaVA or similar vision-language models deployed on Intel XPU hardware.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: docker/llm/serving/xpu/docker/vllm_online_benchmark_multimodal.py
- Lines: 1-302
Signature
def benchmark(
llm_urls,
model,
prompt,
image_url,
num_requests,
max_concurrent_requests,
max_tokens,
is_warmup=False,
dataset=None,
):
"""Multimodal benchmark orchestrator using thread pool."""
def perform_request(session, url, payload, headers):
"""Execute streaming HTTP request with multimodal content."""
Import
# Standalone benchmark script; run via:
# python vllm_online_benchmark_multimodal.py --model "model" --image-url "url" --prompt "text"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | Model name for API requests |
| prompt | str | No | Text prompt accompanying the image |
| image-url | str | No | URL of the image for multimodal input |
| num-requests | int | No | Number of concurrent requests to send |
| max-concurrent-requests | int | No | Maximum parallel request count |
| max-tokens | int | No | Maximum tokens to generate per request |
Outputs
| Name | Type | Description |
|---|---|---|
| Latency statistics | Console output | Mean, P50, P90, P99 for first-token and next-token latency |
| Throughput metrics | Console output | Total time, requests per second |
Usage Examples
Multimodal Benchmark
python vllm_online_benchmark_multimodal.py \
--model "llava-v1.5-7b" \
--prompt "Describe this image" \
--image-url "https://example.com/image.jpg" \
--num-requests 50 \
--max-concurrent-requests 10 \
--max-tokens 128