Implementation:Intel Ipex llm Vllm Online Benchmark Multimodal

Knowledge Sources	Intel IPEX-LLM
Domains	Benchmarking, Multimodal, vLLM
Last Updated	2026-02-09 04:00 GMT

Overview

Concrete tool for benchmarking multimodal vLLM serving endpoints with image and text inputs provided by the IPEX-LLM Docker utilities.

Description

This benchmark tool extends the standard vLLM online benchmark to support vision-language models by including image URLs in the API request payloads. It sends concurrent streaming requests with both text prompts and image references, measuring first-token latency, next-token latency, and throughput for multimodal inference scenarios.

Usage

Use this tool when evaluating the performance of multimodal vLLM endpoints that accept both text and image inputs, such as LLaVA or similar vision-language models deployed on Intel XPU hardware.

Code Reference

Source Location

Repository: Intel IPEX-LLM
File: docker/llm/serving/xpu/docker/vllm_online_benchmark_multimodal.py
Lines: 1-302

Signature

def benchmark(
    llm_urls,
    model,
    prompt,
    image_url,
    num_requests,
    max_concurrent_requests,
    max_tokens,
    is_warmup=False,
    dataset=None,
):
    """Multimodal benchmark orchestrator using thread pool."""

def perform_request(session, url, payload, headers):
    """Execute streaming HTTP request with multimodal content."""

Import

# Standalone benchmark script; run via:
# python vllm_online_benchmark_multimodal.py --model "model" --image-url "url" --prompt "text"

I/O Contract

Inputs

Name	Type	Required	Description
model	str	Yes	Model name for API requests
prompt	str	No	Text prompt accompanying the image
image-url	str	No	URL of the image for multimodal input
num-requests	int	No	Number of concurrent requests to send
max-concurrent-requests	int	No	Maximum parallel request count
max-tokens	int	No	Maximum tokens to generate per request

Outputs

Name	Type	Description
Latency statistics	Console output	Mean, P50, P90, P99 for first-token and next-token latency
Throughput metrics	Console output	Total time, requests per second

Usage Examples

Multimodal Benchmark

python vllm_online_benchmark_multimodal.py \
    --model "llava-v1.5-7b" \
    --prompt "Describe this image" \
    --image-url "https://example.com/image.jpg" \
    --num-requests 50 \
    --max-concurrent-requests 10 \
    --max-tokens 128

Related Pages

Environment:Intel_Ipex_llm_vLLM_XPU_Serving_Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment