Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Intel Ipex llm Vllm Online Benchmark Multimodal

From Leeroopedia


Knowledge Sources
Domains Benchmarking, Multimodal, vLLM
Last Updated 2026-02-09 04:00 GMT

Overview

Concrete tool for benchmarking multimodal vLLM serving endpoints with image and text inputs provided by the IPEX-LLM Docker utilities.

Description

This benchmark tool extends the standard vLLM online benchmark to support vision-language models by including image URLs in the API request payloads. It sends concurrent streaming requests with both text prompts and image references, measuring first-token latency, next-token latency, and throughput for multimodal inference scenarios.

Usage

Use this tool when evaluating the performance of multimodal vLLM endpoints that accept both text and image inputs, such as LLaVA or similar vision-language models deployed on Intel XPU hardware.

Code Reference

Source Location

Signature

def benchmark(
    llm_urls,
    model,
    prompt,
    image_url,
    num_requests,
    max_concurrent_requests,
    max_tokens,
    is_warmup=False,
    dataset=None,
):
    """Multimodal benchmark orchestrator using thread pool."""

def perform_request(session, url, payload, headers):
    """Execute streaming HTTP request with multimodal content."""

Import

# Standalone benchmark script; run via:
# python vllm_online_benchmark_multimodal.py --model "model" --image-url "url" --prompt "text"

I/O Contract

Inputs

Name Type Required Description
model str Yes Model name for API requests
prompt str No Text prompt accompanying the image
image-url str No URL of the image for multimodal input
num-requests int No Number of concurrent requests to send
max-concurrent-requests int No Maximum parallel request count
max-tokens int No Maximum tokens to generate per request

Outputs

Name Type Description
Latency statistics Console output Mean, P50, P90, P99 for first-token and next-token latency
Throughput metrics Console output Total time, requests per second

Usage Examples

Multimodal Benchmark

python vllm_online_benchmark_multimodal.py \
    --model "llava-v1.5-7b" \
    --prompt "Describe this image" \
    --image-url "https://example.com/image.jpg" \
    --num-requests 50 \
    --max-concurrent-requests 10 \
    --max-tokens 128

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment