Environment:Vllm project Vllm Benchmarks
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Performance_Testing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Benchmarking environment for measuring and evaluating vLLM inference performance, including throughput, latency, time-to-first-token (TTFT), and structured output generation speed across different model configurations and serving scenarios.
Description
This environment provides the runtime dependencies and tooling required to execute vLLM's benchmark suite. The benchmarks cover both offline batch inference and online serving scenarios. The online serving benchmarks use asynchronous HTTP clients to simulate concurrent request loads against a running vLLM API server, measuring key performance indicators such as requests per second, inter-token latency, and end-to-end request latency. The offline benchmarks measure raw throughput and memory utilization for batch inference workloads. Structured output benchmarks specifically evaluate performance when generating JSON or grammar-constrained outputs. Backend request functions abstract the HTTP transport layer to support benchmarking against different serving backends (vLLM, TGI, etc.).
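The serving metrics named above can be derived from per-request timestamps. The following is a minimal sketch of that arithmetic, not the benchmark suite's actual implementation: `RequestTrace`, `summarize`, and the output key names are hypothetical, and it assumes the client records one arrival time per streamed token and that every request produced at least one token.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Timestamps in seconds for one request (hypothetical record type)."""
    sent_at: float            # when the client sent the request
    token_times: list[float]  # arrival time of each streamed output token


def ttft(trace: RequestTrace) -> float:
    """Time-to-first-token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def inter_token_latencies(trace: RequestTrace) -> list[float]:
    """Gaps between consecutive token arrivals."""
    times = trace.token_times
    return [later - earlier for earlier, later in zip(times, times[1:])]


def e2e_latency(trace: RequestTrace) -> float:
    """End-to-end latency: last token arrival minus send time."""
    return trace.token_times[-1] - trace.sent_at


def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    """Aggregate per-request traces into run-level metrics."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    duration = end - start
    gaps = [gap for t in traces for gap in inter_token_latencies(t)]
    return {
        "requests_per_second": len(traces) / duration if duration > 0 else 0.0,
        "mean_ttft_s": mean(ttft(t) for t in traces),
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "mean_e2e_latency_s": mean(e2e_latency(t) for t in traces),
    }
```

For example, `summarize([RequestTrace(0.0, [0.5, 0.75, 1.0]), RequestTrace(0.0, [1.0, 1.5, 2.0])])` covers two requests over a two-second window. Real benchmark runs typically report percentiles as well as means; those are omitted here for brevity.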
Usage
Benchmarks are executed from the benchmarks/ directory in the vLLM repository. The primary scripts are benchmark_serving.py for online serving benchmarks and benchmark_throughput.py for offline throughput measurement. Benchmark datasets (ShareGPT, Sonnet, or synthetic prompts) must be downloaded or generated before a run. Results are emitted as JSON for programmatic analysis and as formatted tables for human consumption.
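Before a run, it can help to confirm that a downloaded dataset file parses and contains usable prompts. The sketch below assumes the commonly distributed ShareGPT layout: a JSON list of entries, each with a "conversations" list of turns that carry a "value" string. That schema and the helper names `extract_prompt_pairs` and `load_prompt_pairs` are assumptions for illustration, not part of the benchmark scripts.

```python
import json


def extract_prompt_pairs(entries, max_samples=None):
    """Return (prompt, reference_completion) pairs from ShareGPT-style entries.

    Assumed schema: each entry is a dict with a "conversations" list whose
    items have a "value" string. Entries with fewer than two turns are skipped.
    """
    pairs = []
    for entry in entries:
        turns = entry.get("conversations", [])
        if len(turns) < 2:
            continue  # no prompt/response pair to benchmark against
        pairs.append((turns[0]["value"], turns[1]["value"]))
        if max_samples is not None and len(pairs) >= max_samples:
            break
    return pairs


def load_prompt_pairs(path, max_samples=None):
    """Read a dataset JSON file from disk and extract prompt pairs."""
    with open(path, encoding="utf-8") as handle:
        return extract_prompt_pairs(json.load(handle), max_samples)
```

Calling `load_prompt_pairs` with the path to the downloaded file and a small `max_samples` gives a quick sanity check that the dataset is readable before starting a longer benchmark.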
Requirements
| Requirement | Value |
|---|---|
| Python | >= 3.10 |
| aiohttp | >= 3.13.3 (async HTTP client for load generation) |
| requests | >= 2.26.0 |
| transformers | >= 4.56.0 |
| numpy | Any version |
| Benchmark Datasets | ShareGPT dataset JSON, or synthetic prompts |
| Running vLLM Server | Required for online serving benchmarks |
| GPU/CPU | Hardware matching the target deployment configuration |
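
The version requirements in the table can be checked before running any benchmark. This is a minimal sketch using only the standard library. The `MINIMUMS` mapping mirrors the table above, while `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helpers. The simplified parser compares numeric components only and treats pre-release suffixes as the corresponding release, which is an assumption that a full version parser would handle more carefully.

```python
import sys
from importlib import metadata

# Minimum versions from the requirements table; None means any version.
MINIMUMS = {
    "aiohttp": "3.13.3",
    "requests": "2.26.0",
    "transformers": "4.56.0",
    "numpy": None,
}


def parse_version(text):
    """Extract leading numeric components, e.g. "4.57.0.dev0" -> (4, 57, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum (numeric parts only)."""
    if minimum is None:
        return True
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so "4.56" and "4.56.0" compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_environment():
    """Map each requirement name to (installed_version_or_None, ok)."""
    report = {"python": (sys.version.split()[0], sys.version_info >= (3, 10))}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        print(f"{name}: {installed or 'not installed'} [{'ok' if ok else 'FAIL'}]")
```

The check covers only the Python-level dependencies. The running vLLM server, the dataset files, and the target hardware still need to be verified separately.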