Implementation:Vllm project Vllm Benchmark Serving Structured Output
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Structured Output, LLM Serving |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Benchmarks online serving throughput and latency for structured output generation (JSON schema, grammar, and regex constrained outputs) across multiple LLM serving backends.
Description
This Python script measures the performance of LLM serving backends when generating structured outputs. It supports four constrained generation modes: JSON schema, JSON-unique (a distinct schema per request), grammar (BNF), and regex. The benchmark sends asynchronous requests at a configurable rate and measures TTFT (time to first token), TPOT (time per output token), ITL (inter-token latency), and E2EL (end-to-end latency). It reports request throughput and goodput (the rate of requests that meet the latency targets given via --goodput), and optionally evaluates output correctness. When --save-result is set, results are written to a JSON file for later analysis.
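To make the four latency metrics concrete, here is a minimal sketch of how they can be derived from per-token arrival times for a single request. The `RequestTrace` record and the `latency_metrics_ms` helper are hypothetical names introduced for illustration; they are not the script's own data structures, and the script's exact formulas may differ.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request record (not the script's RequestFuncOutput)."""
    start: float               # time the request was sent, in seconds
    token_times: list[float]   # arrival time of each output token, in seconds


def latency_metrics_ms(trace: RequestTrace) -> dict[str, float]:
    """Return TTFT, TPOT, mean ITL, and E2EL in milliseconds for one request."""
    first, last = trace.token_times[0], trace.token_times[-1]
    n_tokens = len(trace.token_times)
    ttft = first - trace.start   # time to first token
    e2el = last - trace.start    # end-to-end latency
    # TPOT spreads the decode time over the tokens after the first one.
    tpot = (e2el - ttft) / (n_tokens - 1) if n_tokens > 1 else 0.0
    # ITL is the gap between consecutive tokens; this sketch reports the mean.
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    itl = statistics.mean(gaps) if gaps else 0.0
    seconds = {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}
    return {name: value * 1000.0 for name, value in seconds.items()}


trace = RequestTrace(start=0.0, token_times=[0.05, 0.07, 0.09, 0.11])
metrics = latency_metrics_ms(trace)
```

Aggregating these per-request values with a mean, median, standard deviation, and percentiles would yield statistics of the kind stored in the `BenchmarkMetrics` fields.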
Usage
Start a vLLM server (or other compatible backend) on one machine, then run this script as a client to benchmark structured output performance. The script supports multiple backends via the --backend flag and offers fine-grained control over structured output ratio, request rate, number of prompts, output length, and schema configuration. It imports request functions from backend_request_func.py.
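The structured output ratio can be read as "which fraction of the requests carry a constraint". The sketch below shows one plausible way to choose those requests by random sampling. The helper name `pick_structured_indices` and the seeding scheme are assumptions made for illustration, not a description of how the script selects requests.

```python
import random


def pick_structured_indices(num_requests: int, ratio: float, seed: int = 0) -> set[int]:
    """Choose which request indices receive a structured output constraint.

    Hypothetical helper: samples int(num_requests * ratio) indices without
    replacement, so ratio=1.0 constrains every request and ratio=0.0 none.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return set(rng.sample(range(num_requests), int(num_requests * ratio)))


constrained = pick_structured_indices(num_requests=10, ratio=0.5)
```

Requests whose index is in `constrained` would be sent with a schema, grammar, or regex; the rest would be sent as ordinary unconstrained completions.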
Code Reference
Source Location
- Repository: vllm
- File: benchmarks/benchmark_serving_structured_output.py
- Lines: 1-1040
Signature
@dataclass
class BenchmarkMetrics:
    completed: int
    total_input: int
    total_output: int
    request_throughput: float
    request_goodput: float
    output_throughput: float
    total_token_throughput: float
    mean_ttft_ms: float
    median_ttft_ms: float
    std_ttft_ms: float
    percentiles_ttft_ms: list[tuple[float, float]]
    mean_tpot_ms: float
    median_tpot_ms: float
    std_tpot_ms: float
    percentiles_tpot_ms: list[tuple[float, float]]
    mean_itl_ms: float
    median_itl_ms: float
    std_itl_ms: float
    percentiles_itl_ms: list[tuple[float, float]]
    mean_e2el_ms: float
    median_e2el_ms: float
    std_e2el_ms: float
    percentiles_e2el_ms: list[tuple[float, float]]

@dataclass
class SampleRequest:
    prompt: str
    prompt_len: int
    expected_output_len: int
    schema: dict
    structure_type: str
    completion: str = None

def sample_requests(tokenizer: PreTrainedTokenizerBase, args: argparse.Namespace) -> list[SampleRequest]
async def get_request(input_requests: list[SampleRequest], request_rate: float) -> AsyncGenerator[SampleRequest, None]
async def benchmark(backend: str, api_url: str, ...) -> BenchmarkMetrics
def evaluate(args: argparse.Namespace, output_list: list[RequestFuncOutput], ...)
def main(args: argparse.Namespace)
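The `get_request` signature suggests an async generator that paces requests at `request_rate`. A common way to do this is to draw exponentially distributed gaps between requests, which models Poisson arrivals. The sketch below illustrates that technique with plain strings; the name `paced` and the use of `random.expovariate` are assumptions for illustration, and the script's own pacing logic may differ.

```python
import asyncio
import random
from collections.abc import AsyncGenerator


async def paced(
    items: list[str], request_rate: float, rng: random.Random
) -> AsyncGenerator[str, None]:
    """Yield items with exponentially distributed gaps (Poisson arrivals).

    Hypothetical stand-in for a request generator: a rate of inf yields
    every item immediately, with no waiting in between.
    """
    for item in items:
        yield item
        if request_rate == float("inf"):
            continue  # no pacing: move straight to the next item
        # The mean gap between requests is 1 / request_rate seconds.
        await asyncio.sleep(rng.expovariate(request_rate))


async def _demo() -> list[str]:
    rng = random.Random(0)
    return [item async for item in paced(["a", "b", "c"], 1000.0, rng)]


received = asyncio.run(_demo())
```

With this design the caller can start one task per yielded item, so slow responses do not delay the sending of later requests.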
Import
# This is a standalone executable script.
# Run from the command line:
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model <your_model> \
    --dataset json \
    --structured-output-ratio 1.0 \
    --request-rate 10 \
    --num-prompts 1000
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --backend | str | Yes | Backend type: vllm, tgi, openai, openai-chat, lmdeploy, sglang, etc. |
| --model | str | Yes | Model name or path served by the backend |
| --dataset | str | Yes | Dataset type: json, json-unique, grammar, or regex |
| --structured-output-ratio | float | No | Fraction of requests with structured output constraints (default: 1.0) |
| --request-rate | float | No | Target request rate in requests per second (default: inf, which sends all requests without pacing) |
| --num-prompts | int | No | Number of prompts to benchmark |
| --output-len | int | No | Maximum output length in tokens |
| --json-schema-path | str | No | Path to custom JSON schema file |
| --structure-type | str | No | Structure type for guided decoding (e.g., json_object, json_schema) |
| --host | str | No | Server hostname (default: localhost) |
| --port | int | No | Server port (default: 8000) |
| --save-result | bool | No | Whether to save results to a JSON file |
| --result-dir | str | No | Directory for saving result files |
| --goodput | list[str] | No | Goodput criteria (e.g., "ttft:100" for TTFT under 100ms) |
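
The --goodput criteria take the form "metric:limit", as in "ttft:100". The sketch below shows one way such pairs could be parsed and applied to per-request latencies to count the requests that meet every target. The function names and the per-request dictionaries are hypothetical, and treating the limit as inclusive is an assumption.

```python
def parse_goodput(pairs: list[str]) -> dict[str, float]:
    """Parse "metric:limit_ms" strings such as "ttft:100" into a dict."""
    slos: dict[str, float] = {}
    for pair in pairs:
        name, sep, value = pair.partition(":")
        if not sep or not name:
            raise ValueError(f"expected KEY:VALUE, got {pair!r}")
        slos[name] = float(value)
    return slos


def count_good(requests: list[dict[str, float]], slos: dict[str, float]) -> int:
    """Count requests whose latency is within the limit for every metric."""
    return sum(
        all(req[name] <= limit for name, limit in slos.items())
        for req in requests
    )


slos = parse_goodput(["ttft:100", "e2el:2000"])
# Hypothetical per-request latencies in milliseconds.
latencies = [
    {"ttft": 80.0, "e2el": 1500.0},   # meets both targets
    {"ttft": 120.0, "e2el": 1500.0},  # TTFT too slow
    {"ttft": 90.0, "e2el": 2500.0},   # end-to-end latency too slow
]
good = count_good(latencies, slos)
```

Dividing `good` by the benchmark duration in seconds would give a goodput figure in requests per second.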
Outputs
| Name | Type | Description |
|---|---|---|
| BenchmarkMetrics | dataclass | Comprehensive metrics including throughput, TTFT, TPOT, ITL, E2EL statistics |
| Console output | stdout | Formatted benchmark results printed to terminal |
| JSON results | .json file | Detailed results saved to file when --save-result is enabled |
| Evaluation report | stdout | Optional correctness evaluation of structured output (JSON validity check) |
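
As a rough illustration of a JSON validity check, the sketch below parses each generated output and tests whether it is a JSON object containing a set of required keys. It uses only the standard library and does not perform full JSON Schema validation. The helper name and the sample outputs are made up for illustration and do not reflect the script's `evaluate` function.

```python
import json


def is_valid_json_object(text: str, required_keys: tuple[str, ...] = ()) -> bool:
    """Return True if text parses as a JSON object with all required keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(key in obj for key in required_keys)


# Hypothetical model outputs: one valid, one truncated, one not JSON at all.
outputs = ['{"name": "a", "age": 3}', '{"name": "b"', "not json"]
num_valid = sum(is_valid_json_object(out, ("name",)) for out in outputs)
accuracy = num_valid / len(outputs)
```

A stricter check could validate against the full schema with a dedicated validator library, at the cost of an extra dependency.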
Usage Examples
# Benchmark JSON schema-constrained output with vLLM backend:
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset json \
    --structured-output-ratio 1.0 \
    --request-rate 10 \
    --num-prompts 1000

# Benchmark grammar-constrained output (SQL generation):
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset grammar \
    --num-prompts 500

# Benchmark with unique JSON schemas per request:
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset json-unique \
    --num-prompts 200