Implementation:Vllm project Vllm Benchmark Serving Structured Output

Knowledge Sources	vllm
Domains	Benchmarking, Structured Output, LLM Serving
Last Updated	2026-02-08 00:00 GMT

Overview

Benchmarks online serving throughput and latency for structured output generation (JSON schema, grammar, and regex constrained outputs) across multiple LLM serving backends.

Description

This Python script measures how LLM serving backends perform when generating structured outputs. It supports four constrained generation modes: JSON schema, JSON-unique (a distinct schema per request), grammar (BNF), and regex. The benchmark sends asynchronous requests at a configurable rate and records TTFT (time to first token), TPOT (time per output token), ITL (inter-token latency), and E2EL (end-to-end latency). From these it computes request throughput and goodput (the rate of requests that meet the latency targets given via --goodput), and it can optionally evaluate output correctness. When --save-result is set, results are written to a JSON file for later analysis.
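
The four latency metrics named above can be derived from per-token timestamps. The following is a minimal sketch, not the script's own code: the function names latency_metrics and summarize are hypothetical, and it assumes one timestamp per output token, which may not hold when a streamed chunk carries several tokens.

```python
import statistics


def latency_metrics(send_time, token_times):
    """Derive per-request latency metrics in milliseconds.

    send_time and token_times are wall-clock timestamps in seconds.
    Assumption: token_times holds exactly one entry per output token.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - send_time    # time to first token
    e2el = token_times[-1] - send_time   # end-to-end latency
    # Inter-token latency: gap between each pair of consecutive tokens.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    n_out = len(token_times)
    # Time per output token excludes the first token (already covered by TTFT).
    tpot = (e2el - ttft) / (n_out - 1) if n_out > 1 else 0.0
    return {
        "ttft_ms": ttft * 1000,
        "tpot_ms": tpot * 1000,
        "itl_ms": [x * 1000 for x in itl],
        "e2el_ms": e2el * 1000,
    }


def summarize(values_ms):
    """Aggregate one metric across requests (mean, median, population std)."""
    return {
        "mean": statistics.mean(values_ms),
        "median": statistics.median(values_ms),
        "std": statistics.pstdev(values_ms),
    }
```

Calling summarize once per metric yields the kind of mean, median, and standard deviation fields that BenchmarkMetrics stores for TTFT, TPOT, ITL, and E2EL.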

Usage

Start a vLLM server (or other compatible backend) on one machine, then run this script as a client to benchmark structured output performance. The script supports multiple backends via the --backend flag and offers fine-grained control over structured output ratio, request rate, number of prompts, output length, and schema configuration. It imports request functions from backend_request_func.py.
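
To illustrate what the structured output ratio controls, the sketch below picks which requests in a batch would carry a constraint. It is an assumption about one reasonable approach, not a copy of the script's logic; the helper name pick_structured_indices and the seed parameter are hypothetical.

```python
import random


def pick_structured_indices(num_requests, ratio, seed=0):
    """Return the indices of requests that should carry a structured constraint.

    ratio is the fraction of requests to constrain (0.0 to 1.0).
    Assumption: the count is rounded down and indices are sampled uniformly.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so runs are reproducible
    count = int(num_requests * ratio)
    return set(rng.sample(range(num_requests), count))
```

With a ratio of 1.0 every request is constrained; with 0.5, half of them are, and the rest would be sent as unconstrained generation.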

Code Reference

Source Location

Repository: vllm
File: benchmarks/benchmark_serving_structured_output.py
Lines: 1-1040

Signature

@dataclass
class BenchmarkMetrics:
    completed: int
    total_input: int
    total_output: int
    request_throughput: float
    request_goodput: float
    output_throughput: float
    total_token_throughput: float
    mean_ttft_ms: float
    median_ttft_ms: float
    std_ttft_ms: float
    percentiles_ttft_ms: list[tuple[float, float]]
    mean_tpot_ms: float
    median_tpot_ms: float
    std_tpot_ms: float
    percentiles_tpot_ms: list[tuple[float, float]]
    mean_itl_ms: float
    median_itl_ms: float
    std_itl_ms: float
    percentiles_itl_ms: list[tuple[float, float]]
    mean_e2el_ms: float
    median_e2el_ms: float
    std_e2el_ms: float
    percentiles_e2el_ms: list[tuple[float, float]]

@dataclass
class SampleRequest:
    prompt: str
    prompt_len: int
    expected_output_len: int
    schema: dict
    structure_type: str
    completion: str = None

def sample_requests(tokenizer: PreTrainedTokenizerBase, args: argparse.Namespace) -> list[SampleRequest]
async def get_request(input_requests: list[SampleRequest], request_rate: float) -> AsyncGenerator[SampleRequest, None]
async def benchmark(backend: str, api_url: str, ...) -> BenchmarkMetrics
def evaluate(args: argparse.Namespace, output_list: list[RequestFuncOutput], ...)
def main(args: argparse.Namespace)
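
The get_request signature above is an async generator that paces requests at a target rate. The sketch below shows one way such pacing can work, assuming exponentially distributed gaps between requests (Poisson arrivals). The names paced_requests and collect are hypothetical, and the script's own body may differ.

```python
import asyncio
import random
from typing import AsyncGenerator, Optional, Sequence, TypeVar

T = TypeVar("T")


async def paced_requests(
    requests: Sequence[T],
    request_rate: float,
    rng: Optional[random.Random] = None,
) -> AsyncGenerator[T, None]:
    """Yield requests one at a time, sleeping between them.

    request_rate is in requests per second; float("inf") disables pacing.
    Assumption: inter-arrival gaps are exponential with mean 1 / request_rate.
    """
    rng = rng or random.Random()
    for request in requests:
        yield request
        if request_rate == float("inf"):
            continue  # send everything back to back
        await asyncio.sleep(rng.expovariate(request_rate))


async def collect(items: Sequence[T], request_rate: float) -> list:
    """Drain the generator into a list (used here only for demonstration)."""
    return [item async for item in paced_requests(items, request_rate)]
```

For example, asyncio.run(collect(["a", "b", "c"], float("inf"))) returns the three items immediately, while a finite rate spreads them out over time.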

Import

# This is a standalone executable script.
# Run from the command line:
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model <your_model> \
    --dataset json \
    --structured-output-ratio 1.0 \
    --request-rate 10 \
    --num-prompts 1000

I/O Contract

Inputs

Name	Type	Required	Description
--backend	str	Yes	Backend type: vllm, tgi, openai, openai-chat, lmdeploy, sglang, etc.
--model	str	Yes	Model name or path served by the backend
--dataset	str	Yes	Dataset type: json, json-unique, grammar, or regex
--structured-output-ratio	float	No	Fraction of requests with structured output constraints (default: 1.0)
--request-rate	float	No	Target request rate in requests/second (default: inf for max throughput)
--num-prompts	int	No	Number of prompts to benchmark
--output-len	int	No	Maximum output length in tokens
--json-schema-path	str	No	Path to custom JSON schema file
--structure-type	str	No	Structure type for guided decoding (e.g., json_object, json_schema)
--host	str	No	Server hostname (default: localhost)
--port	int	No	Server port (default: 8000)
--save-result	bool	No	Whether to save results to a JSON file
--result-dir	str	No	Directory for saving result files
--goodput	list[str]	No	Goodput criteria (e.g., "ttft:100" for TTFT under 100ms)
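
The --goodput criteria take the form "metric:limit", as in "ttft:100". The sketch below shows how such strings could be parsed and applied to per-request latencies. The helper names are hypothetical, and treating a request exactly at the limit as passing is an assumption.

```python
def parse_goodput(specs):
    """Turn ["ttft:100", "tpot:50"] into {"ttft": 100.0, "tpot": 50.0}."""
    slos = {}
    for spec in specs:
        name, sep, value = spec.partition(":")
        if not sep or not name:
            raise ValueError(f"expected 'metric:limit_ms', got {spec!r}")
        slos[name] = float(value)
    return slos


def count_good_requests(per_request_ms, slos):
    """Count requests whose latencies meet every limit in slos.

    per_request_ms is a list of dicts such as {"ttft": 80.0, "tpot": 40.0}.
    Assumption: a value equal to the limit counts as meeting it.
    """
    return sum(
        all(request[name] <= limit for name, limit in slos.items())
        for request in per_request_ms
    )
```

Dividing the resulting count by the benchmark duration in seconds gives a goodput figure in requests per second.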

Outputs

Name	Type	Description
BenchmarkMetrics	dataclass	Comprehensive metrics including throughput, TTFT, TPOT, ITL, E2EL statistics
Console output	stdout	Formatted benchmark results printed to terminal
JSON results	.json file	Detailed results saved to file when --save-result is enabled
Evaluation report	stdout	Optional correctness evaluation of structured output (JSON validity check)
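
The JSON validity check mentioned for the evaluation report can be approximated with the standard library alone. This is a minimal sketch with hypothetical function names. It only tests that each output parses as JSON and does not verify conformance to a schema.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers non-string inputs such as None (a failed request).
        return False
    return True


def json_validity_rate(outputs):
    """Fraction of outputs that parse as JSON; 0.0 for an empty list."""
    if not outputs:
        return 0.0
    return sum(is_valid_json(output) for output in outputs) / len(outputs)
```

A stricter evaluation would also validate each parsed object against the schema that constrained the request.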

Usage Examples

# Benchmark JSON schema-constrained output with vLLM backend:
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset json \
    --structured-output-ratio 1.0 \
    --request-rate 10 \
    --num-prompts 1000

# Benchmark grammar-constrained output (SQL generation):
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset grammar \
    --num-prompts 500

# Benchmark with unique JSON schemas per request:
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset json-unique \
    --num-prompts 200

Related Pages

Environment:Vllm_project_Vllm_Benchmarks
