Implementation:Vllm project Vllm Benchmark Serving Structured Output

From Leeroopedia


Knowledge Sources
Domains: Benchmarking, Structured Output, LLM Serving
Last Updated: 2026-02-08 00:00 GMT

Overview

Benchmarks online serving throughput and latency for structured output generation (JSON schema, grammar, and regex constrained outputs) across multiple LLM serving backends.

Description

This Python script measures how LLM serving backends perform when generating structured outputs. It supports four constrained-generation modes: JSON schema, JSON-unique (a distinct schema per request), grammar (BNF), and regex. The benchmark sends asynchronous requests at a configurable rate and records TTFT (time to first token), TPOT (time per output token), ITL (inter-token latency), and E2EL (end-to-end latency). From these it computes request throughput and goodput (the rate of requests that meet the latency targets given with --goodput), and it can optionally evaluate output correctness. Results can be saved to JSON for analysis.
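As an illustration of how the four latency metrics relate to each other, the sketch below derives them from per-token arrival timestamps. It is not the script's code: `RequestTrace` and `latency_metrics` are hypothetical names, and the TPOT formula (decode time divided by the number of output tokens minus one) is an assumption based on the common convention.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (all times in seconds)."""
    start: float              # when the request was sent
    token_times: list[float]  # arrival time of each output token


def latency_metrics(trace: RequestTrace) -> dict:
    """Derive TTFT, TPOT, ITL and E2EL (in milliseconds) from one trace."""
    if not trace.token_times:
        raise ValueError("trace has no output tokens")
    times = trace.token_times
    ttft = times[0] - trace.start   # time to first token
    e2el = times[-1] - trace.start  # end-to-end latency
    # ITL: gap between each pair of consecutive tokens.
    itl = [later - earlier for earlier, later in zip(times, times[1:])]
    # TPOT (assumed definition): decode time averaged over tokens after the first.
    tpot = (e2el - ttft) / (len(times) - 1) if len(times) > 1 else 0.0
    to_ms = 1000.0
    return {
        "ttft_ms": ttft * to_ms,
        "tpot_ms": tpot * to_ms,
        "itl_ms": [gap * to_ms for gap in itl],
        "e2el_ms": e2el * to_ms,
    }
```

For a single request ITL is a list of gaps while TPOT is their average, which is why the two are reported separately once many requests are aggregated.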

Usage

Start a vLLM server (or other compatible backend) on one machine, then run this script as a client to benchmark structured output performance. The script supports multiple backends via the --backend flag and offers fine-grained control over structured output ratio, request rate, number of prompts, output length, and schema configuration. It imports request functions from backend_request_func.py.
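Client-side request pacing can be sketched as an async generator that sleeps between dispatches. The version below assumes Poisson arrivals (exponentially distributed gaps with mean 1 / request_rate) and treats an infinite rate as "send everything immediately". `paced` and `collect` are hypothetical names, and the script's own `get_request` may pace requests differently.

```python
import asyncio
import math
import random
from collections.abc import AsyncGenerator, Iterable
from typing import Optional, TypeVar

T = TypeVar("T")


async def paced(
    items: Iterable[T],
    request_rate: float,
    rng: Optional[random.Random] = None,
) -> AsyncGenerator[T, None]:
    """Yield items with random gaps so arrivals average `request_rate` per second."""
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    rng = rng or random.Random()
    for item in items:
        yield item
        if math.isinf(request_rate):
            continue  # no pacing: hand out the next request immediately
        # Exponential gaps produce a Poisson arrival process (assumed model).
        await asyncio.sleep(rng.expovariate(request_rate))


async def collect(items: Iterable[T], request_rate: float) -> list[T]:
    """Drain the generator; a real client would launch a request per item."""
    return [item async for item in paced(items, request_rate)]
```

With `request_rate=float("inf")` the generator never sleeps, which corresponds to the maximum-throughput setting described above.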

Code Reference

Source Location

Signature

@dataclass
class BenchmarkMetrics:
    completed: int
    total_input: int
    total_output: int
    request_throughput: float
    request_goodput: float
    output_throughput: float
    total_token_throughput: float
    mean_ttft_ms: float
    median_ttft_ms: float
    std_ttft_ms: float
    percentiles_ttft_ms: list[tuple[float, float]]
    mean_tpot_ms: float
    median_tpot_ms: float
    std_tpot_ms: float
    percentiles_tpot_ms: list[tuple[float, float]]
    mean_itl_ms: float
    median_itl_ms: float
    std_itl_ms: float
    percentiles_itl_ms: list[tuple[float, float]]
    mean_e2el_ms: float
    median_e2el_ms: float
    std_e2el_ms: float
    percentiles_e2el_ms: list[tuple[float, float]]

@dataclass
class SampleRequest:
    prompt: str
    prompt_len: int
    expected_output_len: int
    schema: dict
    structure_type: str
    completion: str = None

def sample_requests(tokenizer: PreTrainedTokenizerBase, args: argparse.Namespace) -> list[SampleRequest]
async def get_request(input_requests: list[SampleRequest], request_rate: float) -> AsyncGenerator[SampleRequest, None]
async def benchmark(backend: str, api_url: str, ...) -> BenchmarkMetrics
def evaluate(args: argparse.Namespace, output_list: list[RequestFuncOutput], ...)
def main(args: argparse.Namespace)
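Each metric in `BenchmarkMetrics` is stored as a mean, median, standard deviation, and a list of (percentile, value) pairs. The sketch below shows one way to produce those four summaries from a list of per-request values. `summarize` and `percentile` are hypothetical helpers, and the use of population standard deviation and linear-interpolation percentiles is an assumption, not a statement about the script's implementation.

```python
import math
import statistics


def percentile(ordered: list[float], p: float) -> float:
    """Linear-interpolation percentile of an already sorted, non-empty list."""
    rank = (len(ordered) - 1) * p / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(values_ms: list[float], percentiles: list[float]) -> dict:
    """Return mean, median, std and (percentile, value) pairs for one metric."""
    if not values_ms:
        raise ValueError("no samples to summarize")
    ordered = sorted(values_ms)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "std": statistics.pstdev(ordered),  # population std (assumed choice)
        "percentiles": [(p, percentile(ordered, p)) for p in percentiles],
    }
```

The returned "percentiles" entry has the same list[tuple[float, float]] shape as the `percentiles_*_ms` fields in the signature above.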

Import

# This is a standalone executable script.
# Run from the command line:
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model <your_model> \
    --dataset json \
    --structured-output-ratio 1.0 \
    --request-rate 10 \
    --num-prompts 1000
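The --structured-output-ratio flag in the command above sets the fraction of requests that carry a constraint. A minimal sketch of how such a ratio could be turned into a concrete selection of request indices follows. `pick_constrained` is a hypothetical helper, and truncating the count toward zero is an assumption.

```python
import random


def pick_constrained(num_requests: int, ratio: float, seed: int = 0) -> set[int]:
    """Choose which request indices get a structured-output constraint."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)  # seeded so repeated runs pick the same requests
    count = int(num_requests * ratio)
    return set(rng.sample(range(num_requests), count))
```

A ratio of 1.0 selects every request and 0.0 selects none; values in between mix constrained and unconstrained traffic in one run.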

I/O Contract

Inputs

Name                       Type       Required  Description
--backend                  str        Yes       Backend type: vllm, tgi, openai, openai-chat, lmdeploy, sglang, etc.
--model                    str        Yes       Model name or path served by the backend
--dataset                  str        Yes       Dataset type: json, json-unique, grammar, or regex
--structured-output-ratio  float      No        Fraction of requests with structured output constraints (default: 1.0)
--request-rate             float      No        Target request rate in requests/second (default: inf for max throughput)
--num-prompts              int        No        Number of prompts to benchmark
--output-len               int        No        Maximum output length in tokens
--json-schema-path         str        No        Path to custom JSON schema file
--structure-type           str        No        Structure type for guided decoding (e.g., json_object, json_schema)
--host                     str        No        Server hostname (default: localhost)
--port                     int        No        Server port (default: 8000)
--save-result              bool       No        Whether to save results to a JSON file
--result-dir               str        No        Directory for saving result files
--goodput                  list[str]  No        Goodput criteria (e.g., "ttft:100" for TTFT under 100 ms)
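To make the goodput criteria concrete, the sketch below parses "name:milliseconds" strings such as "ttft:100" and counts the requests that satisfy every target. `parse_goodput` and `goodput` are hypothetical names, the "e2el" key used in the usage note is assumed by analogy with "ttft", and treating a value exactly at the limit as passing is also an assumption.

```python
def parse_goodput(specs: list[str]) -> dict[str, float]:
    """Parse strings like "ttft:100" into {"ttft": 100.0} (limits in ms)."""
    targets: dict[str, float] = {}
    for spec in specs:
        name, sep, value = spec.partition(":")
        if not sep or not name:
            raise ValueError(f"expected NAME:MILLISECONDS, got {spec!r}")
        targets[name] = float(value)
    return targets


def goodput(
    requests: list[dict[str, float]],
    targets: dict[str, float],
    duration_s: float,
) -> float:
    """Requests per second that meet every latency target."""
    good = sum(
        all(request[metric] <= limit for metric, limit in targets.items())
        for request in requests
    )
    return good / duration_s
```

For example, with targets parsed from ["ttft:100", "e2el:2000"], a request with a 120 ms TTFT is excluded even if its end-to-end latency is within bounds.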

Outputs

Name               Type        Description
BenchmarkMetrics   dataclass   Comprehensive metrics including throughput, TTFT, TPOT, ITL, E2EL statistics
Console output     stdout      Formatted benchmark results printed to terminal
JSON results       .json file  Detailed results saved to file when --save-result is enabled
Evaluation report  stdout      Optional correctness evaluation of structured output (JSON validity check)
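A JSON validity check of the kind listed in the evaluation report can be sketched with the standard library alone. `json_validity_rate` is a hypothetical helper. It only tests whether each output parses as JSON and does not verify conformance to a schema, which the script's own `evaluate` function may or may not do.

```python
import json


def json_validity_rate(outputs: list[str]) -> float:
    """Fraction of generated strings that parse as JSON (syntax only)."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            continue  # truncated or malformed output counts as invalid
        valid += 1
    return valid / len(outputs)
```

Checking against the requested schema would need a validator on top of this, for instance a JSON Schema library, which is outside the scope of this sketch.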

Usage Examples

# Benchmark JSON schema-constrained output with vLLM backend:
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset json \
    --structured-output-ratio 1.0 \
    --request-rate 10 \
    --num-prompts 1000

# Benchmark grammar-constrained output (SQL generation):
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset grammar \
    --num-prompts 500

# Benchmark with unique JSON schemas per request:
python benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset json-unique \
    --num-prompts 200
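One way to obtain a unique JSON schema per request, as in the json-unique mode, is to copy a base schema and add a uniquely named optional property to each copy. The sketch below is an assumption about how this could be done, not the script's implementation. `make_unique_schemas` and the `extra_` property prefix are hypothetical, and the stated motivation (preventing a backend from reusing a cached compiled schema) is likewise assumed.

```python
import copy
import uuid


def make_unique_schemas(base_schema: dict, count: int) -> list[dict]:
    """Return `count` copies of `base_schema`, each with one extra property."""
    schemas = []
    for _ in range(count):
        schema = copy.deepcopy(base_schema)  # never mutate the shared base
        # A uniquely named optional property makes every schema distinct, so
        # a cache keyed on the schema cannot serve an earlier entry.
        unique_name = f"extra_{uuid.uuid4().hex}"
        schema.setdefault("properties", {})[unique_name] = {"type": "string"}
        schemas.append(schema)
    return schemas
```

Because the added property is not listed under "required", outputs that satisfy the base schema still satisfy each derived one.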
