Implementation:Vllm project Vllm Benchmark Batch Invariance
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Performance Testing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Measures the performance overhead of vLLM's VLLM_BATCH_INVARIANT mode by comparing throughput and latency with and without batch invariance enabled.
Description
This Python script benchmarks the impact of the VLLM_BATCH_INVARIANT environment variable, which ensures deterministic outputs regardless of batch composition. It runs the same workload twice: once as a baseline (VLLM_BATCH_INVARIANT=0) and once with batch invariance enabled (VLLM_BATCH_INVARIANT=1). The benchmark generates random prompts, executes multiple trials with varying batch sizes, and reports initialization time, average trial time, throughput (tokens/s), and prompts per second, along with a percentage comparison between modes.
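The percentage comparison between the two modes can be sketched as follows. This is a minimal illustration, not the script's own code: the helper names `percent_change` and `compare_modes` are hypothetical, the metric keys are borrowed from the Outputs table, and the numbers are made up for the example rather than measured.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def compare_modes(baseline: dict, invariant: dict) -> dict:
    """Percentage difference for each metric present in both result dicts."""
    keys = ("init_time", "avg_time", "throughput", "prompts_per_sec")
    return {k: percent_change(baseline[k], invariant[k]) for k in keys}


# Illustrative numbers only; these are NOT measured results.
baseline = {"init_time": 20.0, "avg_time": 4.0,
            "throughput": 4000.0, "prompts_per_sec": 32.0}
invariant = {"init_time": 22.0, "avg_time": 5.0,
             "throughput": 3200.0, "prompts_per_sec": 25.6}

report = compare_modes(baseline, invariant)
for name, pct in report.items():
    print(f"{name}: {pct:+.1f}%")
```

For time metrics a positive percentage means the batch-invariant run was slower; for throughput metrics a negative percentage means it processed less work per second.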
Usage
Run this script directly on a machine with an NVIDIA Hopper (SM90) or newer GPU; the script exits with code 1 if the platform requirement is not met. All configuration is supplied through environment variables such as VLLM_BENCH_MODEL, VLLM_BENCH_TP_SIZE, and VLLM_BENCH_BATCH_SIZE; the Inputs table lists every variable and its default. Developers use the script to check that batch-invariant mode does not introduce unacceptable performance overhead.
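The SM90-or-newer requirement amounts to a compute-capability comparison. The sketch below shows one way to express that check; the names `MIN_CAPABILITY`, `meets_platform_requirement`, and `exit_code_for` are hypothetical and the script's actual check may be implemented differently.

```python
from typing import Tuple

# Assumption: "Hopper (SM90) or newer" means compute capability >= 9.0.
MIN_CAPABILITY = (9, 0)


def meets_platform_requirement(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability is SM90 or newer."""
    return tuple(capability) >= MIN_CAPABILITY


def exit_code_for(capability: Tuple[int, int]) -> int:
    """Map the check to the documented exit codes: 0 on success, 1 otherwise."""
    return 0 if meets_platform_requirement(capability) else 1


for name, cap in [("A100", (8, 0)), ("H100", (9, 0)), ("B200", (10, 0))]:
    print(name, cap, "ok" if meets_platform_requirement(cap) else "unsupported")
```

On a machine with PyTorch and CUDA available, the tuple could be obtained from `torch.cuda.get_device_capability()`, which returns `(major, minor)`.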
Code Reference
Source Location
- Repository: vllm
- File: benchmarks/benchmark_batch_invariance.py
- Lines: 1-380
Signature
def _random_prompt(min_words: int = 1024, max_words: int = 1024 * 2) -> str:
    """Generate a random prompt for benchmarking."""

def run_benchmark_with_batch_invariant(
    model: str,
    tp_size: int,
    max_batch_size: int,
    num_trials: int,
    min_prompt: int,
    max_prompt: int,
    max_tokens: int,
    temperature: float,
    gpu_mem_util: float,
    max_model_len: int,
    backend: str,
    batch_invariant: bool,
    seed: int = 12345,
) -> dict:
    """Run the benchmark with the specified configuration."""

def main():
    """Entry point: reads env vars, runs both modes, and compares results."""
Import
# This is a standalone executable script.
# Run directly from the command line:
python benchmarks/benchmark_batch_invariance.py
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| VLLM_BENCH_MODEL | env var (str) | No | Model to benchmark (default: "Qwen/Qwen3-1.7B") |
| VLLM_BENCH_TP_SIZE | env var (int) | No | Tensor parallel size (default: 1, use 8 for DeepSeek) |
| VLLM_BENCH_BATCH_SIZE | env var (int) | No | Maximum batch size (default: 128) |
| VLLM_BENCH_NUM_TRIALS | env var (int) | No | Number of trials to run (default: 5) |
| VLLM_BENCH_MIN_PROMPT | env var (int) | No | Minimum prompt length in words (default: 1024) |
| VLLM_BENCH_MAX_PROMPT | env var (int) | No | Maximum prompt length in words (default: 2048) |
| VLLM_BENCH_MAX_TOKENS | env var (int) | No | Maximum tokens to generate (default: 128) |
| VLLM_BENCH_TEMPERATURE | env var (float) | No | Sampling temperature (default: 0.0) |
| VLLM_BENCH_GPU_MEMORY_UTILIZATION | env var (float) | No | GPU memory utilization fraction (default: 0.4) |
| VLLM_BENCH_MAX_MODEL_LEN | env var (int) | No | Maximum model sequence length (default: 5120) |
| VLLM_BENCH_BACKEND | env var (str) | No | Attention backend (default: FLASH_ATTN) |
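The table above can be read as a mapping from environment variables to typed values with defaults. The sketch below shows one way to parse them; `read_bench_config` is a hypothetical helper, not a function from the script, and the dictionary keys are chosen to mirror the parameter names of `run_benchmark_with_batch_invariant`.

```python
import os
from typing import Mapping, Optional


def read_bench_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Parse the VLLM_BENCH_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "model": env.get("VLLM_BENCH_MODEL", "Qwen/Qwen3-1.7B"),
        "tp_size": int(env.get("VLLM_BENCH_TP_SIZE", "1")),
        "max_batch_size": int(env.get("VLLM_BENCH_BATCH_SIZE", "128")),
        "num_trials": int(env.get("VLLM_BENCH_NUM_TRIALS", "5")),
        "min_prompt": int(env.get("VLLM_BENCH_MIN_PROMPT", "1024")),
        "max_prompt": int(env.get("VLLM_BENCH_MAX_PROMPT", "2048")),
        "max_tokens": int(env.get("VLLM_BENCH_MAX_TOKENS", "128")),
        "temperature": float(env.get("VLLM_BENCH_TEMPERATURE", "0.0")),
        "gpu_mem_util": float(env.get("VLLM_BENCH_GPU_MEMORY_UTILIZATION", "0.4")),
        "max_model_len": int(env.get("VLLM_BENCH_MAX_MODEL_LEN", "5120")),
        "backend": env.get("VLLM_BENCH_BACKEND", "FLASH_ATTN"),
    }


# Passing an explicit mapping keeps the example independent of the real environment.
defaults = read_bench_config({})
quick = read_bench_config({"VLLM_BENCH_NUM_TRIALS": "2", "VLLM_BENCH_BATCH_SIZE": "32"})
print(defaults["model"], defaults["max_batch_size"])
print(quick["num_trials"], quick["max_batch_size"])
```

Accepting an optional mapping instead of always reading `os.environ` makes the parsing easy to test without mutating the process environment.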
Outputs
| Name | Type | Description |
|---|---|---|
| Comparison report | stdout | Printed comparison of baseline vs batch-invariant mode metrics |
| init_time | float | Engine initialization time in seconds |
| avg_time | float | Average time per trial in seconds |
| min_time | float | Minimum trial time in seconds |
| max_time | float | Maximum trial time in seconds |
| throughput | float | Token generation throughput in tokens/second |
| prompts_per_sec | float | Prompts processed per second |
| trial_times | list[float] | Individual trial execution times |
| Exit code | int | 0 on success, 1 if platform requirements not met |
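The per-run metrics above can be derived from raw trial measurements. The following sketch assumes that throughput is total generated tokens divided by total trial time, and that prompts per second is total prompts divided by total trial time; the script may aggregate differently. `summarize_trials` and its parameters are hypothetical names.

```python
from typing import List


def summarize_trials(
    init_time: float,
    trial_times: List[float],
    prompts_per_trial: List[int],
    tokens_per_trial: List[int],
) -> dict:
    """Build a result dict with the keys listed in the Outputs table."""
    if not trial_times:
        raise ValueError("at least one trial is required")
    total_time = sum(trial_times)
    return {
        "init_time": init_time,
        "avg_time": total_time / len(trial_times),
        "min_time": min(trial_times),
        "max_time": max(trial_times),
        # Assumption: aggregate rates over all trials rather than averaging per-trial rates.
        "throughput": sum(tokens_per_trial) / total_time,
        "prompts_per_sec": sum(prompts_per_trial) / total_time,
        "trial_times": list(trial_times),
    }


# Illustrative inputs only; these are not measured results.
result = summarize_trials(
    init_time=20.0,
    trial_times=[2.0, 4.0],
    prompts_per_trial=[8, 16],
    tokens_per_trial=[1024, 2048],
)
print(result)
```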
Usage Examples
# Default benchmark (Qwen3-1.7B on single GPU):
# python benchmarks/benchmark_batch_invariance.py
# Benchmark DeepSeek with 8 GPUs:
# VLLM_BENCH_MODEL="deepseek-ai/DeepSeek-V3" VLLM_BENCH_TP_SIZE=8 \
# python benchmarks/benchmark_batch_invariance.py
# Quick test with fewer trials and smaller batch:
# VLLM_BENCH_NUM_TRIALS=2 VLLM_BENCH_BATCH_SIZE=32 \
# python benchmarks/benchmark_batch_invariance.py