Implementation:Vllm project Vllm Benchmark Batch Invariance

Knowledge Sources	vllm
Domains	Benchmarking, Performance Testing
Last Updated	2026-02-08 00:00 GMT

Overview

Measures the performance overhead of vLLM's VLLM_BATCH_INVARIANT mode by comparing throughput and latency with and without batch invariance enabled.

Description

This Python script benchmarks the impact of the VLLM_BATCH_INVARIANT environment variable, which ensures deterministic outputs regardless of batch composition. It runs the same workload twice: once as a baseline (VLLM_BATCH_INVARIANT=0) and once with batch invariance enabled (VLLM_BATCH_INVARIANT=1). The benchmark generates random prompts, executes multiple trials with varying batch sizes, and reports initialization time, average trial time, throughput (tokens/s), and prompts per second, along with a percentage comparison between modes.
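The percentage comparison between modes can be pictured with a minimal sketch. It assumes each run yields a dict keyed by the metric names listed under Outputs. The helper names `percent_change` and `compare_runs` and the sample numbers are hypothetical and are not taken from the script.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def compare_runs(baseline: dict, invariant: dict) -> dict:
    """Percent change per metric (hypothetical helper, not the script's API)."""
    keys = ("init_time", "avg_time", "throughput", "prompts_per_sec")
    return {key: percent_change(baseline[key], invariant[key]) for key in keys}


# Made-up numbers, only to show the shape of the comparison.
sample_baseline = {"init_time": 20.0, "avg_time": 4.0,
                   "throughput": 4000.0, "prompts_per_sec": 32.0}
sample_invariant = {"init_time": 25.0, "avg_time": 5.0,
                    "throughput": 3200.0, "prompts_per_sec": 25.6}

for metric, delta in compare_runs(sample_baseline, sample_invariant).items():
    print(f"{metric}: {delta:+.1f}%")
```

For the time metrics a positive change means the batch-invariant run was slower; for the throughput metrics a negative change means it was slower.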

Usage

Run this script directly on a machine with a CUDA GPU of the Hopper generation (SM90) or newer; if the platform requirements are not met, it exits with code 1. Configuration is controlled entirely through environment variables (VLLM_BENCH_MODEL, VLLM_BENCH_TP_SIZE, VLLM_BENCH_BATCH_SIZE, etc.). Developers use it to check that batch-invariant mode does not introduce unacceptable performance overhead.
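The platform requirement can be sketched as a comparison on the GPU's compute capability. This is an illustration only: the function name `platform_exit_code` and the choice to pass the capability in as a `(major, minor)` tuple are assumptions, and the script's real check may be written differently. With PyTorch installed, `torch.cuda.get_device_capability()` is one way to obtain such a tuple.

```python
from typing import Optional, Tuple

# SM90 (Hopper) is the minimum named in the usage note above.
REQUIRED_CAPABILITY = (9, 0)


def platform_exit_code(capability: Optional[Tuple[int, int]]) -> int:
    """Return 0 if the device meets the requirement, 1 otherwise.

    `capability` is a (major, minor) compute-capability tuple, or None
    when no CUDA device is available. Hypothetical helper for illustration.
    """
    if capability is None or capability < REQUIRED_CAPABILITY:
        return 1
    return 0


print(platform_exit_code((9, 0)))  # Hopper
print(platform_exit_code((8, 0)))  # Ampere, below the requirement
```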

Code Reference

Source Location

Repository: vllm
File: benchmarks/benchmark_batch_invariance.py
Lines: 1-380

Signature

def _random_prompt(min_words: int = 1024, max_words: int = 1024 * 2) -> str:
    """Generate a random prompt for benchmarking."""

def run_benchmark_with_batch_invariant(
    model: str,
    tp_size: int,
    max_batch_size: int,
    num_trials: int,
    min_prompt: int,
    max_prompt: int,
    max_tokens: int,
    temperature: float,
    gpu_mem_util: float,
    max_model_len: int,
    backend: str,
    batch_invariant: bool,
    seed: int = 12345,
) -> dict:
    """Run the benchmark with the specified configuration."""

def main():
    """Entry point: reads env vars, runs both modes, and compares results."""

Import

# This is a standalone executable script.
# Run directly from the command line:
python benchmarks/benchmark_batch_invariance.py
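As an illustration of what a helper with the `_random_prompt` signature above might do, the sketch below draws a word count between the two bounds and joins randomly chosen words. The vocabulary, the `rng` parameter, and the name `random_prompt_sketch` are assumptions made for this example; the script's own word source is not shown on this page.

```python
import random
from typing import Optional

# Hypothetical vocabulary, chosen only for this sketch.
VOCAB = ("alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta")


def random_prompt_sketch(
    min_words: int = 1024,
    max_words: int = 1024 * 2,
    rng: Optional[random.Random] = None,
) -> str:
    """Build a prompt of between min_words and max_words random words."""
    rng = rng if rng is not None else random.Random()
    num_words = rng.randint(min_words, max_words)  # inclusive on both ends
    return " ".join(rng.choice(VOCAB) for _ in range(num_words))


prompt = random_prompt_sketch(rng=random.Random(12345))
print(len(prompt.split()))
```

Passing a seeded `random.Random` makes the prompts reproducible, which helps when both modes should see the same workload.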

I/O Contract

Inputs

Name	Type	Required	Description
VLLM_BENCH_MODEL	env var (str)	No	Model to benchmark (default: "Qwen/Qwen3-1.7B")
VLLM_BENCH_TP_SIZE	env var (int)	No	Tensor parallel size (default: 1, use 8 for DeepSeek)
VLLM_BENCH_BATCH_SIZE	env var (int)	No	Maximum batch size (default: 128)
VLLM_BENCH_NUM_TRIALS	env var (int)	No	Number of trials to run (default: 5)
VLLM_BENCH_MIN_PROMPT	env var (int)	No	Minimum prompt length in words (default: 1024)
VLLM_BENCH_MAX_PROMPT	env var (int)	No	Maximum prompt length in words (default: 2048)
VLLM_BENCH_MAX_TOKENS	env var (int)	No	Maximum tokens to generate (default: 128)
VLLM_BENCH_TEMPERATURE	env var (float)	No	Sampling temperature (default: 0.0)
VLLM_BENCH_GPU_MEMORY_UTILIZATION	env var (float)	No	GPU memory utilization fraction (default: 0.4)
VLLM_BENCH_MAX_MODEL_LEN	env var (int)	No	Maximum model sequence length (default: 5120)
VLLM_BENCH_BACKEND	env var (str)	No	Attention backend (default: "FLASH_ATTN")
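A minimal sketch of reading these variables with the defaults from the table follows. The function name `read_bench_config` and the returned key names are hypothetical; only the variable names and default values come from the table.

```python
import os
from typing import Mapping, Optional


def read_bench_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Parse the VLLM_BENCH_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "model": env.get("VLLM_BENCH_MODEL", "Qwen/Qwen3-1.7B"),
        "tp_size": int(env.get("VLLM_BENCH_TP_SIZE", "1")),
        "max_batch_size": int(env.get("VLLM_BENCH_BATCH_SIZE", "128")),
        "num_trials": int(env.get("VLLM_BENCH_NUM_TRIALS", "5")),
        "min_prompt": int(env.get("VLLM_BENCH_MIN_PROMPT", "1024")),
        "max_prompt": int(env.get("VLLM_BENCH_MAX_PROMPT", "2048")),
        "max_tokens": int(env.get("VLLM_BENCH_MAX_TOKENS", "128")),
        "temperature": float(env.get("VLLM_BENCH_TEMPERATURE", "0.0")),
        "gpu_mem_util": float(env.get("VLLM_BENCH_GPU_MEMORY_UTILIZATION", "0.4")),
        "max_model_len": int(env.get("VLLM_BENCH_MAX_MODEL_LEN", "5120")),
        "backend": env.get("VLLM_BENCH_BACKEND", "FLASH_ATTN"),
    }


# Passing a plain dict instead of os.environ keeps the example reproducible.
config = read_bench_config({"VLLM_BENCH_NUM_TRIALS": "2", "VLLM_BENCH_BATCH_SIZE": "32"})
print(config["num_trials"], config["max_batch_size"], config["model"])
```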

Outputs

Name	Type	Description
Comparison report	stdout	Printed comparison of baseline vs batch-invariant mode metrics
init_time	float	Engine initialization time in seconds
avg_time	float	Average time per trial in seconds
min_time	float	Minimum trial time in seconds
max_time	float	Maximum trial time in seconds
throughput	float	Token generation throughput in tokens/second
prompts_per_sec	float	Prompts processed per second
trial_times	list[float]	Individual trial execution times
Exit code	int	0 on success, 1 if platform requirements not met
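The timing fields above can be derived from the list of per-trial times. The sketch below is one plausible way to do so, not the script's confirmed formulas: it assumes throughput is the tokens generated per trial divided by the average trial time, and likewise for prompts per second. The name `summarize_trials` and its parameters are hypothetical.

```python
from statistics import mean


def summarize_trials(
    trial_times: list,
    tokens_per_trial: int,
    prompts_per_trial: int,
) -> dict:
    """Aggregate per-trial wall-clock times into the reported metrics.

    Assumes every trial generates the same number of tokens and prompts.
    """
    if not trial_times:
        raise ValueError("trial_times must not be empty")
    avg_time = mean(trial_times)
    return {
        "avg_time": avg_time,
        "min_time": min(trial_times),
        "max_time": max(trial_times),
        "throughput": tokens_per_trial / avg_time,        # tokens/second
        "prompts_per_sec": prompts_per_trial / avg_time,  # prompts/second
        "trial_times": list(trial_times),
    }


# Made-up timings, only to show the shape of the result.
summary = summarize_trials([2.0, 4.0, 6.0], tokens_per_trial=4000, prompts_per_trial=32)
print(summary["avg_time"], summary["throughput"], summary["prompts_per_sec"])
```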

Usage Examples

# Default benchmark (Qwen3-1.7B on single GPU):
python benchmarks/benchmark_batch_invariance.py

# Benchmark DeepSeek with 8 GPUs:
VLLM_BENCH_MODEL="deepseek-ai/DeepSeek-V3" VLLM_BENCH_TP_SIZE=8 \
    python benchmarks/benchmark_batch_invariance.py

# Quick test with fewer trials and smaller batch:
VLLM_BENCH_NUM_TRIALS=2 VLLM_BENCH_BATCH_SIZE=32 \
    python benchmarks/benchmark_batch_invariance.py

Related Pages

Environment:Vllm_project_Vllm_CUDA_Hopper
