

Implementation:Vllm project Vllm Benchmark Batch Invariance

From Leeroopedia


Knowledge Sources
Domains: Benchmarking, Performance Testing
Last Updated: 2026-02-08 00:00 GMT

Overview

Measures the performance overhead of vLLM's VLLM_BATCH_INVARIANT mode by comparing throughput and latency with and without batch invariance enabled.

Description

This Python script benchmarks the impact of the VLLM_BATCH_INVARIANT environment variable, which ensures deterministic outputs regardless of batch composition. It runs the same workload twice: once as a baseline (VLLM_BATCH_INVARIANT=0) and once with batch invariance enabled (VLLM_BATCH_INVARIANT=1). The benchmark generates random prompts, executes multiple trials with varying batch sizes, and reports initialization time, average trial time, throughput (tokens/s), and prompts per second, along with a percentage comparison between modes.
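The percentage comparison between the two modes can be sketched as a relative change against the baseline. This is a minimal illustration, not the script's actual reporting code: the helper name `percent_change` and the sample numbers are hypothetical, and only the metric names come from this page.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Return the change of candidate relative to baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


# Hypothetical numbers for illustration only (not measured results).
baseline = {"avg_time": 2.0, "throughput": 8000.0}
invariant = {"avg_time": 2.5, "throughput": 6400.0}

for key in ("avg_time", "throughput"):
    delta = percent_change(baseline[key], invariant[key])
    print(f"{key}: {delta:+.1f}% vs baseline")
```

A positive value for avg_time or a negative value for throughput indicates overhead from batch-invariant mode.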

Usage

Run this script directly on a machine with NVIDIA Hopper (SM90, compute capability 9.0) or newer GPUs. Configuration is controlled entirely through environment variables such as VLLM_BENCH_MODEL, VLLM_BENCH_TP_SIZE, and VLLM_BENCH_BATCH_SIZE; the full list appears under Inputs. Developers use the script to check that batch-invariant mode does not introduce unacceptable performance overhead.
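The hardware requirement can be checked before launching the benchmark. The sketch below is an assumption about how such a check might look, not the script's own platform check: `is_sm90_or_newer` is a hypothetical helper that compares a (major, minor) compute capability tuple against 9.0.

```python
def is_sm90_or_newer(capability: tuple) -> bool:
    """Return True if a (major, minor) compute capability is at least 9.0."""
    return tuple(capability) >= (9, 0)


# On a CUDA machine with PyTorch installed, the tuple can be obtained with
# torch.cuda.get_device_capability() (not executed in this sketch):
#
#   import torch
#   ok = torch.cuda.is_available() and is_sm90_or_newer(
#       torch.cuda.get_device_capability()
#   )
```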

Code Reference

Source Location

Signature

def _random_prompt(min_words: int = 1024, max_words: int = 1024 * 2) -> str:
    """Generate a random prompt for benchmarking."""

def run_benchmark_with_batch_invariant(
    model: str,
    tp_size: int,
    max_batch_size: int,
    num_trials: int,
    min_prompt: int,
    max_prompt: int,
    max_tokens: int,
    temperature: float,
    gpu_mem_util: float,
    max_model_len: int,
    backend: str,
    batch_invariant: bool,
    seed: int = 12345,
) -> dict:
    """Run the benchmark with the specified configuration."""

def main():
    """Entry point: reads env vars, runs both modes, and compares results."""

Import

# This is a standalone executable script.
# Run directly from the command line:
python benchmarks/benchmark_batch_invariance.py

I/O Contract

Inputs

Name | Type | Required | Description
VLLM_BENCH_MODEL | env var (str) | No | Model to benchmark (default: "Qwen/Qwen3-1.7B")
VLLM_BENCH_TP_SIZE | env var (int) | No | Tensor parallel size (default: 1; use 8 for DeepSeek)
VLLM_BENCH_BATCH_SIZE | env var (int) | No | Maximum batch size (default: 128)
VLLM_BENCH_NUM_TRIALS | env var (int) | No | Number of trials to run (default: 5)
VLLM_BENCH_MIN_PROMPT | env var (int) | No | Minimum prompt length in words (default: 1024)
VLLM_BENCH_MAX_PROMPT | env var (int) | No | Maximum prompt length in words (default: 2048)
VLLM_BENCH_MAX_TOKENS | env var (int) | No | Maximum tokens to generate (default: 128)
VLLM_BENCH_TEMPERATURE | env var (float) | No | Sampling temperature (default: 0.0)
VLLM_BENCH_GPU_MEMORY_UTILIZATION | env var (float) | No | GPU memory utilization fraction (default: 0.4)
VLLM_BENCH_MAX_MODEL_LEN | env var (int) | No | Maximum model sequence length (default: 5120)
VLLM_BENCH_BACKEND | env var (str) | No | Attention backend (default: FLASH_ATTN)
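Reading these variables with their documented defaults can be sketched as follows. The variable names and default values are taken from the table above; the helper `read_bench_config` and the dictionary keys are hypothetical, and the real script may parse its environment differently.

```python
import os


def read_bench_config(env=None) -> dict:
    """Collect benchmark settings from environment variables, applying defaults."""
    env = os.environ if env is None else env
    return {
        "model": env.get("VLLM_BENCH_MODEL", "Qwen/Qwen3-1.7B"),
        "tp_size": int(env.get("VLLM_BENCH_TP_SIZE", "1")),
        "max_batch_size": int(env.get("VLLM_BENCH_BATCH_SIZE", "128")),
        "num_trials": int(env.get("VLLM_BENCH_NUM_TRIALS", "5")),
        "min_prompt": int(env.get("VLLM_BENCH_MIN_PROMPT", "1024")),
        "max_prompt": int(env.get("VLLM_BENCH_MAX_PROMPT", "2048")),
        "max_tokens": int(env.get("VLLM_BENCH_MAX_TOKENS", "128")),
        "temperature": float(env.get("VLLM_BENCH_TEMPERATURE", "0.0")),
        "gpu_mem_util": float(env.get("VLLM_BENCH_GPU_MEMORY_UTILIZATION", "0.4")),
        "max_model_len": int(env.get("VLLM_BENCH_MAX_MODEL_LEN", "5120")),
        "backend": env.get("VLLM_BENCH_BACKEND", "FLASH_ATTN"),
    }


# Passing a plain dict instead of os.environ makes the parsing easy to test.
config = read_bench_config({"VLLM_BENCH_TP_SIZE": "8"})
print(config["model"], config["tp_size"])
```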

Outputs

Name | Type | Description
Comparison report | stdout | Printed comparison of baseline vs batch-invariant mode metrics
init_time | float | Engine initialization time in seconds
avg_time | float | Average time per trial in seconds
min_time | float | Minimum trial time in seconds
max_time | float | Maximum trial time in seconds
throughput | float | Token generation throughput in tokens/second
prompts_per_sec | float | Prompts processed per second
trial_times | list[float] | Individual trial execution times
Exit code | int | 0 on success, 1 if platform requirements are not met
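The per-run metrics above can be derived from the list of trial times. The sketch below assumes that throughput and prompts per second are totals divided by the summed trial time; the page does not state the script's exact formulas, and `summarize_trials` and its parameters are hypothetical names.

```python
def summarize_trials(trial_times, total_tokens, total_prompts, init_time):
    """Aggregate per-trial timings into the metric names listed above."""
    if not trial_times:
        raise ValueError("at least one trial is required")
    total_time = sum(trial_times)
    return {
        "init_time": init_time,
        "avg_time": total_time / len(trial_times),
        "min_time": min(trial_times),
        "max_time": max(trial_times),
        # Assumption: rates are computed over the summed trial time.
        "throughput": total_tokens / total_time,
        "prompts_per_sec": total_prompts / total_time,
        "trial_times": list(trial_times),
    }


# Hypothetical inputs for illustration only (not measured results).
stats = summarize_trials([1.0, 2.0, 3.0], total_tokens=600, total_prompts=12, init_time=10.0)
print(stats["avg_time"], stats["throughput"], stats["prompts_per_sec"])
```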

Usage Examples

# Default benchmark (Qwen3-1.7B on a single GPU):
python benchmarks/benchmark_batch_invariance.py

# Benchmark DeepSeek with 8 GPUs:
VLLM_BENCH_MODEL="deepseek-ai/DeepSeek-V3" VLLM_BENCH_TP_SIZE=8 \
    python benchmarks/benchmark_batch_invariance.py

# Quick test with fewer trials and a smaller batch:
VLLM_BENCH_NUM_TRIALS=2 VLLM_BENCH_BATCH_SIZE=32 \
    python benchmarks/benchmark_batch_invariance.py
