Heuristic:Vllm project Vllm Batch Size Hardware Scaling

From Leeroopedia



Metadata

Field               Value
Heuristic ID        Batch_Size_Hardware_Scaling
Project             Vllm_project_Vllm
Category            Performance Tuning
Scope               Engine configuration, hardware-aware defaults
Primary Source      vllm/engine/arg_utils.py:1848-1929 (get_batch_defaults method)
Supporting Sources  vllm/config.py (SchedulerConfig), vllm/envs.py:1507-1510
Status              Active

Overview

vLLM dynamically selects default values for max_num_batched_tokens and max_num_seqs based on the detected hardware platform and the usage context (UsageContext.LLM_CLASS for offline use of the LLM class vs. UsageContext.OPENAI_API_SERVER for the OpenAI-compatible API server). This heuristic captures the tribal knowledge encoded in the get_batch_defaults method and related configuration, explaining why different hardware receives different defaults and when an operator should override them.

Description

The get_batch_defaults method in vllm/engine/arg_utils.py implements a tiered decision tree that maps hardware characteristics to batch-size defaults. The logic branches on three axes:

  • Device type -- GPU, TPU, or CPU
  • Device memory -- specifically whether VRAM is >= 70 GiB
  • Device name -- an explicit check to exclude A100s from the high-memory tier
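
The overall shape of the selection can be summarized as follows. This is a simplified, paraphrased sketch of the tiers described on this page, not the verbatim source:

GiB_bytes = 1 << 30  # 1 GiB

def batch_default_tier(device_type: str, device_memory: int, device_name: str) -> str:
    """Return which default tier applies (paraphrased from this page, not vLLM source)."""
    if device_type == "gpu":
        # High-memory tier, with the explicit A100 carve-out (PR #17885)
        if device_memory >= 70 * GiB_bytes and "a100" not in device_name:
            return "H100/H200 tier: 16384 / 8192 tokens, 1024 seqs"
        return "A100-and-smaller tier: 8192 / 2048 tokens, 256 seqs"
    if device_type == "tpu":
        return "per-generation TPU defaults (see table below)"
    if device_type == "cpu":
        return "CPU defaults scale linearly with world_size"
    return "global fallback: DEFAULT_MAX_NUM_SEQS = 128"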

GPU Defaults

H100 / H200-class GPUs (>= 70 GiB VRAM, not A100)

if device_memory >= 70 * GiB_bytes and "a100" not in device_name:
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 16384,
        UsageContext.OPENAI_API_SERVER: 8192,
    }
    default_max_num_seqs = {
        UsageContext.LLM_CLASS: 1024,
        UsageContext.OPENAI_API_SERVER: 1024,
    }

These GPUs have both the memory capacity and the memory bandwidth to sustain large batch sizes without bottlenecking.

A100 and Smaller GPUs

else:
    default_max_num_batched_tokens = {
        UsageContext.LLM_CLASS: 8192,
        UsageContext.OPENAI_API_SERVER: 2048,
    }
    default_max_num_seqs = {
        UsageContext.LLM_CLASS: 256,
        UsageContext.OPENAI_API_SERVER: 256,
    }

The A100 is explicitly excluded from the high-memory tier despite having 80 GiB of VRAM. The comment at arg_utils.py:1874-1876 explains:

# NOTE(Kuntai): Setting large `max_num_batched_tokens` for A100 reduces
# throughput, see PR #17885 for more details.
# So here we do an extra device name check to prevent such regression.

This is a critical piece of tribal knowledge: naively raising max_num_batched_tokens on A100 hardware actually degrades throughput, contrary to the intuition that more memory should permit larger batches. The root cause (documented in PR #17885) relates to the A100's different compute-to-memory-bandwidth ratio compared to H100-class parts.

TPU Defaults

TPU defaults are set per chip generation (arg_utils.py:1898-1916); all values below are max_num_batched_tokens:

TPU Generation   LLM_CLASS   OPENAI_API_SERVER
V6E              2048        1024
V5E              1024        512
V5P              512         256

The values decrease from V6E to V5P, reflecting the different memory and compute profiles of each TPU generation.
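
Restated as data, the table corresponds to a per-generation mapping along the same usage-context split used for the GPU tiers. This is a paraphrase of the table, with illustrative key names rather than the source's:

# Values copied from the table above; dictionary keys are illustrative only.
tpu_default_max_num_batched_tokens = {
    "V6E": {"LLM_CLASS": 2048, "OPENAI_API_SERVER": 1024},
    "V5E": {"LLM_CLASS": 1024, "OPENAI_API_SERVER": 512},
    "V5P": {"LLM_CLASS": 512,  "OPENAI_API_SERVER": 256},
}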

CPU Defaults

CPU defaults scale linearly with world_size (the number of parallel workers), as documented at arg_utils.py:1918-1927:

  • LLM_CLASS: 4096 * world_size
  • OPENAI_API_SERVER: 2048 * world_size

This linear scaling is appropriate because CPU inference distributes work across cores and NUMA nodes, so adding workers proportionally increases aggregate throughput capacity.
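
As a concrete illustration of the scaling rule (the multipliers are from the source; the helper function is hypothetical):

def cpu_default_max_num_batched_tokens(world_size: int, api_server: bool) -> int:
    # Linear scaling documented at arg_utils.py:1918-1927
    return (2048 if api_server else 4096) * world_size

# e.g. 4 workers -> 16384 tokens offline, 8192 tokens behind the API server
assert cpu_default_max_num_batched_tokens(4, api_server=False) == 16384
assert cpu_default_max_num_batched_tokens(4, api_server=True) == 8192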

Global Defaults

The scheduler configuration defines a global fallback:

DEFAULT_MAX_NUM_SEQS: ClassVar[int] = 128

This value is used when no hardware-specific override is applied.

Shared Experts Stream Threshold

A related tuning knob in vllm/envs.py:1507-1510 interacts with batch size:

# We found out that for large batch sizes, the separate stream
# execution is not beneficial (most likely because of the input clone)
"VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD": 256

When the number of tokens in a batch exceeds 256, vLLM disables the separate-stream optimization for shared experts in Mixture-of-Experts models, because the overhead of cloning the input outweighs the benefit of parallel execution at larger batch sizes.
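
The gating amounts to a simple token-count check. A minimal sketch of the pattern, with hypothetical names (the actual logic lives in the MoE layers and is not shown here):

SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD = 256  # from VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD

def use_separate_shared_experts_stream(num_tokens: int) -> bool:
    # Below the threshold, overlapping shared-expert execution on its own
    # stream pays off; above it, the cost of cloning the input outweighs
    # the overlap benefit, so the main stream is used instead.
    return num_tokens <= SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD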

Usage

When to Apply This Heuristic

  • Deploying vLLM on new hardware -- check which tier the hardware falls into and verify the auto-detected defaults match expectations.
  • Troubleshooting OOM errors -- if running on consumer GPUs (e.g., an RTX 4090 with 24 GiB), the defaults (8192/2048) may still be too high; reduce max_num_batched_tokens accordingly (see the sketch after this list).
  • Optimizing throughput on A100 -- do not increase max_num_batched_tokens beyond the defaults. This is a known regression documented in PR #17885.
  • Scaling CPU inference -- understand that defaults grow with world_size; adjust if memory is constrained.
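
For the consumer-GPU troubleshooting case, a reduced configuration might look like the following. The halved values are illustrative starting points only, not vLLM defaults or recommendations; profile on your own hardware:

from vllm import LLM

# Illustrative values for a 24 GiB card -- not taken from the vLLM source; tune as needed.
llm = LLM(
    model="...",
    max_num_batched_tokens=4096,  # half of the 8192 offline default
    max_num_seqs=128,             # matches the global DEFAULT_MAX_NUM_SEQS fallback
)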

How to Override

Pass explicit values via the engine arguments:

--max-num-batched-tokens 8192 --max-num-seqs 256

Or in the Python API:

from vllm import LLM

llm = LLM(model="...", max_num_batched_tokens=8192, max_num_seqs=256)

The Insight (Rule of Thumb)

Let vLLM auto-detect batch sizes based on the GPU; override only when you have a specific reason.

  • H100 / H200 (>= 70 GiB, not A100): max_num_batched_tokens = 16384 (LLM) / 8192 (API server). These GPUs can sustain large batches.
  • A100: Do NOT set high max_num_batched_tokens -- it reduces throughput (PR #17885 finding). Stick with 8192 (LLM) / 2048 (API server).
  • Consumer GPUs (< 70 GiB): Start with defaults (8192 / 2048). Reduce if encountering OOM errors.
  • TPU: Follow the generation-specific defaults (V6E > V5E > V5P).
  • CPU: Defaults scale linearly with world_size (4096 * world_size / 2048 * world_size).

Trade-off: Larger batches increase throughput but consume more memory and can increase per-request latency. On some hardware (notably A100), larger batches are actively counterproductive due to compute/bandwidth characteristics.

Reasoning

The batch-size defaults encode several non-obvious insights:

  1. Memory alone does not determine optimal batch size. The A100 has 80 GiB of HBM2e, which exceeds the 70 GiB threshold, yet it is explicitly excluded from the high-batch tier. The bottleneck on A100 is not memory capacity but the interaction between batch size and compute throughput (see PR #17885).
  2. Usage context matters. The LLM_CLASS context (offline batch processing) consistently receives higher defaults than OPENAI_API_SERVER (online serving). Offline workloads prioritize throughput over latency, so larger batches are appropriate.
  3. TPU defaults are conservative. Compared to GPU defaults, TPU batch sizes are notably smaller, reflecting the different memory architecture and compilation constraints of TPU workloads.
  4. CPU scaling is linear. Unlike GPUs where a single device handles the full batch, CPU inference distributes across workers, so batch capacity scales proportionally.
  5. Shared expert streams break down at scale. The VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD of 256 tokens reflects a crossover point where the cost of input cloning exceeds the benefit of parallel stream execution in MoE models.

These defaults represent accumulated benchmarking results and regression fixes. Changing them without hardware-specific profiling is likely to degrade performance.
