Heuristic:Vllm_project_Vllm_Batch_Size_Hardware_Scaling
Metadata
| Field | Value |
|---|---|
| Heuristic ID | Batch_Size_Hardware_Scaling |
| Project | Vllm_project_Vllm |
| Category | Performance Tuning |
| Scope | Engine configuration, hardware-aware defaults |
| Primary Source | vllm/engine/arg_utils.py:1848-1929 (get_batch_defaults method) |
| Supporting Sources | vllm/config.py (SchedulerConfig), vllm/envs.py:1507-1510 |
| Status | Active |
Overview
vLLM dynamically selects default values for max_num_batched_tokens and max_num_seqs based on the detected hardware platform and the usage context (the offline LLM class vs. the OpenAI-compatible API server). This heuristic captures the tribal knowledge encoded in the get_batch_defaults method and related configuration, explaining why certain hardware receives different defaults and when an operator should override them.
Description
The get_batch_defaults method in vllm/engine/arg_utils.py implements a tiered decision tree that maps hardware characteristics to batch-size defaults. The logic branches on three axes:
- Device type -- GPU, TPU, or CPU
- Device memory -- specifically whether VRAM is >= 70 GiB
- Device name -- an explicit check to exclude A100s from the high-memory tier
GPU Defaults
H100 / H200-class GPUs (>= 70 GiB VRAM, not A100)
if device_memory >= 70 * GiB_bytes and "a100" not in device_name:
default_max_num_batched_tokens = {
UsageContext.LLM_CLASS: 16384,
UsageContext.OPENAI_API_SERVER: 8192,
}
default_max_num_seqs = {
UsageContext.LLM_CLASS: 1024,
UsageContext.OPENAI_API_SERVER: 1024,
}
These GPUs have both the memory capacity and the memory bandwidth to sustain large batch sizes without bottlenecking.
A100 and Smaller GPUs
else:
default_max_num_batched_tokens = {
UsageContext.LLM_CLASS: 8192,
UsageContext.OPENAI_API_SERVER: 2048,
}
default_max_num_seqs = {
UsageContext.LLM_CLASS: 256,
UsageContext.OPENAI_API_SERVER: 256,
}
The A100 is explicitly excluded from the high-memory tier despite having 80 GiB of VRAM. The comment at arg_utils.py:1874-1876 explains:
# NOTE(Kuntai): Setting large `max_num_batched_tokens` for A100 reduces
# throughput, see PR #17885 for more details.
# So here we do an extra device name check to prevent such regression.
This is a critical piece of tribal knowledge: naively raising max_num_batched_tokens on A100 hardware actually degrades throughput, contrary to the intuition that more memory should permit larger batches. The root cause (documented in PR #17885) relates to the A100's different compute-to-memory-bandwidth ratio compared to H100-class parts.
TPU Defaults
TPU defaults are set per chip generation (arg_utils.py:1898-1916):
| TPU Generation | LLM_CLASS (max_num_batched_tokens) | OPENAI_API_SERVER (max_num_batched_tokens) |
|---|---|---|
| V6E | 2048 | 1024 |
| V5E | 1024 | 512 |
| V5P | 512 | 256 |
The values decrease from V6E to V5P, reflecting the different memory and compute profiles of each TPU generation.
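The table reads as a simple per-generation lookup. A minimal sketch (the dictionary layout and function name are illustrative, not vLLM's actual code in arg_utils.py):

```python
# Illustrative lookup of TPU defaults for max_num_batched_tokens.
# Keys and values mirror the table above; the structure is a sketch.
TPU_BATCHED_TOKEN_DEFAULTS = {
    "v6e": {"LLM_CLASS": 2048, "OPENAI_API_SERVER": 1024},
    "v5e": {"LLM_CLASS": 1024, "OPENAI_API_SERVER": 512},
    "v5p": {"LLM_CLASS": 512, "OPENAI_API_SERVER": 256},
}

def tpu_default_max_num_batched_tokens(generation: str, usage_context: str) -> int:
    # Normalize the generation name so "V6E" and "v6e" behave the same.
    return TPU_BATCHED_TOKEN_DEFAULTS[generation.lower()][usage_context]
```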
CPU Defaults
CPU defaults scale linearly with world_size (the number of parallel workers), as documented at arg_utils.py:1918-1927:
- LLM_CLASS: 4096 * world_size
- OPENAI_API_SERVER: 2048 * world_size
This linear scaling is appropriate because CPU inference distributes work across cores and NUMA nodes, so adding workers proportionally increases aggregate throughput capacity.
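As a toy illustration of that scaling rule (the function name is hypothetical; the per-worker constants come from the defaults above):

```python
def cpu_default_max_num_batched_tokens(world_size: int, offline: bool) -> int:
    # Offline LLM_CLASS usage gets 4096 tokens per worker;
    # OPENAI_API_SERVER usage gets 2048 per worker.
    per_worker = 4096 if offline else 2048
    return per_worker * world_size
```

With four workers, an offline run would default to 16384 batched tokens, while an API server deployment would default to 8192.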
Global Defaults
The scheduler configuration defines a global fallback:
DEFAULT_MAX_NUM_SEQS: ClassVar[int] = 128
This value is used when no hardware-specific override is applied.
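The fallback pattern can be sketched as follows (a simplified stand-in for SchedulerConfig, not the real class):

```python
from dataclasses import dataclass
from typing import ClassVar, Optional

@dataclass
class SchedulerConfigSketch:
    # Class-level fallback, mirroring DEFAULT_MAX_NUM_SEQS in vllm/config.py.
    DEFAULT_MAX_NUM_SEQS: ClassVar[int] = 128
    # Set by hardware-aware defaults or an explicit user override; None otherwise.
    max_num_seqs: Optional[int] = None

    def resolved_max_num_seqs(self) -> int:
        if self.max_num_seqs is not None:
            return self.max_num_seqs
        return self.DEFAULT_MAX_NUM_SEQS
```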
A related tuning knob in vllm/envs.py:1507-1510 interacts with batch size:
# We found out that for large batch sizes, the separate stream
# execution is not beneficial (most likely because of the input clone)
"VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD": 256
When the number of tokens in a batch exceeds 256, vLLM disables the separate-stream optimization for shared experts in Mixture-of-Experts models, because the overhead of cloning the input outweighs the benefit of parallel execution at larger batch sizes.
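The decision this threshold drives can be sketched like so (the function is illustrative; vLLM's actual check lives in the MoE execution path):

```python
# Default threshold from vllm/envs.py (overridable via the environment variable).
VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD = 256

def use_separate_shared_experts_stream(num_tokens: int) -> bool:
    # At or below the threshold, running shared experts on a separate stream
    # overlaps useful work; above it, the input clone required for the second
    # stream costs more than the overlap saves, so the optimization is disabled.
    return num_tokens <= VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD
```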
Usage
When to Apply This Heuristic
- Deploying vLLM on new hardware -- check which tier the hardware falls into and verify the auto-detected defaults match expectations.
- Troubleshooting OOM errors -- if running on consumer GPUs (e.g., RTX 4090 with 24 GiB), the defaults (8192/2048) may still be too high; reduce max_num_batched_tokens accordingly.
- Optimizing throughput on A100 -- do not increase max_num_batched_tokens beyond the defaults. This is a known regression documented in PR #17885.
- Scaling CPU inference -- understand that defaults grow with world_size; adjust if memory is constrained.
How to Override
Pass explicit values via the engine arguments:
--max-num-batched-tokens 8192 --max-num-seqs 256
Or in the Python API:
llm = LLM(model="...", max_num_batched_tokens=8192, max_num_seqs=256)
The Insight (Rule of Thumb)
Let vLLM auto-detect batch sizes based on the GPU; override only when you have a specific reason.
- H100 / H200 (>= 70 GiB, not A100): max_num_batched_tokens = 16384 (LLM) / 8192 (API server). These GPUs can sustain large batches.
- A100: Do NOT set high max_num_batched_tokens -- it reduces throughput (PR #17885 finding). Stick with 8192 (LLM) / 2048 (API server).
- Consumer GPUs (< 70 GiB): Start with defaults (8192 / 2048). Reduce if encountering OOM errors.
- TPU: Follow the generation-specific defaults (V6E > V5E > V5P).
- CPU: Defaults scale linearly with world_size (4096 * world_size / 2048 * world_size).
Trade-off: Larger batches increase throughput but consume more memory and can increase per-request latency. On some hardware (notably A100), larger batches are actively counterproductive due to compute/bandwidth characteristics.
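The GPU portion of the rule of thumb condenses into a small helper. This is a hypothetical sketch over the values above, not part of vLLM's API:

```python
GiB = 1 << 30  # bytes per GiB

def recommended_max_num_batched_tokens(device_memory: int, device_name: str,
                                       api_server: bool) -> int:
    """Pick a starting max_num_batched_tokens per the GPU tiers above."""
    name = device_name.lower()
    # High-memory tier, with the explicit A100 exclusion (PR #17885).
    if device_memory >= 70 * GiB and "a100" not in name:
        return 8192 if api_server else 16384
    # A100 and smaller GPUs (including consumer cards): conservative tier.
    return 2048 if api_server else 8192
```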
Reasoning
The batch-size defaults encode several non-obvious insights:
- Memory alone does not determine optimal batch size. The A100 has 80 GiB of HBM2e, which exceeds the 70 GiB threshold, yet it is explicitly excluded from the high-batch tier. The bottleneck on A100 is not memory capacity but the interaction between batch size and compute throughput (see PR #17885).
- Usage context matters. The LLM_CLASS context (offline batch processing) consistently receives higher defaults than OPENAI_API_SERVER (online serving). Offline workloads prioritize throughput over latency, so larger batches are appropriate.
- TPU defaults are conservative. Compared to GPU defaults, TPU batch sizes are notably smaller, reflecting the different memory architecture and compilation constraints of TPU workloads.
- CPU scaling is linear. Unlike GPUs where a single device handles the full batch, CPU inference distributes across workers, so batch capacity scales proportionally.
- Shared expert streams break down at scale. The VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD of 256 tokens reflects a crossover point where the cost of input cloning exceeds the benefit of parallel stream execution in MoE models.
These defaults represent accumulated benchmarking results and regression fixes. Changing them without hardware-specific profiling is likely to degrade performance.
Related Pages
- Implementation:Vllm_project_Vllm_EngineArgs_Init -- Where engine arguments are parsed and get_batch_defaults is invoked
- Implementation:Vllm_project_Vllm_LLM_Init -- The offline LLM class initialization that consumes these defaults
- Principle:Vllm_project_Vllm_Engine_Configuration -- Broader principles governing engine configuration choices