Heuristic: InternLM LMDeploy Max Batch Size Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Inference |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Device-aware automatic batch size selection that maps GPU hardware to optimal maximum batch sizes, ranging from 128 (default) to 1024 (H100/H200).
Description
LMDeploy auto-detects the GPU model via `torch.cuda.get_device_name()` and selects an appropriate `max_batch_size` if the user does not explicitly specify one. This prevents oversubscription on smaller GPUs while maximizing throughput on high-end hardware. The mapping is based on GPU memory bandwidth and VRAM capacity. Non-CUDA devices (Ascend, MACA, Cambricon) use a fixed default of 256.
Usage
Use this heuristic when:
- You leave `max_batch_size=None` (the default) and want to understand what value will be chosen.
- You are tuning for maximum throughput on a specific GPU.
- You are deploying to mixed GPU hardware and need consistent behavior.
- You encounter scheduling or memory issues from an inappropriate batch size.
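When you do want to pin the value instead of relying on auto-detection, it can be set explicitly on the engine config. The snippet below is a configuration sketch assuming LMDeploy's Python API (`pipeline` plus `PytorchEngineConfig`); the model name is illustrative:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# An explicit max_batch_size disables the device-based auto-selection
config = PytorchEngineConfig(max_batch_size=256)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=config)
```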
The Insight (Rule of Thumb)
- GPU Batch Size Map:
- A100 / A800: 384
- H100 / H800 / H200 / L20Y: 1024
- All other CUDA GPUs (3090, 4090, etc.): 128 (conservative default)
- Ascend / MACA / Cambricon: 256
- Override: Set `max_batch_size` explicitly in engine config to override auto-detection.
- Trade-off: Larger batch size increases throughput but requires more KV cache memory. If memory is tight, reduce batch size manually.
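The trade-off above can be made concrete with rough arithmetic. The model shape (32 layers, 8 KV heads, head dim 128) and the 40 GB KV budget below are illustrative assumptions, not LMDeploy defaults:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # K and V tensors each store num_layers * num_kv_heads * head_dim values
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative 7B-class model with GQA, fp16 cache
per_token = kv_cache_bytes_per_token(32, 8, 128)         # 131072 bytes (128 KiB)
kv_budget = 40 * 1024**3                                 # assume ~40 GB free for KV cache
avg_seq_len = 2048
max_concurrent = kv_budget // (per_token * avg_seq_len)  # 160 sequences
```

Under these assumptions only ~160 sequences fit, so a `max_batch_size` of 1024 would be KV-bound long before the scheduler reaches it.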
Reasoning
High-end datacenter GPUs (H100 with 80GB HBM3, A100 with 80GB HBM2e) have sufficient memory bandwidth and VRAM to handle large batches efficiently. Consumer GPUs (3090 with 24GB, 4090 with 24GB) have less VRAM, so a conservative default of 128 prevents OOM. The mapping was determined through internal benchmarking to balance throughput and stability.
The batch size interacts with `cache_max_entry_count`: a larger batch size means more concurrent KV cache entries needed. If `max_batch_size * avg_sequence_length` exceeds available KV blocks, requests will be queued or rejected.
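That admission check can be sketched as follows, assuming a paged KV cache with a hypothetical block size of 64 tokens (the function name and block counts are illustrative, not LMDeploy internals):

```python
def fits_kv_budget(max_batch_size: int, avg_seq_len: int,
                   total_kv_blocks: int, block_size: int = 64) -> bool:
    # Each sequence needs ceil(avg_seq_len / block_size) KV blocks
    blocks_per_seq = (avg_seq_len + block_size - 1) // block_size
    return max_batch_size * blocks_per_seq <= total_kv_blocks

fits_kv_budget(1024, 2048, total_kv_blocks=10_000)  # False: would need 32768 blocks
fits_kv_budget(128, 2048, total_kv_blocks=10_000)   # True: needs only 4096 blocks
```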
Code evidence from `lmdeploy/utils.py:363-386`:
```python
def get_max_batch_size(device_type: str):
    assert device_type in ['cuda', 'ascend', 'maca', 'camb']
    if device_type == 'cuda':
        max_batch_size_map = {
            'a100': 384, 'a800': 384,
            'h100': 1024, 'h800': 1024,
            'l20y': 1024, 'h200': 1024
        }
        import torch
        device_name = torch.cuda.get_device_name(0).lower()
        for name, size in max_batch_size_map.items():
            if name in device_name:
                return size
        return 128  # default for unknown GPUs
    elif device_type == 'ascend':
        return 256
    elif device_type == 'maca':
        return 256
    elif device_type == 'camb':
        return 256
```
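The substring match can be exercised without a GPU by reproducing the lookup standalone; the device-name strings below are illustrative examples of what `torch.cuda.get_device_name()` returns, not values from the source:

```python
MAX_BATCH_SIZE_MAP = {'a100': 384, 'a800': 384, 'h100': 1024,
                      'h800': 1024, 'l20y': 1024, 'h200': 1024}

def select_max_batch_size(device_name: str) -> int:
    # Mirrors the CUDA branch of get_max_batch_size: first substring hit wins
    name = device_name.lower()
    for key, size in MAX_BATCH_SIZE_MAP.items():
        if key in name:
            return size
    return 128  # conservative default for unrecognized GPUs

select_max_batch_size('NVIDIA H100 80GB HBM3')    # -> 1024
select_max_batch_size('NVIDIA A100-SXM4-80GB')    # -> 384
select_max_batch_size('NVIDIA GeForce RTX 4090')  # -> 128
```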