Heuristic: InternLM LMDeploy Max Batch Size Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Inference |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Device-aware automatic batch size selection that maps GPU hardware to optimal maximum batch sizes, ranging from 128 (default) to 1024 (H100/H200).
Description
LMDeploy auto-detects the GPU model via `torch.cuda.get_device_name()` and selects an appropriate `max_batch_size` if the user does not explicitly specify one. This prevents oversubscription on smaller GPUs while maximizing throughput on high-end hardware. The mapping is based on GPU memory bandwidth and VRAM capacity. Non-CUDA devices (Ascend, MACA, Cambricon) use a fixed default of 256.
Usage
Use this heuristic when:
- You leave `max_batch_size=None` (the default) and want to understand what value will be chosen.
- You are tuning for maximum throughput on a specific GPU.
- You are deploying to mixed GPU hardware and need consistent behavior.
- You encounter scheduling or memory issues from an inappropriate batch size.
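When you do want to pin the value instead of relying on auto-detection, it can be set explicitly on the engine config. The snippet below is a configuration sketch assuming LMDeploy's Python API (`pipeline` plus `PytorchEngineConfig`); the model name is illustrative:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# An explicit max_batch_size disables the device-based auto-selection
config = PytorchEngineConfig(max_batch_size=256)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=config)
```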
The Insight (Rule of Thumb)
- GPU Batch Size Map:
- A100 / A800: 384
- H100 / H800 / H200 / L20Y: 1024
- All other CUDA GPUs (3090, 4090, etc.): 128 (conservative default)
- Ascend / MACA / Cambricon: 256
- Override: Set `max_batch_size` explicitly in engine config to override auto-detection.
- Trade-off: Larger batch size increases throughput but requires more KV cache memory. If memory is tight, reduce batch size manually.
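The trade-off above can be made concrete with rough arithmetic. The model shape (32 layers, 8 KV heads, head dim 128) and the 40 GB KV budget below are illustrative assumptions, not LMDeploy defaults:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # K and V tensors each store num_layers * num_kv_heads * head_dim values
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative 7B-class model with GQA, fp16 cache
per_token = kv_cache_bytes_per_token(32, 8, 128)         # 131072 bytes (128 KiB)
kv_budget = 40 * 1024**3                                 # assume ~40 GB free for KV cache
avg_seq_len = 2048
max_concurrent = kv_budget // (per_token * avg_seq_len)  # 160 sequences
```

Under these assumptions only ~160 sequences fit, so a `max_batch_size` of 1024 would be KV-bound long before the scheduler reaches it.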
Reasoning
High-end datacenter GPUs (H100 with 80GB HBM3, A100 with 80GB HBM2e) have sufficient memory bandwidth and VRAM to handle large batches efficiently. Consumer GPUs (3090 with 24GB, 4090 with 24GB) have less VRAM, so a conservative default of 128 prevents OOM. The mapping was determined through internal benchmarking to balance throughput and stability.
The batch size interacts with `cache_max_entry_count`: a larger batch size means more concurrent KV cache entries needed. If `max_batch_size * avg_sequence_length` exceeds available KV blocks, requests will be queued or rejected.
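That admission check can be sketched as follows, assuming a paged KV cache with a hypothetical block size of 64 tokens (the function name and block counts are illustrative, not LMDeploy internals):

```python
def fits_kv_budget(max_batch_size: int, avg_seq_len: int,
                   total_kv_blocks: int, block_size: int = 64) -> bool:
    # Each sequence needs ceil(avg_seq_len / block_size) KV blocks
    blocks_per_seq = (avg_seq_len + block_size - 1) // block_size
    return max_batch_size * blocks_per_seq <= total_kv_blocks

fits_kv_budget(1024, 2048, total_kv_blocks=10_000)  # False: would need 32768 blocks
fits_kv_budget(128, 2048, total_kv_blocks=10_000)   # True: needs only 4096 blocks
```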
Code evidence from `lmdeploy/utils.py:363-386`:
```python
def get_max_batch_size(device_type: str):
    assert device_type in ['cuda', 'ascend', 'maca', 'camb']
    if device_type == 'cuda':
        max_batch_size_map = {
            'a100': 384, 'a800': 384,
            'h100': 1024, 'h800': 1024,
            'l20y': 1024, 'h200': 1024
        }
        import torch
        device_name = torch.cuda.get_device_name(0).lower()
        for name, size in max_batch_size_map.items():
            if name in device_name:
                return size
        return 128  # default for unknown GPUs
    elif device_type == 'ascend':
        return 256
    elif device_type == 'maca':
        return 256
    elif device_type == 'camb':
        return 256
```
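The substring match can be exercised without a GPU by reproducing the lookup standalone; the device-name strings below are illustrative examples of what `torch.cuda.get_device_name()` returns, not values from the source:

```python
MAX_BATCH_SIZE_MAP = {'a100': 384, 'a800': 384, 'h100': 1024,
                      'h800': 1024, 'l20y': 1024, 'h200': 1024}

def select_max_batch_size(device_name: str) -> int:
    # Mirrors the CUDA branch of get_max_batch_size: first substring hit wins
    name = device_name.lower()
    for key, size in MAX_BATCH_SIZE_MAP.items():
        if key in name:
            return size
    return 128  # conservative default for unrecognized GPUs

select_max_batch_size('NVIDIA H100 80GB HBM3')    # -> 1024
select_max_batch_size('NVIDIA A100-SXM4-80GB')    # -> 384
select_max_batch_size('NVIDIA GeForce RTX 4090')  # -> 128
```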