Heuristic: LMSYS FastChat GPU Memory Allocation Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
Memory management strategy that allocates 85% of available GPU VRAM per device for multi-GPU inference and uses sequential device mapping for heterogeneous GPU setups.
Description
FastChat's model loading uses a conservative 85% GPU memory cap to prevent CUDA OOM errors during inference. When multiple GPUs are available and no explicit memory limit is set, it queries each GPU's available memory, caps model placement at 85% of it, and uses `device_map="sequential"` (instead of `"auto"`) to fill GPUs one at a time. This handles heterogeneous GPU configurations where GPUs have different VRAM sizes. After each inference request, all device caches are explicitly cleared (CUDA, XPU, NPU) and Python garbage collection is run.
Usage
Use this heuristic when deploying models across multiple GPUs or when tuning GPU memory allocation for inference workers. The 85% limit provides a safety margin for intermediate activations during generation while maximizing model capacity.
The Insight (Rule of Thumb)
- Action: Let FastChat auto-detect available GPU memory and allocate 85%, or set explicit limits via `--max-gpu-memory`.
- Value: 85% of available VRAM per GPU (0.85 multiplier); vLLM and SGLang workers default to 90% (`--gpu_memory_utilization 0.9`).
- Trade-off: The 15% headroom prevents OOM during inference spikes but wastes some VRAM. For production with stable workloads, increasing to 90% (like vLLM) may be acceptable.
- Cache clearing: Always call `torch.cuda.empty_cache()` (and platform equivalents) after each generation to free CUDA memory for the next request.
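The 85% rule can be sketched without any GPU libraries; the `build_max_memory` helper name and the per-GPU free-memory figures below are illustrative, not FastChat API:

```python
def build_max_memory(available_gpu_memory_gib, fraction=0.85):
    """Map each GPU index to a capped memory string, in the shape a
    sequential device map expects, e.g. {0: "20GiB", 1: "13GiB"}."""
    return {
        i: str(int(free * fraction)) + "GiB"
        for i, free in enumerate(available_gpu_memory_gib)
    }

# Hypothetical heterogeneous setup: a 24 GiB card and a 16 GiB card.
max_memory = build_max_memory([24, 16])
print(max_memory)  # {0: '20GiB', 1: '13GiB'}
```

Each device gets its own cap, so a smaller card is never asked to hold more than 85% of its individual free VRAM.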
Reasoning
GPU memory management during LLM inference is critical because:
- KV cache grows linearly with sequence length and batch size during generation
- Different requests may need different amounts of memory depending on prompt length
- Multi-GPU setups often have different available memory per device (due to other processes)
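A back-of-the-envelope KV-cache estimate makes the headroom concern concrete; the model dimensions below are illustrative (roughly a 7B-class model in fp16), not taken from any specific checkpoint:

```python
def kv_cache_bytes(batch, seq_len, layers=32, kv_heads=32,
                   head_dim=128, dtype_bytes=2):
    """2 tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim]."""
    return 2 * layers * batch * kv_heads * seq_len * head_dim * dtype_bytes

gib = kv_cache_bytes(batch=8, seq_len=2048) / 2**30
print(f"{gib:.1f} GiB")  # 8.0 GiB for this configuration
```

The cache grows linearly in both batch size and sequence length, so a few long-context requests can consume several GiB on top of the model weights, which is exactly what the 15% buffer absorbs.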
The 85% allocation with `device_map="sequential"` ensures that:
- Each GPU is filled to its individual capacity (not assuming uniform VRAM)
- A 15% buffer absorbs KV cache growth during generation
- The `sequential` strategy fills devices in order against explicit per-device limits, avoiding the `auto` strategy's balanced placement, which assumes comparable free memory on every device and can fail with mixed GPU sizes
The explicit cache clearing pattern (found in 7+ files) prevents memory fragmentation across requests. The multi-platform approach (CUDA + XPU + NPU) ensures portability.
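The multi-platform pattern can be wrapped in a single duck-typed helper; the function name and `getattr` dispatch are my own sketch (FastChat itself inlines the per-device branches, as the code evidence section shows):

```python
import gc


def clear_device_cache(torch_module, device: str) -> None:
    """Run garbage collection, then ask the named backend (cuda/xpu/npu)
    to release cached allocator blocks, if it exposes empty_cache()."""
    gc.collect()
    backend = getattr(torch_module, device, None)
    if backend is not None and hasattr(backend, "empty_cache"):
        backend.empty_cache()
```

Because the dispatch is attribute-based, the same call site works on builds of PyTorch that lack an `xpu` or `npu` backend: the helper simply no-ops instead of raising.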
Code Evidence
85% allocation strategy from `fastchat/model/model_adapter.py:241-249`:
```python
if num_gpus != 1:
    kwargs["device_map"] = "auto"
    if max_gpu_memory is None:
        kwargs[
            "device_map"
        ] = "sequential"  # This is important for not the same VRAM sizes
        available_gpu_memory = get_gpu_memory(num_gpus)
        kwargs["max_memory"] = {
            i: str(int(available_gpu_memory[i] * 0.85)) + "GiB"
            for i in range(num_gpus)
        }
```
CPU offloading memory calculation from `fastchat/model/model_adapter.py:288-290`:
```python
if "max_memory" in kwargs:
    kwargs["max_memory"]["cpu"] = (
        str(math.floor(psutil.virtual_memory().available / 2**20)) + "Mib"
    )
```
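As a sanity check on the unit math, the `2**20` divisor converts bytes to mebibytes; the free-RAM figure below is hypothetical:

```python
import math

available_bytes = 12 * 2**30  # pretend psutil reported 12 GiB of free RAM
mib = math.floor(available_bytes / 2**20)
print(str(mib) + "MiB")  # 12288MiB: all currently free RAM offered for offload
```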
Multi-platform cache clearing from `fastchat/serve/inference.py:310-316`:
```python
del past_key_values, out
gc.collect()
torch.cuda.empty_cache()
if device == "xpu":
    torch.xpu.empty_cache()
if device == "npu":
    torch.npu.empty_cache()
```
vLLM default 90% utilization from `fastchat/serve/vllm_worker.py:272-280`:
```python
parser.add_argument(
    "--gpu_memory_utilization",
    type=float,
    default=0.9,
)
```