Heuristic: LMSYS FastChat GPU Memory Allocation Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
Memory management strategy that allocates 85% of available GPU VRAM per device for multi-GPU inference and uses sequential device mapping for heterogeneous GPU setups.
Description
FastChat's model loading uses a conservative 85% GPU memory cap to prevent CUDA OOM errors during inference. When multiple GPUs are available and no explicit memory limit is set, it queries each GPU's available memory, caps model placement at 85% of it, and uses `device_map="sequential"` (instead of `"auto"`) to fill GPUs one at a time. This handles heterogeneous GPU configurations where GPUs have different VRAM sizes. After each inference request, all device caches are explicitly cleared (CUDA, XPU, NPU) and Python garbage collection is run.
Usage
Use this heuristic when deploying models across multiple GPUs or when tuning GPU memory allocation for inference workers. The 85% limit provides a safety margin for intermediate activations during generation while maximizing model capacity.
The Insight (Rule of Thumb)
- Action: Let FastChat auto-detect available GPU memory and allocate 85%, or set explicit limits via `--max-gpu-memory`.
- Value: 85% of available VRAM per GPU (0.85 multiplier); vLLM and SGLang workers default to 90% (`--gpu_memory_utilization 0.9`).
- Trade-off: The 15% headroom prevents OOM during inference spikes but wastes some VRAM. For production with stable workloads, increasing to 90% (like vLLM) may be acceptable.
- Cache clearing: Always call `torch.cuda.empty_cache()` (and platform equivalents) after each generation to free CUDA memory for the next request.
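The 85% rule can be sketched without any GPU libraries; the `build_max_memory` helper name and the per-GPU free-memory figures below are illustrative, not FastChat API:

```python
def build_max_memory(available_gpu_memory_gib, fraction=0.85):
    """Map each GPU index to a capped memory string, in the shape a
    sequential device map expects, e.g. {0: "20GiB", 1: "13GiB"}."""
    return {
        i: str(int(free * fraction)) + "GiB"
        for i, free in enumerate(available_gpu_memory_gib)
    }

# Hypothetical heterogeneous setup: a 24 GiB card and a 16 GiB card.
max_memory = build_max_memory([24, 16])
print(max_memory)  # {0: '20GiB', 1: '13GiB'}
```

Each device gets its own cap, so a smaller card is never asked to hold more than 85% of its individual free VRAM.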
Reasoning
GPU memory management during LLM inference is critical because:
- KV cache grows linearly with sequence length and batch size during generation
- Different requests may need different amounts of memory depending on prompt length
- Multi-GPU setups often have different available memory per device (due to other processes)
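A back-of-the-envelope KV-cache estimate makes the headroom concern concrete; the model dimensions below are illustrative (roughly a 7B-class model in fp16), not taken from any specific checkpoint:

```python
def kv_cache_bytes(batch, seq_len, layers=32, kv_heads=32,
                   head_dim=128, dtype_bytes=2):
    """2 tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim]."""
    return 2 * layers * batch * kv_heads * seq_len * head_dim * dtype_bytes

gib = kv_cache_bytes(batch=8, seq_len=2048) / 2**30
print(f"{gib:.1f} GiB")  # 8.0 GiB for this configuration
```

The cache grows linearly in both batch size and sequence length, so a few long-context requests can consume several GiB on top of the model weights, which is exactly what the 15% buffer absorbs.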
The 85% allocation with `device_map="sequential"` ensures that:
- Each GPU is filled to its individual capacity (not assuming uniform VRAM)
- A 15% buffer absorbs KV cache growth during generation
- The `sequential` strategy fills devices in order against explicit per-device limits, avoiding the `auto` strategy's balanced placement, which assumes comparable free memory on every device and can fail with mixed GPU sizes
The explicit cache clearing pattern (found in 7+ files) prevents memory fragmentation across requests. The multi-platform approach (CUDA + XPU + NPU) ensures portability.
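The multi-platform pattern can be wrapped in a single duck-typed helper; the function name and `getattr` dispatch are my own sketch (FastChat itself inlines the per-device branches, as the code evidence section shows):

```python
import gc


def clear_device_cache(torch_module, device: str) -> None:
    """Run garbage collection, then ask the named backend (cuda/xpu/npu)
    to release cached allocator blocks, if it exposes empty_cache()."""
    gc.collect()
    backend = getattr(torch_module, device, None)
    if backend is not None and hasattr(backend, "empty_cache"):
        backend.empty_cache()
```

Because the dispatch is attribute-based, the same call site works on builds of PyTorch that lack an `xpu` or `npu` backend: the helper simply no-ops instead of raising.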
Code Evidence
85% allocation strategy from `fastchat/model/model_adapter.py:241-249`:
```python
if num_gpus != 1:
    kwargs["device_map"] = "auto"
    if max_gpu_memory is None:
        kwargs[
            "device_map"
        ] = "sequential"  # This is important for not the same VRAM sizes
        available_gpu_memory = get_gpu_memory(num_gpus)
        kwargs["max_memory"] = {
            i: str(int(available_gpu_memory[i] * 0.85)) + "GiB"
            for i in range(num_gpus)
        }
```
CPU offloading memory calculation from `fastchat/model/model_adapter.py:288-290`:
```python
if "max_memory" in kwargs:
    kwargs["max_memory"]["cpu"] = (
        str(math.floor(psutil.virtual_memory().available / 2**20)) + "Mib"
    )
```
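As a sanity check on the unit math, the `2**20` divisor converts bytes to mebibytes; the free-RAM figure below is hypothetical:

```python
import math

available_bytes = 12 * 2**30  # pretend psutil reported 12 GiB of free RAM
mib = math.floor(available_bytes / 2**20)
print(str(mib) + "MiB")  # 12288MiB: all currently free RAM offered for offload
```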
Multi-platform cache clearing from `fastchat/serve/inference.py:310-316`:
```python
del past_key_values, out
gc.collect()
torch.cuda.empty_cache()
if device == "xpu":
    torch.xpu.empty_cache()
if device == "npu":
    torch.npu.empty_cache()
```
vLLM default 90% utilization from `fastchat/serve/vllm_worker.py:272-280`:
```python
parser.add_argument(
    "--gpu_memory_utilization",
    type=float,
    default=0.9,
)
```