Heuristic: mlc-ai/mlc-llm GPU Memory Budget Tuning
| Knowledge Sources | Details |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
A guide to tuning the memory budget: configuring `gpu_memory_utilization` and understanding how the engine automatically infers KV cache capacity from available GPU memory.
Description
MLC-LLM automatically infers the maximum KV cache token capacity from the available GPU memory, the model parameter size, temporary buffer requirements, and the `gpu_memory_utilization` fraction. The engine estimates memory for model weights, the KV cache (at a fixed per-token byte cost), the KV auxiliary workspace, the model workspace, and the logit processor workspace. The temporary buffer estimate is doubled as a safety margin. When the inferred capacity is insufficient, the engine emits detailed diagnostic messages suggesting remediation steps.
Usage
Apply this heuristic when you hit "insufficient GPU memory" errors, when trying to maximize batch size or context length, or when tuning serving throughput. The key lever is the `gpu_memory_utilization` parameter (default 0.85).
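As a hedged pointer to where the knob lives: recent MLC-LLM versions accept engine-config overrides on the serve command line, e.g. `mlc_llm serve <model> --overrides "gpu_memory_utilization=0.9"`; consult `mlc_llm serve --help` for the exact syntax in your version.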
The Insight (Rule of Thumb)
- Action: Set `gpu_memory_utilization` to control the fraction of GPU memory used by MLC-LLM.
- Value: Default is `0.85` (85%). Range is (0, 1).
- Trade-off: Higher values leave more room for the KV cache (longer contexts, more concurrent requests) but risk OOM from memory fragmentation or from other processes sharing the GPU; lower values are safer but reduce throughput (see the worked example after this list).
- Safety margin: The engine doubles its temporary-buffer estimate, so actual memory usage may still run slightly above the estimate.
- Remediation options when OOM:
- Increase `gpu_memory_utilization` (up to ~0.95)
- Enable tensor parallelism (`--tensor-parallel-shards $NGPU`)
- Use quantization (INT4 reduces model weights by ~4x)
- Reduce `--prefill-chunk-size` (reduces temporary buffer size)
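For example, on a hypothetical 24 GiB GPU serving a 7B-class model at roughly 512 KiB of KV cache per token (see Reasoning below), raising `gpu_memory_utilization` from 0.85 to 0.95 frees an extra 0.10 × 24 GiB = 2.4 GiB, i.e. roughly 4,900 additional tokens of KV cache capacity.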
Reasoning
The memory budget formula is:
Available for KV cache = gpu_size_bytes * gpu_memory_utilization - params_bytes - 2 * temp_buffer_bytes - kv_aux_workspace_bytes - model_workspace_bytes - logit_processor_workspace_bytes
The per-token KV cache cost is: `head_dim * num_kv_heads * (num_layers / pipeline_parallel_stages) * 4 + 1.25` bytes.
The logit processor workspace scales with vocabulary size: `max_num_sequence * vocab_size * 16.125` bytes — meaning large-vocabulary models consume significantly more workspace memory.
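To make the two formulas concrete, here is a minimal self-contained sketch. All model numbers are illustrative assumptions, not MLC-LLM defaults: head_dim = 128, 32 KV heads, 32 layers, one pipeline stage (roughly a 7B-class model), vocab_size = 128256, and max_num_sequence = 256.

```cpp
#include <cstdio>

int main() {
  // Per-token KV cost: head_dim * num_kv_heads * (num_layers / pp_stages) * 4 + 1.25
  double kv_bytes_per_token = 128.0 * 32 * (32 / 1) * 4 + 1.25;  // ~512 KiB
  // Logit processor workspace: max_num_sequence * vocab_size * 16.125
  double logit_ws_bytes = 256.0 * 128256 * 16.125;               // ~505 MiB
  std::printf("kv/token: %.2f KiB, logit workspace: %.1f MiB\n",
              kv_bytes_per_token / 1024, logit_ws_bytes / (1024.0 * 1024));
}
```

The ~505 MiB logit workspace illustrates the point above: a large-vocabulary model at high concurrency can consume half a gigabyte before a single KV cache token is allocated.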
```cpp
// From config.cc:714-721
int64_t model_max_total_sequence_length =
    static_cast<int>((gpu_size_bytes * gpu_memory_utilization
                      - params_bytes
                      - temp_buffer_bytes  // already doubled below for safety
                      - kv_aux_workspace_bytes
                      - model_workspace_bytes
                      - logit_processor_workspace_bytes) /
                     kv_bytes_per_token);

// Temp buffer safety margin from config.cc:849
temp_buffer_bytes *= 2;
```
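Putting the pieces together, the following hypothetical sketch runs the same arithmetic end to end. Only the formulas come from the source above; every constant (a 24 GiB GPU, a ~7B-parameter fp16 model, and the guessed temporary-buffer and workspace sizes) is an illustrative assumption.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const double GiB = 1024.0 * 1024.0 * 1024.0;
  double gpu_size_bytes = 24 * GiB;         // assumed total device memory
  double gpu_memory_utilization = 0.85;     // the tunable fraction (default)
  double params_bytes = 13.0 * GiB;         // ~7e9 params * 2 bytes (fp16)
  double temp_buffer_bytes = 1.0 * GiB;     // assumed raw estimate
  temp_buffer_bytes *= 2;                   // 2x safety margin, as in config.cc
  double kv_aux_workspace_bytes = 0.2 * GiB;           // assumed
  double model_workspace_bytes = 0.3 * GiB;            // assumed
  double logit_processor_workspace_bytes = 0.5 * GiB;  // assumed
  // Per-token KV cost for a 7B-class model (see the formula above): ~512 KiB
  double kv_bytes_per_token = 128.0 * 32 * 32 * 4 + 1.25;

  double budget = gpu_size_bytes * gpu_memory_utilization - params_bytes -
                  temp_buffer_bytes - kv_aux_workspace_bytes -
                  model_workspace_bytes - logit_processor_workspace_bytes;
  int64_t max_total_tokens = static_cast<int64_t>(budget / kv_bytes_per_token);
  // With these assumptions: ~4.4 GiB of KV budget -> ~9,000 tokens
  std::printf("KV budget: %.2f GiB -> ~%lld tokens\n", budget / GiB,
              static_cast<long long>(max_total_tokens));
}
```

Rerunning the sketch with `gpu_memory_utilization = 0.95` shows the trade-off from the Insight section directly: the budget grows by 2.4 GiB and the token capacity by roughly 4,900.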