Heuristic: mlc-ai/mlc-llm GPU Memory Budget Tuning
| Knowledge Sources | Details |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
A guide to tuning the memory budget: configuring `gpu_memory_utilization` and understanding how the engine automatically infers KV cache capacity from available GPU memory.
Description
MLC-LLM automatically infers the maximum KV cache token capacity from the available GPU memory, the model parameter size, temporary buffer requirements, and the `gpu_memory_utilization` fraction. The engine estimates memory for model weights, the KV cache (at a fixed per-token byte cost), the KV auxiliary workspace, the model workspace, and the logit processor workspace. The temporary buffer estimate is doubled as a safety margin. When the inferred capacity is insufficient, the engine emits detailed diagnostic messages suggesting remediation steps.
Usage
Apply this heuristic when you hit "insufficient GPU memory" errors, when trying to maximize batch size or context length, or when tuning serving throughput. The key lever is the `gpu_memory_utilization` parameter (default 0.85).
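As a hedged pointer to where the knob lives: recent MLC-LLM versions accept engine-config overrides on the serve command line, e.g. `mlc_llm serve <model> --overrides "gpu_memory_utilization=0.9"`; consult `mlc_llm serve --help` for the exact syntax in your version.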
The Insight (Rule of Thumb)
- Action: Set `gpu_memory_utilization` to control the fraction of GPU memory used by MLC-LLM.
- Value: Default is `0.85` (85%). Range is (0, 1).
- Trade-off: Higher values leave more room for the KV cache (longer contexts, more concurrent requests) but risk OOM from memory fragmentation or from other processes sharing the GPU; lower values are safer but reduce throughput (see the worked example after this list).
- Safety margin: The engine doubles its temporary-buffer estimate, so actual memory usage may still run slightly above the estimate.
- Remediation options when OOM:
- Increase `gpu_memory_utilization` (up to ~0.95)
- Enable tensor parallelism (`--tensor-parallel-shards $NGPU`)
- Use quantization (INT4 reduces model weights by ~4x)
- Reduce `--prefill-chunk-size` (reduces temporary buffer size)
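For example, on a hypothetical 24 GiB GPU serving a 7B-class model at roughly 512 KiB of KV cache per token (see Reasoning below), raising `gpu_memory_utilization` from 0.85 to 0.95 frees an extra 0.10 × 24 GiB = 2.4 GiB, i.e. roughly 4,900 additional tokens of KV cache capacity.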
Reasoning
The memory budget formula is:
Available for KV cache = gpu_size_bytes * gpu_memory_utilization - params_bytes - 2 * temp_buffer_bytes - kv_aux_workspace_bytes - model_workspace_bytes - logit_processor_workspace_bytes
The per-token KV cache cost is: `head_dim * num_kv_heads * (num_layers / pipeline_parallel_stages) * 4 + 1.25` bytes.
The logit processor workspace scales with vocabulary size: `max_num_sequence * vocab_size * 16.125` bytes — meaning large-vocabulary models consume significantly more workspace memory.
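To make the two formulas concrete, here is a minimal self-contained sketch. All model numbers are illustrative assumptions, not MLC-LLM defaults: head_dim = 128, 32 KV heads, 32 layers, one pipeline stage (roughly a 7B-class model), vocab_size = 128256, and max_num_sequence = 256.

```cpp
#include <cstdio>

int main() {
  // Per-token KV cost: head_dim * num_kv_heads * (num_layers / pp_stages) * 4 + 1.25
  double kv_bytes_per_token = 128.0 * 32 * (32 / 1) * 4 + 1.25;  // ~512 KiB
  // Logit processor workspace: max_num_sequence * vocab_size * 16.125
  double logit_ws_bytes = 256.0 * 128256 * 16.125;               // ~505 MiB
  std::printf("kv/token: %.2f KiB, logit workspace: %.1f MiB\n",
              kv_bytes_per_token / 1024, logit_ws_bytes / (1024.0 * 1024));
}
```

The ~505 MiB logit workspace illustrates the point above: a large-vocabulary model at high concurrency can consume half a gigabyte before a single KV cache token is allocated.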
```cpp
// From config.cc:714-721
int64_t model_max_total_sequence_length =
    static_cast<int>((gpu_size_bytes * gpu_memory_utilization
                      - params_bytes
                      - temp_buffer_bytes  // already doubled below for safety
                      - kv_aux_workspace_bytes
                      - model_workspace_bytes
                      - logit_processor_workspace_bytes) /
                     kv_bytes_per_token);

// Temp buffer safety margin from config.cc:849
temp_buffer_bytes *= 2;
```
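Putting the pieces together, the following hypothetical sketch runs the same arithmetic end to end. Only the formulas come from the source above; every constant (a 24 GiB GPU, a ~7B-parameter fp16 model, and the guessed temporary-buffer and workspace sizes) is an illustrative assumption.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const double GiB = 1024.0 * 1024.0 * 1024.0;
  double gpu_size_bytes = 24 * GiB;         // assumed total device memory
  double gpu_memory_utilization = 0.85;     // the tunable fraction (default)
  double params_bytes = 13.0 * GiB;         // ~7e9 params * 2 bytes (fp16)
  double temp_buffer_bytes = 1.0 * GiB;     // assumed raw estimate
  temp_buffer_bytes *= 2;                   // 2x safety margin, as in config.cc
  double kv_aux_workspace_bytes = 0.2 * GiB;           // assumed
  double model_workspace_bytes = 0.3 * GiB;            // assumed
  double logit_processor_workspace_bytes = 0.5 * GiB;  // assumed
  // Per-token KV cost for a 7B-class model (see the formula above): ~512 KiB
  double kv_bytes_per_token = 128.0 * 32 * 32 * 4 + 1.25;

  double budget = gpu_size_bytes * gpu_memory_utilization - params_bytes -
                  temp_buffer_bytes - kv_aux_workspace_bytes -
                  model_workspace_bytes - logit_processor_workspace_bytes;
  int64_t max_total_tokens = static_cast<int64_t>(budget / kv_bytes_per_token);
  // With these assumptions: ~4.4 GiB of KV budget -> ~9,000 tokens
  std::printf("KV budget: %.2f GiB -> ~%lld tokens\n", budget / GiB,
              static_cast<long long>(max_total_tokens));
}
```

Rerunning the sketch with `gpu_memory_utilization = 0.95` shows the trade-off from the Insight section directly: the budget grows by 2.4 GiB and the token capacity by roughly 4,900.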