Heuristic:Mlc ai Mlc llm Metal KV Cache Capacity Limit
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management, Mobile |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
The Apple Metal backend hard-caps KV cache capacity at 32768 tokens to work around severe performance degradation with large GPU buffers.
Description
The Metal GPU runtime experiences severe performance issues when allocating very large contiguous GPU buffers. To prevent this, MLC-LLM applies a hard cap of 32768 tokens to the maximum total sequence length on Metal devices, regardless of the available GPU memory. This affects both macOS (Apple Silicon) and iOS deployments. The cap is applied after the normal memory budget calculation, meaning even if a device has sufficient memory for a longer context, the limit is still enforced.
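The ordering described above (memory budget first, Metal cap second) can be sketched as follows. This is a minimal illustration, not the MLC-LLM implementation: `Backend`, `CapacityFromMemoryBudget`, `MaxTotalSequenceLength`, and `kMetalKVCacheTokenCap` are hypothetical names, and the budget calculation is reduced to a single division.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical backend tag for illustration; the real code checks the
// DLPack device type instead.
enum class Backend { kCUDA, kMetal };

// Cap value taken from the description above.
constexpr int64_t kMetalKVCacheTokenCap = 32768;

// Simplified stand-in for the memory budget calculation: the number of
// tokens whose KV entries fit in the bytes reserved for the KV cache.
int64_t CapacityFromMemoryBudget(int64_t kv_budget_bytes,
                                 int64_t kv_bytes_per_token) {
  assert(kv_budget_bytes >= 0 && kv_bytes_per_token > 0);
  return kv_budget_bytes / kv_bytes_per_token;
}

// Computes the budget-derived capacity first, then clamps it on Metal.
int64_t MaxTotalSequenceLength(Backend backend, int64_t kv_budget_bytes,
                               int64_t kv_bytes_per_token) {
  int64_t capacity =
      CapacityFromMemoryBudget(kv_budget_bytes, kv_bytes_per_token);
  if (backend == Backend::kMetal) {
    capacity = std::min(capacity, kMetalKVCacheTokenCap);
  }
  return capacity;
}
```

With an assumed 16 GiB KV budget and 131072 bytes per token, the budget alone would allow 131072 tokens; the sketch returns that value for `Backend::kCUDA` and 32768 for `Backend::kMetal`. With a 2 GiB budget, both backends return 16384, because the memory budget is the tighter limit.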
Usage
Be aware of this limitation when deploying models requiring long context windows (>32K tokens) on Apple hardware. If long-context support is critical, CUDA-based deployment is the better choice.
The Insight (Rule of Thumb)
- Action: Accept the 32768-token KV cache capacity limit on Metal devices, or switch to a CUDA-based deployment for longer contexts.
- Value: Hard cap of 32768 tokens maximum total sequence length.
- Trade-off: Prevents severe performance degradation on Metal at the cost of reduced context window capacity.
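- Workaround: For contexts longer than 32768 tokens, deploy on CUDA. Quantization that reduces the per-token KV cache size does not raise the cap, because the cap counts tokens rather than bytes; it only helps a memory-constrained device reach a longer usable sequence length within the cap.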
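To make the relationship between per-token KV cache size and the token cap concrete, the following sketch estimates KV cache memory with the usual formula (keys plus values, across all layers and KV heads). The `KVShape` struct, the function names, and the model shape used in the note below are illustrative assumptions, not values read from MLC-LLM.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a model's attention layout.
struct KVShape {
  int64_t num_layers;
  int64_t num_kv_heads;
  int64_t head_dim;
};

// Bytes of KV cache needed per token: one key vector and one value vector
// per KV head per layer, each element stored in bytes_per_element bytes.
int64_t KVBytesPerToken(const KVShape& s, int64_t bytes_per_element) {
  assert(bytes_per_element > 0);
  return 2 * s.num_layers * s.num_kv_heads * s.head_dim * bytes_per_element;
}

// Total KV cache bytes when the cache holds exactly token_cap tokens.
int64_t KVBytesAtCap(const KVShape& s, int64_t bytes_per_element,
                     int64_t token_cap = 32768) {
  return KVBytesPerToken(s, bytes_per_element) * token_cap;
}
```

For an assumed shape of 32 layers, 8 KV heads, and a head dimension of 128, 2-byte elements give 131072 bytes per token, so a full 32768-token cache occupies 4 GiB. Storing 1-byte elements halves that to 2 GiB, which lets a smaller device reach the cap, but the token limit itself stays at 32768.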
Reasoning
Apple's Metal GPU runtime has known performance issues with large buffer allocations. When the KV cache exceeds a certain size, memory management overhead (likely related to Metal's unified memory architecture and buffer paging) causes significant slowdowns. The source comment does not say how 32768 was chosen; it was presumably picked as a safe limit that avoids these performance cliffs while still supporting most practical use cases (32K is a common context window for many models).
// From config.cc:746-751
if (device.device_type == DLDeviceType::kDLMetal) {
  // NOTE: Metal runtime has severe performance issues with large buffers.
  // To work around the issue, we limit the KV cache capacity to 32768.
  model_max_total_sequence_length =
      std::min(model_max_total_sequence_length, static_cast<int64_t>(32768));
}