Heuristic:Mlc ai Mlc llm Metal KV Cache Capacity Limit

From Leeroopedia



Domains: Optimization, Memory_Management, Mobile
Last Updated: 2026-02-09 19:00 GMT

Overview

MLC-LLM's Apple Metal backend hard-caps KV cache capacity at 32768 tokens to work around severe performance degradation with large GPU buffers.

Description

The Metal GPU runtime has severe performance issues with very large GPU buffers. To work around this, MLC-LLM caps the maximum total sequence length at 32768 tokens on Metal devices, regardless of how much GPU memory is available. This affects both macOS (Apple Silicon) and iOS deployments. The cap is applied after the normal memory budget calculation, so it is enforced even when a device has enough memory for a longer context. If the memory budget already yields fewer than 32768 tokens, the smaller value stands.

Usage

Be aware of this limitation when deploying models requiring long context windows (>32K tokens) on Apple hardware. If long-context support is critical, CUDA-based deployment is the better choice.

The Insight (Rule of Thumb)

  • Action: Accept the 32768-token KV cache capacity limit on Metal devices, or switch to a CUDA-based deployment for longer contexts.
  • Value: Hard cap of 32768 tokens maximum total sequence length.
  • Trade-off: Prevents severe performance degradation on Metal at the cost of reduced context window capacity.
  • Workaround: For contexts longer than 32768 tokens, use a CUDA deployment. Quantization that shrinks the per-token KV cache size does not raise the cap; it only helps when available memory, rather than the cap, is what limits the sequence length.
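
The interaction between the memory budget and the cap can be made concrete with a little arithmetic. The sketch below is an illustration only, not MLC-LLM code: `KVShape`, `KVBytesPerToken`, and `MetalTokenCapacity` are hypothetical names, the model shape is invented, and the estimate ignores paging and any other runtime overhead. It relies on the standard estimate that each token stores one key and one value vector per layer and per KV head.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model shape, used only for this illustration.
struct KVShape {
  int64_t num_layers;
  int64_t num_kv_heads;
  int64_t head_dim;
  int64_t bytes_per_element;  // e.g. 2 for fp16, 1 for an 8-bit KV cache
};

// Bytes of KV cache per token: one K and one V vector
// for every layer and every KV head.
int64_t KVBytesPerToken(const KVShape& s) {
  return 2 * s.num_layers * s.num_kv_heads * s.head_dim * s.bytes_per_element;
}

// Tokens that fit in the given KV cache budget, followed by the
// 32768-token Metal cap (the cap value comes from the source snippet).
int64_t MetalTokenCapacity(const KVShape& s, int64_t kv_budget_bytes) {
  assert(kv_budget_bytes >= 0);
  const int64_t fits_in_budget = kv_budget_bytes / KVBytesPerToken(s);
  return std::min<int64_t>(fits_in_budget, 32768);
}
```

With an assumed shape of 32 layers, 8 KV heads, and head dimension 128, an fp16 cache needs 131072 bytes per token, so a 2 GiB budget holds 16384 tokens and the cap is not reached. Halving the element size doubles that to 32768, which is exactly the cap. With an 8 GiB budget, both element sizes end up at 32768, because the cap rather than memory is the binding limit.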

Reasoning

Apple's Metal GPU runtime has known performance issues with large buffers. When the KV cache exceeds a certain size, memory management overhead (likely related to Metal's unified memory architecture and buffer paging) causes significant slowdowns. The 32768-token cap appears to be an empirically chosen safe limit that avoids these performance cliffs while still supporting most practical use cases (32K is a common context window for many models).

// From config.cc:746-751
if (device.device_type == DLDeviceType::kDLMetal) {
    // NOTE: Metal runtime has severe performance issues with large buffers.
    // To work around the issue, we limit the KV cache capacity to 32768.
    model_max_total_sequence_length =
        std::min(model_max_total_sequence_length, static_cast<int64_t>(32768));
}
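
The clamp can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `Backend` enum in place of `DLDeviceType` and a hypothetical `budget_tokens` argument standing in for the result of the memory budget calculation; only the value 32768 and the `std::min` clamp come from the snippet above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the device type check in the real code.
enum class Backend { kCuda, kMetal };

// Cap value taken from the source snippet.
constexpr int64_t kMetalKVCacheCap = 32768;

// budget_tokens: hypothetical output of the memory budget calculation,
// i.e. the sequence length the device memory could otherwise support.
int64_t EffectiveMaxTotalSeqLen(Backend backend, int64_t budget_tokens) {
  assert(budget_tokens > 0);
  if (backend == Backend::kMetal) {
    // The cap only ever lowers the value; smaller budgets pass through.
    return std::min(budget_tokens, kMetalKVCacheCap);
  }
  return budget_tokens;
}
```

Under these assumptions, a Metal device whose budget allows 131072 tokens is reduced to 32768, a Metal device whose budget allows 8192 tokens keeps 8192, and a non-Metal backend is left unchanged.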
