

Heuristic: mlc-ai/mlc-llm FlashInfer KV Cache Fallback

From Leeroopedia




Knowledge Sources
Domains Optimization, Attention_Mechanism
Last Updated 2026-02-09 19:00 GMT

Overview

Decision logic for when the FlashInfer-based PagedKVCache is used versus the TIR-based fallback, based on target platform, data type, and RoPE configuration.

Description

MLC-LLM compiles two KV cache implementations: a FlashInfer-based version and a TIR-based version. At runtime, the engine selects FlashInfer when available, falling back to TIR otherwise. The compilation pass `DispatchKVCacheCreation` determines whether FlashInfer kernels can be generated based on several conditions. If any condition fails, the FlashInfer cache is silently excluded and only the TIR implementation is available. Understanding these conditions prevents confusion when performance differs across configurations.

Usage

Apply this heuristic when debugging performance differences between model configurations, or when deciding whether to invest in an Ampere+ GPU for a deployment. FlashInfer attention provides significant speedups (roughly 2-3x over the TIR fallback) but has strict requirements.

The Insight (Rule of Thumb)

FlashInfer PagedKVCache is only generated when ALL of these conditions are met:

  • FlashInfer enabled: The compilation was done with `flashinfer=True` (set automatically at optimization levels O2 or O3)
  • CUDA target: The compilation target is `cuda` (not Metal, OpenCL, Vulkan, or CPU)
  • Supported dtype: The KV cache data type is `float16` or `bfloat16` (not float32 or int8)
  • RoPE compatibility: If using inline RoPE mode, `rotary_dim` must equal `qk_head_dim` and `qk_head_dim` must equal `v_head_dim`

If any condition fails, the model silently falls back to TIR-based attention with no error message — only an info-level log.

  • Trade-off: FlashInfer provides optimized GPU-native attention but is less portable. TIR attention works everywhere but is slower on CUDA.
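The conditions above can be collapsed into a single predicate. The sketch below is illustrative only: the function name `flashinfer_supported` and this `RopeMode` enum are assumptions for the example, not the library's actual API (the real check lives in `DispatchKVCacheCreation`).

```python
from enum import Enum


class RopeMode(Enum):
    # Illustrative stand-in for the RoPE modes referenced in the dispatch logic.
    NONE = 0
    NORMAL = 1
    INLINE = 2


def flashinfer_supported(flashinfer_enabled, target_kind, dtype,
                         rope_mode, rotary_dim, qk_head_dim, v_head_dim):
    """Return True only when every FlashInfer dispatch condition holds."""
    if not flashinfer_enabled:          # compiled without flashinfer=True
        return False
    if target_kind != "cuda":           # Metal/Vulkan/OpenCL/CPU fall back to TIR
        return False
    if dtype not in ("float16", "bfloat16"):  # float32/int8 caches unsupported
        return False
    if rope_mode is RopeMode.INLINE and (
        rotary_dim != qk_head_dim or qk_head_dim != v_head_dim
    ):                                  # inline RoPE requires uniform head dims
        return False
    return True


# A typical Llama-style config on CUDA passes every check.
ok = flashinfer_supported(True, "cuda", "float16", RopeMode.INLINE, 128, 128, 128)
```

Note that the head-dimension constraint only applies in inline RoPE mode; with `RopeMode.NORMAL`, a partial `rotary_dim` does not by itself disable FlashInfer.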

Reasoning

FlashInfer is a specialized CUDA library for attention computation that uses GPU-specific instructions not available on other backends. Its KV cache implementation has strict type requirements (FP16/BF16 only) because the CUDA kernels use specialized half-precision tensor core instructions. The RoPE constraint exists because FlashInfer's inline RoPE implementation assumes uniform head dimensions.
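To see what the uniform-dimension assumption rules out, consider partial RoPE, where only the first `rotary_dim` dimensions of each head are rotated and the tail passes through unchanged. The NumPy sketch below is a minimal reference implementation for illustration, not FlashInfer's kernel; real kernels vectorize this and may pair dimensions differently.

```python
import numpy as np


def apply_rope(x: np.ndarray, pos: int, rotary_dim: int,
               base: float = 10000.0) -> np.ndarray:
    """Rotate the first `rotary_dim` dims of a head vector in adjacent pairs."""
    out = x.astype(np.float64).copy()
    for i in range(0, rotary_dim, 2):
        theta = pos * base ** (-i / rotary_dim)
        c, s = np.cos(theta), np.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


head = np.arange(8, dtype=np.float64)
partial = apply_rope(head, pos=3, rotary_dim=4)  # dims 4..7 pass through untouched
full = apply_rope(head, pos=3, rotary_dim=8)     # every dim is rotated
```

When `rotary_dim < qk_head_dim`, the kernel would have to treat the rotated prefix and the untouched tail differently; an inline implementation that assumes the whole head is rotated cannot express this, which is why such configs fall back to TIR.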

The fallback is graceful — errors during FlashInfer KV cache creation are caught and logged at info level, and the model continues with TIR-based KV cache. This means a misconfigured environment silently loses performance rather than failing.

# From dispatch_kv_cache_creation.py:187-199
def create_flashinfer_paged_kv_cache(self, bb, kwargs):
    # Filter the cases which FlashInfer does not support.
    if (
        not self.flashinfer
        or self.target.kind.name != "cuda"
        or str(kwargs["dtype"]) not in ["float16", "bfloat16"]
        or (
            kwargs["rope_mode"] == RopeMode.INLINE
            and (
                kwargs["rotary_dim"] != kwargs["qk_head_dim"]
                or kwargs["qk_head_dim"] != kwargs["v_head_dim"]
            )
        )
    ):
        return []

# Error handling with graceful fallback - dispatch_kv_cache_creation.py:229-235
except Exception as e:
    logger.info(
        "Error caught when creating FlashInfer PagedKVCache: %s\n"
        "The model will fallback to TIR-based KV cache.",
        e,
    )
    return []
