Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Mlc ai Mlc llm Prefix Cache

From Leeroopedia


Knowledge Sources
Domains LLM Serving, KV Cache Management, Memory Optimization
Last Updated 2026-02-09 19:00 GMT

Overview

PrefixCache implements a radix-tree-based prefix caching system for the MLC LLM serving engine that enables reuse of KV cache data across requests sharing common token prefixes, reducing redundant prefill computation.

Description

The prefix_cache.cc file provides two implementations of the PrefixCacheObj interface in the mlc::llm::serve namespace: a full-featured PrefixCacheImpl using a paged radix tree, and a no-op NoPrefixCache for deployments where prefix caching is disabled.

PrefixCacheImpl manages sequence lifecycle through two states tracked via SequenceState:

  • kActive: The sequence is currently being used for inference. Active sequences can be forked but not reused.
  • kRecycling: The sequence has completed but its KV cache data is retained for potential future reuse. Recycling sequences can be either reused (promoted back to active) or evicted when memory is needed.

The core data structure is a PagedRadixTree that stores token sequences and supports efficient prefix matching via MatchPrefix. The class maintains bidirectional LRU maps (recycling_seq_lrus_ and reversed_recycling_seq_lrus_) for eviction ordering, and per-sequence sliding window information.

InsertSequence is the primary method for adding a new sequence to the cache. It commits any pending extensions, strips the last token (since it has not been prefilled yet), and then attempts prefix matching. The matching logic differs based on whether sliding window attention is enabled:

  • With sliding window: Only exact matches against recycling sequences with identical sliding window parameters are accepted, since rolling back is not safe under sliding window constraints.
  • Without sliding window: The method greedily selects the shortest recycling sequence whose length overlaps at least 90% with the matched prefix. If a recycling sequence is longer than the match, it rolls back the excess trailing tokens. If no suitable recycling sequence is found, it attempts to fork from the longest matching active sequence that does not use sliding window. As a final fallback, a new sequence is created.

The method returns a PrefixCacheMatchedResult containing four fields: prefilled_offset (how many tokens are already prefilled), forked_seq_id (if forked from an active sequence), reused_seq_id (if a recycling sequence was reused), and reused_seq_pop_last_tokens (how many tokens were rolled back from the reused sequence).

ExtendSequence lazily accumulates new tokens for a sequence without immediately updating the radix tree. The actual commitment happens in CommitSequenceExtention, which is designed to be called at strategic points to overlap radix tree updates with GPU computation.

RecycleSequence transitions an active sequence to the recycling state. If the maximum number of recycling sequences is reached, the oldest one (by LRU counter) is evicted first. If lazy is false or the maximum recycling count is zero, the sequence is removed immediately. A remove_callback is invoked upon sequence removal, enabling the KV cache and ID manager to clean up their state.

TryFreeMemory evicts the oldest recycling sequence to reclaim KV cache pages.

NoPrefixCache is a trivial implementation where all operations are no-ops or return empty results. InsertSequence always returns a result indicating a new sequence with no prefix match, HasSequence always returns false, and Mode() returns PrefixCacheMode::kDisable.

Usage

Use PrefixCache in the MLC LLM serving engine to reduce redundant prefill computation when multiple requests share common prefixes (e.g., system prompts, few-shot examples). Create it via PrefixCache::CreateRadixPrefixCache for the full implementation or PrefixCache::CreateNoPrefixCache to disable caching.

Code Reference

Source Location

Signature

class PrefixCacheImpl : public PrefixCacheObj {
public:
  explicit PrefixCacheImpl(size_t max_num_recycling_seqs,
                           PrefixCacheRemoveCallback remove_callback);

  PrefixCacheMatchedResult InsertSequence(
      int64_t seq_id, std::vector<int32_t> tokens,
      int sliding_window_size, int attention_sink_size) final;
  void ExtendSequence(int64_t seq_id,
                      const std::vector<int32_t>& tokens) final;
  void CommitSequenceExtention() final;
  void RollBackSequence(int64_t seq_id, size_t num_tokens) final;
  void RecycleSequence(int64_t seq_id, bool lazy = true) final;
  bool TryFreeMemory() final;
  bool HasSequence(int64_t seq_id) final;
  void Reset() final;
  PrefixCacheMode Mode() final;
};

class NoPrefixCache : public PrefixCacheObj {
  // All methods are no-ops or return empty/false results
  PrefixCacheMode Mode() final; // Returns PrefixCacheMode::kDisable
};

// Factory methods
static PrefixCache PrefixCache::CreateRadixPrefixCache(
    size_t max_num_recycling_seqs,
    PrefixCacheRemoveCallback remove_callback);
static PrefixCache PrefixCache::CreateNoPrefixCache();

Import

#include "prefix_cache.h"

I/O Contract

Inputs

Name Type Required Description
max_num_recycling_seqs size_t Yes Maximum number of recycling sequences to retain in the cache
remove_callback PrefixCacheRemoveCallback No Optional callback invoked when a sequence is fully removed (for KV cache cleanup)
seq_id int64_t Yes Unique sequence identifier
tokens std::vector<int32_t> Yes (Insert/Extend) Token IDs for the sequence or extension suffix
sliding_window_size int Yes (Insert) Sliding window size (-1 if disabled)
attention_sink_size int Yes (Insert) Number of attention sink tokens (0 by default)
num_tokens size_t Yes (RollBack) Number of tokens to roll back from the end
lazy bool No Whether to recycle lazily (true) or remove immediately (false)

Outputs

Name Type Description
PrefixCacheMatchedResult struct Contains prefilled_offset, forked_seq_id, reused_seq_id, and reused_seq_pop_last_tokens
bool bool HasSequence returns existence check; TryFreeMemory returns whether memory was freed
PrefixCacheMode enum Mode() returns kRadix or kDisable

Usage Examples

// Create a prefix cache with up to 64 recycling sequences
auto remove_cb = [&](int64_t seq_id) {
  model->RemoveSequence(seq_id);
  id_manager.RecycleId(seq_id);
};
PrefixCache cache = PrefixCache::CreateRadixPrefixCache(64, remove_cb);

// Insert a new sequence and check for prefix matches
std::vector<int32_t> tokens = tokenizer.Encode(prompt);
PrefixCacheMatchedResult result = cache->InsertSequence(
    seq_id, tokens, /*sliding_window_size=*/-1, /*attention_sink_size=*/0);

if (result.prefilled_offset > 0) {
  if (result.forked_seq_id != -1) {
    // Fork from an active sequence
    model->ForkSequence(result.forked_seq_id, seq_id,
                        result.prefilled_offset);
  } else if (result.reused_seq_id != -1) {
    // Reuse a recycling sequence
    // Reassign internal ID and pop excess tokens if needed
  }
  // Skip prefilling the first result.prefilled_offset tokens
}

// When a request completes, recycle the sequence
cache->RecycleSequence(seq_id, /*lazy=*/true);

// When memory is low, free the oldest recycling sequence
if (cache->TryFreeMemory()) {
  // Memory was freed successfully
}

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment