Implementation:Mlc ai Mlc llm Prefix Cache Header

Overview

The file cpp/serve/prefix_cache.h declares the PrefixCache abstraction for the MLC-LLM serving engine. A prefix cache enables KV cache reuse across requests that share common input prefixes, reducing redundant computation during the prefill stage. The design supports two modes: a radix-tree-based prefix cache and a no-op (disabled) cache, both conforming to the same abstract interface.

File Location

cpp/serve/prefix_cache.h

Dependencies

Header	Purpose
`tvm/ffi/container/shape.h`	TVM shape container types
`tvm/ffi/reflection/registry.h`	TVM object reflection registration
`tvm/runtime/object.h`	TVM Object base class
`<functional>`	`std::function` for callback types
`<optional>`	`std::optional` support
`<unordered_map>`, `<unordered_set>`	Hash-based containers
`model.h`	Model interface definitions
`radix_tree.h`	Underlying radix tree data structure
`request_state.h`	Request model state definitions

Namespace

All types are defined in mlc::llm::serve.

Type Alias: PrefixCacheRemoveCallback

using PrefixCacheRemoveCallback = std::function<void(int64_t)>;

A callback function type invoked when a sequence is evicted from the prefix cache. The parameter is the sequence ID being removed. This allows the KV cache manager to free the corresponding memory when eviction occurs.

Class: PrefixCacheMatchedResult

class PrefixCacheMatchedResult {
 public:
  size_t prefilled_offset = 0;
  int64_t forked_seq_id = -1;
  int64_t reused_seq_id = -1;
  size_t reused_seq_pop_last_tokens = 0;
};

Describes how to handle a new sequence given the current cache state. The result indicates one of several strategies for leveraging existing KV cache data:

Field	Type	Description
`prefilled_offset`	`size_t`	The number of prefix tokens already prefilled and available in cache
`forked_seq_id`	`int64_t`	The sequence ID to fork from (`-1` if no fork)
`reused_seq_id`	`int64_t`	A finished sequence ID whose KV data can be reused directly (`-1` if none)
`reused_seq_pop_last_tokens`	`size_t`	Number of trailing tokens to pop from the reused sequence to align with the new request

Class: PrefixCacheObj

Abstract base class inheriting from tvm::runtime::Object that defines the prefix cache interface.

Method: InsertSequence

virtual PrefixCacheMatchedResult InsertSequence(int64_t seq_id, std::vector<int32_t> tokens,
                                                int sliding_window_size = -1,
                                                int attention_sink_size = 0) = 0;

Inserts a new tokenized sequence into the cache and returns a PrefixCacheMatchedResult describing the optimal reuse strategy. Supports optional sliding window attention and attention sink parameters.

Method: ExtendSequence

virtual void ExtendSequence(int64_t seq_id, const std::vector<int32_t>& tokens) = 0;

Extends an existing active sequence with additional tokens. The extension may be cached and lazily committed.

Method: CommitSequenceExtention

virtual void CommitSequenceExtention() = 0;

Commits all pending lazy extensions from previous ExtendSequence calls. This two-phase approach allows batching extension operations.

Method: RollBackSequence

virtual void RollBackSequence(int64_t seq_id, size_t num_tokens) = 0;

Rolls back a sequence by a specified number of tokens, used when speculative decoding rejects draft tokens.

Method: RecycleSequence

virtual void RecycleSequence(int64_t seq_id, bool lazy = true) = 0;

Marks a sequence for recycling. With lazy = true (the default), the sequence remains in cache until memory pressure requires eviction, enabling future reuse. With lazy = false, the sequence is removed immediately.

Method: TryFreeMemory

virtual bool TryFreeMemory() = 0;

Attempts to free memory by evicting the oldest recycled sequence. Returns true if a sequence was successfully removed.

Method: HasSequence

virtual bool HasSequence(int64_t seq_id) = 0;

Checks whether a sequence with the given ID exists in the cache.

Method: Reset

virtual void Reset() = 0;

Resets the prefix cache to its initial empty state.

Method: Mode

virtual PrefixCacheMode Mode() = 0;

Returns the current prefix cache mode (radix or disabled).

TVM Object Registration

static constexpr const bool _type_mutable = true;
TVM_FFI_DECLARE_OBJECT_INFO("mlc.serve.PrefixCache", PrefixCacheObj, Object);

Registered under type key "mlc.serve.PrefixCache" and marked as mutable.

Class: PrefixCache

The managed reference type for PrefixCacheObj, providing two factory methods:

Factory: CreateRadixPrefixCache

static PrefixCache CreateRadixPrefixCache(size_t max_recycling_seqs,
                                          PrefixCacheRemoveCallback remove_callback = nullptr);

Creates a radix-tree-backed prefix cache.

max_recycling_seqs -- Maximum number of completed sequences to retain for potential reuse.
remove_callback -- Optional callback invoked when a sequence is evicted from the cache.

Factory: CreateNoPrefixCache

static PrefixCache CreateNoPrefixCache();

Creates a no-op prefix cache that does not perform any prefix sharing. This is used when prefix caching is disabled in the engine configuration.

Relationship to Radix Tree

The prefix cache builds on the PagedRadixTree data structure, which provides the underlying efficient prefix matching and sequence storage. The prefix cache adds lifecycle management (recycling, eviction, lazy extension) on top of the raw radix tree operations.

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment