Implementation:Mlc ai Mlc llm Radix Tree Header
Overview
The file cpp/serve/radix_tree.h declares the PagedRadixTree data structure used in the MLC-LLM serving engine. A paged radix tree (also known as a paged trie) stores token sequences in a compressed tree structure where common prefixes are shared between sequences. This enables efficient prefix matching for KV cache reuse, reducing redundant prefill computation when multiple requests share similar input prompts.
File Location
cpp/serve/radix_tree.h
Dependencies
| Header | Purpose |
|---|---|
tvm/ffi/container/shape.h |
TVM shape container types |
tvm/ffi/reflection/registry.h |
TVM object reflection and registration |
tvm/runtime/int_tuple.h |
IntTuple type for returning token sequences
|
tvm/runtime/object.h |
TVM Object base class |
<unordered_map>, <unordered_set> |
Hash containers for internal bookkeeping |
Namespace
All types are defined in mlc::llm::serve.
Class: PagedRadixTreeObj
Abstract base class inheriting from tvm::runtime::Object. Defines the complete interface for paged radix tree operations.
Method: HasSequence
virtual bool HasSequence(int64_t seq_id) = 0;
Checks whether a sequence with the given ID exists in the tree.
Method: GetSequence
virtual IntTuple GetSequence(int64_t seq_id) = 0;
Returns all tokens in the specified sequence as an IntTuple. Throws an error if the sequence ID is invalid.
Method: MatchPrefix
virtual std::pair<size_t, std::vector<int64_t>> MatchPrefix(
const std::vector<int32_t>& tokens) = 0;
Finds all sequences sharing the longest common prefix with the given token sequence. Returns a pair containing:
- The length of the matched prefix
- A vector of sequence IDs that share this prefix
This is the core lookup operation for prefix cache matching.
Method: GetSequenceLength
virtual size_t GetSequenceLength(int64_t seq_id) = 0;
Returns the total length (in tokens) of the specified sequence.
Method: ForkSequence
virtual void ForkSequence(int64_t seq_id, int64_t parent_seq_id, size_t forked_offset) = 0;
Creates a new sequence by forking from a parent sequence at a specified offset. The valid offset range is [1, parent_length]. When the offset equals the parent's full length, the new sequence is a complete copy of the parent. This enables tree-based attention patterns for parallel generation (e.g., beam search or parallel sampling with n > 1).
Method: AddSequence
virtual void AddSequence(int64_t seq_id) = 0;
Adds an empty sequence at the root of the tree.
Method: ExtendSequence
virtual void ExtendSequence(int64_t seq_id, const std::vector<int32_t>& tokens) = 0;
Appends tokens to an existing sequence.
Method: RollBackSequence
virtual void RollBackSequence(int64_t seq_id, size_t num_tokens) = 0;
Removes the last num_tokens from a sequence. This is used during speculative decoding rollback when draft tokens are rejected.
Method: RemoveSequence
virtual void RemoveSequence(int64_t seq_id) = 0;
Completely removes a sequence from the tree.
Method: FreeCapacity
virtual size_t FreeCapacity() = 0;
Returns the remaining token capacity of the paged radix tree. This indicates how many more tokens can be stored before the tree runs out of allocated pages.
Method: Reset
virtual void Reset() = 0;
Resets the tree to its initial empty state, removing all sequences and freeing all pages.
TVM Object Registration
static constexpr const bool _type_mutable = true;
TVM_FFI_DECLARE_OBJECT_INFO("mlc.serve.PagedRadixTree", PagedRadixTreeObj, Object);
Registered under type key "mlc.serve.PagedRadixTree" and marked as mutable.
Class: PagedRadixTree
The managed reference type for PagedRadixTreeObj.
Factory Method
static PagedRadixTree Create();
Constructs and returns a new paged radix tree instance.
Design Notes
The paged radix tree is the underlying data structure for the PrefixCache. It provides raw sequence storage and prefix matching operations, while the prefix cache layer adds:
- Lifecycle management (recycling, lazy eviction)
- Sliding window attention support
- Integration with the KV cache memory management
The "paged" designation indicates that the tree uses a page-based memory allocation strategy, which aligns with the paged KV cache architecture used by the serving engine for efficient GPU memory management.
See Also
- Prefix Cache Header -- Higher-level cache built on top of this radix tree
- Request Header -- Requests whose token sequences are stored in the tree
- Request State Implementation -- Request states managed alongside cached sequences