Implementation:Mlc ai Mlc llm Radix Tree Header

Overview

The file cpp/serve/radix_tree.h declares the PagedRadixTree data structure used in the MLC-LLM serving engine. A paged radix tree (also known as a paged trie) stores token sequences in a compressed tree structure where common prefixes are shared between sequences. This enables efficient prefix matching for KV cache reuse, reducing redundant prefill computation when multiple requests share similar input prompts.

File Location

cpp/serve/radix_tree.h

Dependencies

Header	Purpose
`tvm/ffi/container/shape.h`	TVM shape container types
`tvm/ffi/reflection/registry.h`	TVM object reflection and registration
`tvm/runtime/int_tuple.h`	`IntTuple` type for returning token sequences
`tvm/runtime/object.h`	TVM Object base class
`<unordered_map>`, `<unordered_set>`	Hash containers for internal bookkeeping

Namespace

All types are defined in mlc::llm::serve.

Class: PagedRadixTreeObj

Abstract base class inheriting from tvm::runtime::Object. Defines the complete interface for paged radix tree operations.

Method: HasSequence

virtual bool HasSequence(int64_t seq_id) = 0;

Checks whether a sequence with the given ID exists in the tree.

Method: GetSequence

virtual IntTuple GetSequence(int64_t seq_id) = 0;

Returns all tokens in the specified sequence as an IntTuple. Throws an error if the sequence ID is invalid.

Method: MatchPrefix

virtual std::pair<size_t, std::vector<int64_t>> MatchPrefix(
    const std::vector<int32_t>& tokens) = 0;

Finds all sequences sharing the longest common prefix with the given token sequence. Returns a pair containing:

The length of the matched prefix
A vector of sequence IDs that share this prefix

This is the core lookup operation for prefix cache matching.

Method: GetSequenceLength

virtual size_t GetSequenceLength(int64_t seq_id) = 0;

Returns the total length (in tokens) of the specified sequence.

Method: ForkSequence

virtual void ForkSequence(int64_t seq_id, int64_t parent_seq_id, size_t forked_offset) = 0;

Creates a new sequence by forking from a parent sequence at a specified offset. The valid offset range is [1, parent_length]. When the offset equals the parent's full length, the new sequence is a complete copy of the parent. This enables tree-based attention patterns for parallel generation (e.g., beam search or parallel sampling with n > 1).

Method: AddSequence

virtual void AddSequence(int64_t seq_id) = 0;

Adds an empty sequence at the root of the tree.

Method: ExtendSequence

virtual void ExtendSequence(int64_t seq_id, const std::vector<int32_t>& tokens) = 0;

Appends tokens to an existing sequence.

Method: RollBackSequence

virtual void RollBackSequence(int64_t seq_id, size_t num_tokens) = 0;

Removes the last num_tokens from a sequence. This is used during speculative decoding rollback when draft tokens are rejected.

Method: RemoveSequence

virtual void RemoveSequence(int64_t seq_id) = 0;

Completely removes a sequence from the tree.

Method: FreeCapacity

virtual size_t FreeCapacity() = 0;

Returns the remaining token capacity of the paged radix tree. This indicates how many more tokens can be stored before the tree runs out of allocated pages.

Method: Reset

virtual void Reset() = 0;

Resets the tree to its initial empty state, removing all sequences and freeing all pages.

TVM Object Registration

static constexpr const bool _type_mutable = true;
TVM_FFI_DECLARE_OBJECT_INFO("mlc.serve.PagedRadixTree", PagedRadixTreeObj, Object);

Registered under type key "mlc.serve.PagedRadixTree" and marked as mutable.

Class: PagedRadixTree

The managed reference type for PagedRadixTreeObj.

Factory Method

static PagedRadixTree Create();

Constructs and returns a new paged radix tree instance.

Design Notes

The paged radix tree is the underlying data structure for the PrefixCache. It provides raw sequence storage and prefix matching operations, while the prefix cache layer adds:

Lifecycle management (recycling, lazy eviction)
Sliding window attention support
Integration with the KV cache memory management

The "paged" designation indicates that the tree uses a page-based memory allocation strategy, which aligns with the paged KV cache architecture used by the serving engine for efficient GPU memory management.

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment