Principle: MLC LLM Advanced Serving Features
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Serving, Systems_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Advanced serving features are inference-time optimizations (speculative decoding, prefix caching, and hybrid prefill modes, among others) that accelerate large language model serving beyond basic autoregressive token generation.
Description
Basic LLM serving generates one token at a time in an autoregressive loop: each token requires a full forward pass through the model, and the KV cache grows by one entry per step. While this approach is functionally correct, it leaves significant performance on the table. Advanced serving features attack different bottlenecks in this pipeline:
Speculative Decoding
Speculative decoding reduces the effective number of full-model forward passes by using a cheaper draft mechanism to propose multiple candidate tokens, which the full model then verifies in a single batched pass.
Small Draft Model: A smaller, faster model generates a sequence of draft tokens. The main model scores all draft tokens in parallel and accepts a prefix of tokens that match its own distribution (via rejection sampling). The expected number of accepted tokens per verification step is typically 2-4; the end-to-end speedup scales with this number, discounted by the cost of running the draft model.
EAGLE: Rather than a separate draft model, EAGLE uses a lightweight autoregressive head on top of the main model's hidden states. It predicts future token features directly, avoiding the need for a separate model's vocabulary projection and embedding lookup.
Medusa: Medusa augments the model with multiple parallel decoding heads that each predict a different future token position. The predictions form a tree of candidate continuations, which the main model verifies in a single pass using tree attention.
The configuration parameters spec_draft_length and spec_tree_width control the depth and breadth of speculation. When spec_draft_length is 0, the engine adaptively adjusts the draft length based on observed acceptance rates.
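The acceptance step shared by all of these variants can be sketched in a few lines. The sketch below assumes dense `{token: prob}` dictionaries per position and a pluggable random source; `verify_draft` and `sample_from` are hypothetical names, and a real engine operates on logit tensors rather than dicts.

```python
import random

def sample_from(dist, r):
    """Invert the CDF of a {token: prob} dict for a uniform draw r in [0, 1)."""
    acc = 0.0
    for tok in sorted(dist):
        acc += dist[tok]
        if r < acc:
            return tok
    return max(dist)  # guard against floating-point rounding

def verify_draft(draft_tokens, draft_probs, target_probs, rng=random.random):
    """Accept a prefix of draft tokens via speculative-sampling rejection.

    draft_probs[i] and target_probs[i] are {token: prob} dicts for position i.
    Returns (accepted_tokens, correction_token_or_None).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i].get(tok, 0.0)
        q = max(draft_probs[i].get(tok, 0.0), 1e-12)
        if rng() < min(1.0, p / q):
            accepted.append(tok)  # accepted tokens follow the target distribution exactly
            continue
        # Rejected: resample from the residual max(0, p - q), renormalized,
        # which restores the exact target distribution at this position.
        residual = {t: max(0.0, target_probs[i].get(t, 0.0) - draft_probs[i].get(t, 0.0))
                    for t in target_probs[i]}
        total = sum(residual.values()) or 1.0
        residual = {t: w / total for t, w in residual.items()}
        return accepted, sample_from(residual, rng())
    return accepted, None
```

Because rejected positions are corrected by resampling from the residual distribution, the output stream is provably distributed as if the target model had decoded alone; speculation changes only the cost, never the samples.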
Prefix Caching
Many LLM serving workloads involve repeated prompt prefixes: system prompts, few-shot examples, or shared conversation context. Prefix caching stores the KV cache entries for previously computed prefixes and reuses them for new requests that share the same prefix.
Radix Tree Implementation: The radix tree organizes cached token sequences in a trie-like structure where each edge represents a subsequence of tokens. When a new request arrives, the engine traverses the tree to find the longest matching prefix, reusing the corresponding KV cache pages and skipping redundant prefill computation.
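The lookup structure can be illustrated with a plain per-token trie (a real radix tree additionally compresses runs of single-child nodes into one edge). The class and field names below are illustrative, not the engine's actual API; "pages" stand in for KV-cache page identifiers.

```python
class PrefixCache:
    """Minimal trie-based prefix cache: one node per cached token position."""

    def __init__(self):
        self.root = {"children": {}}

    def insert(self, tokens, pages):
        """Record the KV-cache page id for each token of a finished prefill."""
        node = self.root
        for tok, page in zip(tokens, pages):
            node = node["children"].setdefault(tok, {"children": {}, "page": None})
            node["page"] = page

    def longest_prefix(self, tokens):
        """Return (match_length, reusable_pages) for the longest cached prefix."""
        node, pages = self.root, []
        for tok in tokens:
            child = node["children"].get(tok)
            if child is None:
                break  # divergence point: prefill must resume from here
            pages.append(child["page"])
            node = child
        return len(pages), pages
```

A new request first calls `longest_prefix` on its token ids; the matched pages are attached to the request's KV cache, and prefill starts at the divergence point instead of position zero.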
The prefix_cache_mode parameter enables or disables this feature, and prefix_cache_max_num_recycling_seqs controls how many evicted sequences retain their KV cache entries for potential reuse.
Hybrid Prefill (Split-Fuse)
In standard chunked prefill, prefill and decode phases alternate: the engine processes a chunk of prefill tokens, then performs decode steps, then returns to prefill. Each phase switch creates a pipeline bubble during which the GPU is underutilized.
Hybrid prefill (split-fuse) merges prefill and decode operations into a single batch. Decode requests are converted into single-token "prefill" operations and fused with pending prefill chunks. This eliminates phase-switching overhead and improves GPU utilization, particularly under mixed workloads with both new and ongoing requests.
The prefill_mode parameter selects between "chunked" (basic alternating) and "hybrid" (split-fuse) strategies.
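The fusion logic amounts to a batch-building policy under a token budget. The sketch below uses hypothetical queue shapes (a real scheduler tracks far more per-request state): decode requests become single-token entries first so that ongoing generations are never starved, and the remaining budget is filled with (possibly partial) prefill chunks.

```python
def build_hybrid_batch(prefill_queue, decode_queue, token_budget):
    """Fuse pending prefill chunks with decode steps into one batch.

    prefill_queue: list of (request_id, remaining_prefill_tokens) pairs.
    decode_queue: list of request_ids, each with one pending decode token.
    Returns a list of (request_id, num_tokens) entries for a single batch.
    """
    batch = []
    # Decode requests are converted into single-token "prefill" operations.
    for rid in decode_queue:
        if token_budget == 0:
            break
        batch.append((rid, 1))
        token_budget -= 1
    # Fill the rest of the budget with prefill chunks, splitting the last one.
    for rid, remaining in prefill_queue:
        if token_budget == 0:
            break
        take = min(remaining, token_budget)
        batch.append((rid, take))
        token_budget -= take
    return batch
```

Since every entry is now a "prefill" of some length, the engine runs one fused kernel launch per step instead of alternating between two phase-specific paths.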
Usage
These features are configured via the EngineConfig dataclass and take effect at engine initialization:
- Speculative Decoding: Enable when per-token latency is the primary bottleneck and a suitable draft model or head is available. Most effective for long generation sequences where the amortized verification cost is low.
- Prefix Caching: Enable when the workload has significant prompt prefix overlap (e.g., chat applications with system prompts, RAG pipelines with shared document context). The radix tree mode is recommended as the default.
- Hybrid Prefill: Enable (default) for production workloads with mixed prefill and decode traffic. The performance benefit scales with request concurrency.
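Putting the parameters named above together, a configuration might look like the following sketch. Only the field names mentioned in this document are used; the defaults, value strings, and the standalone dataclass itself are assumptions for illustration, since the real EngineConfig lives in the engine package and carries many more fields.

```python
from dataclasses import dataclass

@dataclass
class EngineConfig:
    """Illustrative subset of the serving-related configuration fields."""
    prefill_mode: str = "hybrid"                    # "chunked" or "hybrid" (split-fuse)
    prefix_cache_mode: str = "radix"                # radix-tree prefix caching on by default
    prefix_cache_max_num_recycling_seqs: int = 16   # evicted seqs kept for potential reuse
    spec_draft_length: int = 0                      # 0 = adapt to observed acceptance rate
    spec_tree_width: int = 1                        # breadth of the speculation tree

# Example: fixed draft length for a latency-sensitive chat deployment.
config = EngineConfig(spec_draft_length=4)
```

The values take effect at engine initialization, so changing them requires reloading the engine rather than mutating a running instance.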
Theoretical Basis
Speculative Decoding Acceptance Rate
Given a draft model distribution q(x) and a target model distribution p(x), the acceptance probability for each draft token is:
P(accept) = sum_x min(p(x), q(x))
The expected number of accepted tokens from a draft of length K follows a geometric-like distribution. The expected speedup is approximately:
Speedup = E[accepted_tokens + 1] / (cost_draft_K + cost_verify_1)
where cost_draft_K is the cost of generating K draft tokens and cost_verify_1 is the cost of one batched verification pass, both normalized to the cost of a single full-model forward pass.
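Under the simplifying assumption of an i.i.d. per-token acceptance probability alpha, the expectation has a closed form, (1 - alpha^(K+1)) / (1 - alpha), counting the accepted tokens plus the one token always emitted at the verification step. A small worked example (function names are ours, not the engine's):

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected tokens produced per draft-verify cycle (accepted + 1 emitted),
    assuming i.i.d. per-token acceptance probability alpha and draft length k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, k, draft_cost):
    """Speedup over plain decoding, with costs normalized so one full-model
    forward pass costs 1 and one draft token costs draft_cost."""
    return expected_tokens_per_cycle(alpha, k) / (draft_cost * k + 1)

# 80% acceptance, draft length 4, draft pass 10x cheaper than the target:
tokens = expected_tokens_per_cycle(0.8, 4)   # ~3.36 tokens per cycle
gain = speedup(0.8, 4, 0.1)                  # ~2.4x over plain decoding
```

Note how the speedup saturates: pushing K higher adds draft cost linearly while the expected acceptance gain shrinks geometrically, which is why adaptive draft lengths (spec_draft_length = 0) can outperform any fixed K.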
Prefix Cache Hit Analysis
For a workload with N requests sharing a common prefix of length L, the compute savings are:
Savings = (N - 1) * L * cost_per_token_prefill
The radix tree lookup has complexity O(L) in the prefix length, which is negligible compared to the saved prefill computation.
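The savings formula is simple enough to evaluate directly. The helpers below (our names, for illustration) also express the savings as a fraction of total prefill work, which is often the more intuitive number:

```python
def prefill_savings(n_requests, prefix_len, cost_per_token):
    """Absolute savings per the formula above: only the first request pays
    for the shared prefix; the other N - 1 reuse its KV pages."""
    return (n_requests - 1) * prefix_len * cost_per_token

def savings_fraction(n_requests, prefix_len, unique_len):
    """Fraction of total prefill compute avoided when N requests share a
    prefix of prefix_len tokens and each adds unique_len unique tokens."""
    total = n_requests * (prefix_len + unique_len)
    return (n_requests - 1) * prefix_len / total

# 100 requests sharing a 500-token system prompt, 100 unique tokens each:
# 82.5% of all prefill compute is skipped.
frac = savings_fraction(100, 500, 100)
```

In the limit of many requests with a long shared prefix, prefill cost approaches that of the unique suffixes alone, which is why chat and RAG workloads benefit so strongly.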
Engine Initialization Flow
function initialize_engine(kind, model, device, config):
1. Validate config consistency
2. Parse model paths, resolve or JIT-compile model libraries
3. Load model configs and conversation templates
4. Initialize engine state (trace recorder, stream callbacks)
5. Create threaded C++ engine via TVM FFI
6. Initialize tokenizer
7. Start background inference loop thread
8. Start background stream-back loop thread
9. Reload engine with finalized config (JSON-serialized)
10. Query completed engine config for actual memory allocation