Implementation:Mlc ai Mlc llm Batch Jumpforward
Overview
File: cpp/serve/engine_actions/batch_jumpforward.cc
Purpose: Implements the BatchJumpForwardActionObj engine action, which performs grammar-guided jump-forward decoding for requests in the serving engine's running_queue. When a grammar state matcher can deterministically predict upcoming tokens (e.g., required syntax in structured output), this action skips the normal token-by-token decoding and directly commits those tokens, significantly accelerating generation for constrained outputs.
Namespace: mlc::llm::serve
Class: BatchJumpForwardActionObj
Inherits from EngineActionObj and implements the Step method.
Constructor
explicit BatchJumpForwardActionObj(Array<Model> models, Tokenizer tokenizer,
Optional<EventTraceRecorder> trace_recorder)
Accepts an array of models, a tokenizer (required for retokenization during jump-forward), and an optional event trace recorder for profiling.
Step Method
Array<Request> Step(EngineState estate) final;
The main execution method, invoked each engine iteration. The logic proceeds as follows:
- Early exit: Returns immediately if there are multiple models (jump-forward only supports single-model mode) or if the running queue is empty.
- Memory check and preemption: Iterates over running request state entries and preempts low-priority requests if insufficient memory pages are available, using PreemptLastRunningRequestStateEntry. Also attempts to free prefix cache memory before preempting.
- Jump-forward execution: For each eligible request state entry:
  - Checks eligibility via CanJumpForward.
  - Queries the grammar matcher for a deterministic jump-forward string.
  - Retokenizes the new string combined with recent context tokens.
  - Handles rollback of conflicting tokens in the stream output, model state, and KV cache.
  - Commits the new tokens to the model state.
  - Sets the require_retokenization_in_next_decode flag.
  - Updates jump-forward and completion token metrics.
- Timing: Records the total jump-forward time in the engine metrics.
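The control flow above can be illustrated with a small, self-contained sketch. It is not the code in batch_jumpforward.cc: ToyEntry and ToyStep are hypothetical stand-ins, characters play the role of tokens, and the memory check, retokenization, and rollback steps are left out. Skipping entries whose jump-forward string is empty is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a running request state entry (not an MLC LLM type).
struct ToyEntry {
  bool can_jump_forward;                       // outcome of a CanJumpForward-style check
  std::string jump_forward_str;                // what a grammar matcher might return
  int committed;                               // toy "token" count (one per character)
  bool require_retokenization_in_next_decode;  // flag named in the text above
};

// Toy version of the Step control flow. Returns how many entries jumped forward.
int ToyStep(int num_models, std::vector<ToyEntry>& running_queue) {
  // Early exit: single-model mode only, and nothing to do on an empty queue.
  if (num_models > 1 || running_queue.empty()) return 0;

  int jumped = 0;
  for (ToyEntry& entry : running_queue) {
    if (!entry.can_jump_forward) continue;
    // Assumption: an empty string means nothing is deterministic, so skip.
    if (entry.jump_forward_str.empty()) continue;
    // Commit the deterministic text directly instead of decoding it.
    entry.committed += static_cast<int>(entry.jump_forward_str.size());
    entry.require_retokenization_in_next_decode = true;
    ++jumped;
  }
  assert(jumped <= static_cast<int>(running_queue.size()));
  return jumped;
}
```

Calling ToyStep with two models returns 0 without touching the queue, which mirrors the single-model restriction described above.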
Private Methods
CheckMemForJumpForward
bool CheckMemForJumpForward(int num_rsentries);
Determines whether jump-forward decoding can proceed without exceeding the memory limit. Uses a constant MAX_AVG_JUMPFORWARD_PAGES_PER_REQUEST = 10 and checks available pages from the first model.
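A minimal sketch of such a page-budget check follows. The constant value comes from the description above; the comparison itself (entries times budget against free pages) is an assumption about how the limit might be applied, and ToyCheckMemForJumpForward with its num_available_pages parameter is a hypothetical name, since the real method reads the page count from the first model.

```cpp
#include <cassert>

// Value stated in the description above.
constexpr int kMaxAvgJumpForwardPagesPerRequest = 10;

// Assumed rule: reserve an average page budget for every running entry and
// compare the total against the pages that are currently free.
bool ToyCheckMemForJumpForward(int num_rsentries, int num_available_pages) {
  assert(num_rsentries >= 0 && num_available_pages >= 0);
  return num_rsentries * kMaxAvgJumpForwardPagesPerRequest <= num_available_pages;
}
```

Under this assumed rule, three entries fit in 30 free pages while four entries do not fit in 39, which is the situation in which the Step method would start preempting.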
CanJumpForward
bool CanJumpForward(const RequestStateEntry& rsentry);
Returns true only if all three conditions are met:
- The request's grammar execution mode is kJumpForward.
- Log probabilities are not requested (logprobs and jump-forward are incompatible since intermediate token probabilities are skipped).
- A grammar matcher is defined for the request's model state.
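The three conditions combine into a plain conjunction, sketched below. ToyRequestView, ToyGrammarExecutionMode, and the kOtherMode enumerator are hypothetical; only kJumpForward is named in the description, and the real method reads these values from the RequestStateEntry.

```cpp
#include <cassert>

// Hypothetical enum; only kJumpForward appears in the description above.
enum class ToyGrammarExecutionMode { kOtherMode, kJumpForward };

// Hypothetical flattened view of the fields the check depends on.
struct ToyRequestView {
  ToyGrammarExecutionMode grammar_execution_mode;
  bool logprobs;             // true if the request asks for log probabilities
  bool has_grammar_matcher;  // true if a matcher is defined for the model state
};

// All three conditions must hold for jump-forward to be allowed.
bool ToyCanJumpForward(const ToyRequestView& request) {
  return request.grammar_execution_mode == ToyGrammarExecutionMode::kJumpForward &&
         !request.logprobs && request.has_grammar_matcher;
}
```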
RetokenizeWithNewString
std::tuple<int, std::vector<int32_t>, std::string> RetokenizeWithNewString(
RequestModelState mstate, const std::string& new_string, int max_rollback_tokens);
Handles the retokenization problem that arises when appending a jump-forward string to existing token context. The algorithm:
- Retrieves the last max_rollback_tokens committed tokens and decodes them back to a string.
- Tokenizes the concatenation of the past string and the new jump-forward string.
- Removes the last token if it is a prefix of another token (to avoid tokens that would be rolled back in the next decode, which would disturb the probability distribution).
- Computes the divergence point between the old and new token sequences.
- Returns a tuple of (rollback_count, new_tokens, delta_string).
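The divergence step can be sketched independently of any tokenizer. The helper below, ToyDiffTokens, is a hypothetical illustration rather than the method itself: it takes the recent committed tokens and the retokenized sequence as plain vectors, finds their common prefix, and reports how many old tokens would need to be rolled back and which tokens would be appended. The decoding, tokenization, last-token removal, and delta string are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns (rollback_count, tokens_to_append) for two token sequences that
// cover the same window: the old committed tokens and the retokenized result.
std::pair<int, std::vector<int32_t>> ToyDiffTokens(
    const std::vector<int32_t>& past_tokens,
    const std::vector<int32_t>& retokenized) {
  // Walk the common prefix; the first mismatch is the divergence point.
  const std::size_t limit = std::min(past_tokens.size(), retokenized.size());
  std::size_t first_diff = 0;
  while (first_diff < limit && past_tokens[first_diff] == retokenized[first_diff]) {
    ++first_diff;
  }
  assert(first_diff <= past_tokens.size());
  // Old tokens after the divergence point must be rolled back.
  const int rollback_count = static_cast<int>(past_tokens.size() - first_diff);
  // Everything in the new sequence from the divergence point on is appended.
  std::vector<int32_t> new_tokens(
      retokenized.begin() + static_cast<std::ptrdiff_t>(first_diff),
      retokenized.end());
  return std::make_pair(rollback_count, std::move(new_tokens));
}
```

For example, with past tokens {5, 6, 7} and a retokenized sequence {5, 6, 9, 10}, one old token is rolled back and {9, 10} is appended; when the old sequence is a strict prefix of the new one, nothing is rolled back.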
HandleRollback
void HandleRollback(const RequestStateEntry& rsentry, RequestModelState mstate, int rollback_cnt,
const std::vector<int32_t>& new_tokens, const std::string& new_string);
Manages the three-part rollback required when retokenization produces different tokens than the existing committed sequence:
- Stream output rollback: If the rollback extends past the callback boundary, accumulates text into extra_prefix_string and adjusts next_callback_token_pos.
- Model state rollback: Calls mstate->RollbackTokens(rollback_cnt) to undo committed tokens.
- KV cache rollback: Pops entries from the KV cache via models_[0]->PopNFromKVCache, adjusted by num_tokens_for_next_decode.
Member Variables
| Member | Type | Description |
|---|---|---|
| models_ | Array<Model> | The models (only first model is used; multi-model aborts early) |
| tokenizer_ | Tokenizer | Tokenizer for retokenization during jump-forward |
| trace_recorder_ | Optional<EventTraceRecorder> | Optional event trace recorder |
| MAX_ROLLBACK_TOKENS_ | const int | Maximum tokens to consider for rollback (hardcoded to 10) |
Factory Function
EngineAction EngineAction::BatchJumpForward(Array<Model> models, Tokenizer tokenizer,
Optional<EventTraceRecorder> trace_recorder);
Static factory method on EngineAction that constructs a BatchJumpForwardActionObj wrapped in a TVM object reference.
Design Notes
- Jump-forward decoding is incompatible with log probability output because intermediate token probabilities are unavailable when tokens are committed directly from grammar constraints.
- The retokenization step is critical because tokenization is context-dependent: appending a new string may change how the preceding tokens should be encoded.
- The MAX_ROLLBACK_TOKENS_ constant of 10 bounds the retokenization window, balancing correctness with performance.
- The action returns an empty Array<Request> since it does not generate streaming output directly; output is handled by subsequent actions.