Implementation:Mlc ai Mlc llm Batch Jumpforward

From Leeroopedia
Revision as of 15:48, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Mlc_ai_Mlc_llm_Batch_Jumpforward.md)


Overview

File: cpp/serve/engine_actions/batch_jumpforward.cc

Purpose: Implements the BatchJumpForwardActionObj engine action, which performs grammar-guided jump-forward decoding for requests in the serving engine's running_queue. When a grammar state matcher can deterministically predict upcoming tokens (e.g., required syntax in structured output), this action skips the normal token-by-token decoding and directly commits those tokens, so no sampling step is spent on text the grammar already determines. This accelerates generation for constrained outputs.

Namespace: mlc::llm::serve

Class: BatchJumpForwardActionObj

Inherits from EngineActionObj and implements the Step method.

Constructor

explicit BatchJumpForwardActionObj(Array<Model> models, Tokenizer tokenizer,
                                   Optional<EventTraceRecorder> trace_recorder)

Accepts an array of models, a tokenizer (required for retokenization during jump-forward), and an optional event trace recorder for profiling.

Step Method

Array<Request> Step(EngineState estate) final;

The main execution method, invoked each engine iteration. The logic proceeds as follows:

  1. Early exit: Returns immediately if there are multiple models (jump-forward only supports single-model mode) or if the running queue is empty.
  2. Memory check and preemption: Iterates over the running request state entries and checks whether enough memory pages are available. If not, it first attempts to free prefix cache memory, then preempts low-priority requests via PreemptLastRunningRequestStateEntry.
  3. Jump-forward execution: For each eligible request state entry:
    • Checks eligibility via CanJumpForward.
    • Queries the grammar matcher for a deterministic jump-forward string.
    • Retokenizes the new string combined with recent context tokens.
    • Handles rollback of conflicting tokens in the stream output, model state, and KV cache.
    • Commits the new tokens to the model state.
    • Sets the require_retokenization_in_next_decode flag.
    • Updates jump-forward and completion token metrics.
  4. Timing: Records the total jump-forward time in the engine metrics.
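
The control flow above can be sketched with toy types. This is a simplified illustration, not the engine code: EntrySketch and StepSketch are hypothetical names, the retokenize, rollback, and commit steps are elided, and skipping entries whose jump-forward string is empty is an assumption that the source does not state.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a running request state entry.
struct EntrySketch {
  bool can_jump_forward;             // result of the eligibility check
  std::string jump_forward_string;   // what the grammar matcher would return
  bool require_retokenization_in_next_decode;
};

// Returns how many entries jumped forward in this step.
int StepSketch(int num_models, std::vector<EntrySketch>& running_queue) {
  // Step 1: early exit for multi-model mode or an empty running queue.
  if (num_models > 1 || running_queue.empty()) return 0;

  int jumped = 0;
  for (EntrySketch& entry : running_queue) {
    if (!entry.can_jump_forward) continue;
    // ASSUMPTION: an empty string means nothing is deterministic right now.
    if (entry.jump_forward_string.empty()) continue;
    // Retokenization, rollback, and token commit are elided in this sketch.
    entry.require_retokenization_in_next_decode = true;
    ++jumped;
  }
  return jumped;
}
```

Memory checks, preemption, and metric updates are left out to keep the shape of the loop visible.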

Private Methods

CheckMemForJumpForward

bool CheckMemForJumpForward(int num_rsentries);

Determines whether jump-forward decoding can proceed without exceeding the memory limit. Uses a constant MAX_AVG_JUMPFORWARD_PAGES_PER_REQUEST = 10 and checks available pages from the first model.
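
A minimal sketch of what such a page-budget check might look like. The comparison itself (entries times the per-request budget against the available pages) is an assumption, since the text above only names the constant and the source of the page count. CheckMemForJumpForwardSketch and its num_available_pages parameter are hypothetical: the real method takes only num_rsentries and reads the page count from the first model.

```cpp
#include <cassert>

// Constant named in the description above.
constexpr int MAX_AVG_JUMPFORWARD_PAGES_PER_REQUEST = 10;

// Hypothetical stand-alone version of the check. The available page count is
// passed in so the sketch runs without an engine.
// ASSUMPTION: the check reserves the average budget for every entry.
bool CheckMemForJumpForwardSketch(int num_rsentries, int num_available_pages) {
  assert(num_rsentries >= 0 && num_available_pages >= 0);
  return num_rsentries * MAX_AVG_JUMPFORWARD_PAGES_PER_REQUEST <= num_available_pages;
}
```

Under this assumption, three entries need at least 30 free pages before jump-forward proceeds without preemption.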

CanJumpForward

bool CanJumpForward(const RequestStateEntry& rsentry);

Returns true only if all three conditions are met:

  • The request's grammar execution mode is kJumpForward.
  • Log probabilities are not requested (logprobs and jump-forward are incompatible since intermediate token probabilities are skipped).
  • A grammar matcher is defined for the request's model state.
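
The three conditions combine into a single conjunction. The sketch below uses hypothetical stand-in types (RequestStateEntrySketch, GrammarMatcherStub, and the kOther placeholder enumerator) that model only the fields the predicate reads. The real RequestStateEntry layout is not shown in this page.

```cpp
#include <cassert>
#include <optional>

// kJumpForward is named in the text; kOther is a placeholder for any other mode.
enum class GrammarExecutionModeSketch { kOther, kJumpForward };

struct GrammarMatcherStub {};  // hypothetical: stands in for the grammar matcher

struct RequestStateEntrySketch {
  GrammarExecutionModeSketch grammar_execution_mode;
  bool logprobs;                                      // true if log probabilities were requested
  std::optional<GrammarMatcherStub> grammar_matcher;  // empty if no matcher is defined
};

// True only when all three eligibility conditions hold.
bool CanJumpForwardSketch(const RequestStateEntrySketch& rsentry) {
  return rsentry.grammar_execution_mode == GrammarExecutionModeSketch::kJumpForward &&
         !rsentry.logprobs &&
         rsentry.grammar_matcher.has_value();
}
```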

RetokenizeWithNewString

std::tuple<int, std::vector<int32_t>, std::string> RetokenizeWithNewString(
    RequestModelState mstate, const std::string& new_string, int max_rollback_tokens);

Handles the retokenization problem that arises when appending a jump-forward string to existing token context. The algorithm:

  1. Retrieves the last max_rollback_tokens committed tokens and decodes them back to a string.
  2. Tokenizes the concatenation of the past string and the new jump-forward string.
  3. Removes the last token if it is a prefix of another token (to avoid tokens that would be rolled back in the next decode, which would disturb the probability distribution).
  4. Computes the divergence point between the old and new token sequences.
  5. Returns a tuple of (rollback_count, new_tokens, delta_string).
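
Steps 3 and 4 can be illustrated in isolation. The helpers below are hypothetical and make two assumptions: the prefix test is a brute-force scan over token strings (the actual implementation may use a faster lookup), and the divergence point is the length of the longest common prefix of the old and retokenized sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Step 3 helper: true if `piece` is a proper prefix of some vocabulary entry,
// meaning a later decode could still extend it into a longer token.
bool IsPrefixOfAnotherToken(const std::string& piece, const std::vector<std::string>& vocab) {
  for (const std::string& token : vocab) {
    if (token.size() > piece.size() && token.compare(0, piece.size(), piece) == 0) {
      return true;
    }
  }
  return false;
}

// Step 4 helper: returns (rollback_count, tokens_to_append).
// old_tokens: the recent committed tokens that were decoded back to text.
// retokenized: tokens for the past text plus the jump-forward string.
std::pair<int, std::vector<int32_t>> DivergenceSketch(
    const std::vector<int32_t>& old_tokens, const std::vector<int32_t>& retokenized) {
  std::size_t common = 0;
  while (common < old_tokens.size() && common < retokenized.size() &&
         old_tokens[common] == retokenized[common]) {
    ++common;
  }
  // Old tokens after the common prefix no longer match and must be rolled back.
  const int rollback_count = static_cast<int>(old_tokens.size() - common);
  std::vector<int32_t> to_append(
      retokenized.begin() + static_cast<std::ptrdiff_t>(common), retokenized.end());
  return std::make_pair(rollback_count, std::move(to_append));
}
```

For example, with old tokens {5, 6, 7} and retokenized tokens {5, 6, 8, 9}, the sequences diverge at index 2, so one token is rolled back and {8, 9} is appended.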

HandleRollback

void HandleRollback(const RequestStateEntry& rsentry, RequestModelState mstate, int rollback_cnt,
                    const std::vector<int32_t>& new_tokens, const std::string& new_string);

Manages the three-part rollback required when retokenization produces different tokens than the existing committed sequence:

  1. Stream output rollback: If the rollback extends past the callback boundary, accumulates text into extra_prefix_string and adjusts next_callback_token_pos.
  2. Model state rollback: Calls mstate->RollbackTokens(rollback_cnt) to undo committed tokens.
  3. KV cache rollback: Pops entries from the KV cache via models_[0]->PopNFromKVCache, adjusted by num_tokens_for_next_decode.
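
One plausible reading of the num_tokens_for_next_decode adjustment in step 3 is that committed tokens which have not yet been run through the model have no KV cache entries, so only the remainder of the rollback needs popping. The sketch below encodes that assumption. RollbackPlan and PlanKVRollback are hypothetical names, and the actual arithmetic in the source file may differ.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type for the sketch.
struct RollbackPlan {
  int kv_entries_to_pop;       // count that would be passed to PopNFromKVCache
  int pending_after_rollback;  // pending tokens that survive the rollback
};

// ASSUMPTION: the most recent num_tokens_for_next_decode committed tokens are
// not in the KV cache yet, so the rollback consumes those before cache entries.
RollbackPlan PlanKVRollback(int rollback_cnt, int num_tokens_for_next_decode) {
  assert(rollback_cnt >= 0 && num_tokens_for_next_decode >= 0);
  const int from_pending = std::min(rollback_cnt, num_tokens_for_next_decode);
  return RollbackPlan{rollback_cnt - from_pending,
                      num_tokens_for_next_decode - from_pending};
}
```

Under this reading, rolling back three tokens when one is still pending pops two KV cache entries and leaves no pending tokens.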

Member Variables

Member                 Type                           Description
models_                Array<Model>                   The models (only first model is used; multi-model aborts early)
tokenizer_             Tokenizer                      Tokenizer for retokenization during jump-forward
trace_recorder_        Optional<EventTraceRecorder>   Optional event trace recorder
MAX_ROLLBACK_TOKENS_   const int                      Maximum tokens to consider for rollback (hardcoded to 10)

Factory Function

EngineAction EngineAction::BatchJumpForward(Array<Model> models, Tokenizer tokenizer,
                                            Optional<EventTraceRecorder> trace_recorder);

Static factory method on EngineAction that constructs a BatchJumpForwardActionObj wrapped in a TVM object reference.

Design Notes

  • Jump-forward decoding is incompatible with log probability output because intermediate token probabilities are unavailable when tokens are committed directly from grammar constraints.
  • The retokenization step is critical because tokenization is context-dependent: appending a new string may change how preceding tokens should be encoded.
  • The MAX_ROLLBACK_TOKENS_ constant of 10 bounds the retokenization window, balancing correctness with performance.
  • The action returns an empty Array<Request> since it does not generate streaming output directly; output is handled by subsequent actions.
