Implementation:Mlc ai Mlc llm Engine State

From Leeroopedia


Overview

File: cpp/serve/engine_state.h

Purpose: Defines the core runtime state of the MLC LLM serving engine. The EngineStateObj class holds all mutable state required during inference: the running and waiting request queues, per-request states, internal ID management, runtime metrics, the prefix cache, and the speculative decoding draft length. This state object is passed to all engine actions and updated throughout the serving pipeline.

Namespace: mlc::llm::serve

Type Alias

typedef TypedFunction<void(Array<RequestStreamOutput>)> FRequestStreamCallback;

Defines the callback function type for streaming request outputs. The callback receives an array of RequestStreamOutput objects and is invoked during action post-processing to deliver generated tokens to the caller.
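The shape of such a callback can be sketched without TVM. The block below is a minimal illustration only: std::function stands in for TypedFunction, std::vector for Array, and SketchStreamOutput with its two fields is a hypothetical stand-in for RequestStreamOutput, not its real layout.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for RequestStreamOutput (field names are assumptions).
struct SketchStreamOutput {
  std::string request_id;
  std::vector<int> delta_token_ids;
};

// Same shape as FRequestStreamCallback: takes a batch of outputs, returns nothing.
using SketchStreamCallback = std::function<void(std::vector<SketchStreamOutput>)>;

// Builds a callback that appends every delivered token to a local buffer,
// invokes it once with a two-request batch, and returns what was received.
inline std::vector<int> CollectTokens() {
  std::vector<int> received;
  SketchStreamCallback cb = [&received](std::vector<SketchStreamOutput> outputs) {
    for (const auto& out : outputs) {
      received.insert(received.end(), out.delta_token_ids.begin(),
                      out.delta_token_ids.end());
    }
  };
  std::vector<SketchStreamOutput> batch;
  batch.push_back({"req-0", {11, 12}});
  batch.push_back({"req-1", {42}});
  cb(batch);
  assert(received.size() == 3);
  return received;
}
```

Passing the whole batch in one call lets the caller handle all outputs produced by a single engine step together.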

Struct: EngineInternalIDManager

struct EngineInternalIDManager {
  std::vector<int64_t> available_ids;
  int64_t id_cnt = 0;

  int64_t GetNewId();
  void RecycleId(int64_t id);
  void Reset();
};

Manages the internal integer IDs assigned to requests within the engine. This is separate from the user-facing string request IDs.

| Method | Description |
|---|---|
| GetNewId() | Returns an unused ID. First attempts to reuse a recycled ID from available_ids; otherwise increments id_cnt and returns a new ID. |
| RecycleId(int64_t id) | Returns an ID to the pool of available IDs for reuse. |
| Reset() | Clears all available IDs and resets the counter to zero. |

The ID recycling mechanism ensures efficient reuse of internal IDs as requests complete, preventing unbounded ID growth during long-running serving sessions.
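The recycling behavior described above can be sketched as follows. The method bodies are assumptions inferred from the descriptions, not copied from engine_state.h; in particular, handing out the most recently recycled ID first and starting fresh IDs at 0 are choices made for this sketch, and SketchIDManager is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for EngineInternalIDManager.
struct SketchIDManager {
  std::vector<int64_t> available_ids;  // recycled IDs waiting for reuse
  int64_t id_cnt = 0;                  // next never-used ID (assumed to start at 0)

  int64_t GetNewId() {
    if (!available_ids.empty()) {
      // Prefer a recycled ID so the counter does not grow.
      int64_t id = available_ids.back();
      available_ids.pop_back();
      return id;
    }
    return id_cnt++;
  }

  void RecycleId(int64_t id) { available_ids.push_back(id); }

  void Reset() {
    available_ids.clear();
    id_cnt = 0;
  }
};

// Usage: two IDs are handed out, the first is recycled and then reused.
inline void DemoIdReuse() {
  SketchIDManager mgr;
  int64_t a = mgr.GetNewId();
  int64_t b = mgr.GetNewId();
  mgr.RecycleId(a);
  assert(mgr.GetNewId() == a);  // the recycled ID comes back first
  assert(mgr.id_cnt == 2);      // the counter did not advance on reuse
  (void)b;
}
```

With this scheme the counter only advances when no recycled ID is available, so the largest ID in use is bounded by the peak number of concurrent requests.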

Struct: ActionPostProcessWorkspace

struct ActionPostProcessWorkspace {
  std::vector<RequestStateEntry> finished_rsentries;
  Array<RequestStreamOutput> callback_delta_outputs;
};

Pre-allocated workspace used during action post-processing to avoid repeated memory allocation and deallocation. Contains:

  • finished_rsentries: Temporary storage for request state entries that have completed generation.
  • callback_delta_outputs: Temporary storage for stream outputs to be sent via the callback.
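The reuse pattern behind such a workspace can be sketched with standard containers. This is an illustration under assumptions: SketchWorkspace and PostProcessStep are hypothetical names, integers stand in for RequestStateEntry, and std::vector<std::string> stands in for Array<RequestStreamOutput>, whose allocation behavior may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ActionPostProcessWorkspace.
struct SketchWorkspace {
  std::vector<int> finished;         // stands in for finished_rsentries
  std::vector<std::string> outputs;  // stands in for callback_delta_outputs
};

// One hypothetical post-processing step: clear the buffers, refill them,
// and report how many outputs would be handed to the stream callback.
inline std::size_t PostProcessStep(SketchWorkspace& ws, const std::vector<int>& done) {
  ws.finished.clear();  // std::vector::clear() leaves capacity unchanged
  ws.outputs.clear();
  for (int id : done) {
    ws.finished.push_back(id);
    ws.outputs.push_back("delta for " + std::to_string(id));
  }
  return ws.outputs.size();
}

// Usage: the second step reuses the storage grown during the first.
inline void DemoWorkspaceReuse() {
  SketchWorkspace ws;
  PostProcessStep(ws, {1, 2, 3});
  std::size_t cap = ws.finished.capacity();
  PostProcessStep(ws, {7});
  assert(ws.finished.capacity() == cap);  // no reallocation for the smaller step
}
```

Keeping the buffers in the long-lived state object means that, after the first few steps, later steps mostly write into storage that already exists.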

Class: EngineStateObj

Inherits from tvm::runtime::Object and serves as the central mutable state container for the serving engine.

Public Members

| Member | Type | Description |
|---|---|---|
| running_queue | std::vector<Request> | Requests currently being processed (actively generating tokens) |
| waiting_queue | std::vector<Request> | Requests queued but not yet started processing |
| request_states | std::unordered_map<String, RequestState> | Map from request string ID to its full state |
| id_manager | EngineInternalIDManager | Internal ID allocation manager |
| metrics | EngineMetrics | Runtime performance metrics |
| prefix_cache | PrefixCache | Prefix cache for sharing KV cache across requests with common prefixes |
| running_rsentries_changed | bool | Flag indicating if the running request state entry list has been modified (default: true) |
| spec_draft_length | int | Current speculative decoding draft length; may change dynamically in auto-spec mode. Value 0 means undefined. |
| disaggregation | bool | Flag indicating disaggregated inference mode |
| request_stream_callback_ | FRequestStreamCallback | The callback function for streaming output tokens |
| postproc_workspace | ActionPostProcessWorkspace | Pre-allocated workspace for action post-processing |

Public Methods

| Method | Description |
|---|---|
| Reset() | Resets the entire engine state and clears all metrics. |
| GetRequestState(Request request) | Retrieves the RequestState for a given request. |
| GetRunningRequestStateEntries() | Returns a const reference to the cached list of running request state entries. Uses the running_rsentries_changed flag to avoid redundant recomputation. |

Private Members

std::vector<RequestStateEntry> cached_running_rsentries_;

A cached vector of running request state entries. This is recomputed only when running_rsentries_changed is true, providing an optimization for repeated access during a single engine step.
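The dirty-flag caching described above can be sketched as follows. This is a simplified illustration, not the real method body: SketchEngineState is a hypothetical type, integers stand in for RequestStateEntry, and recompute_count exists only to make the caching observable.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for the caching part of EngineStateObj.
struct SketchEngineState {
  std::vector<std::vector<int>> running;  // entries of each running request
  bool running_rsentries_changed = true;  // same default as documented
  int recompute_count = 0;                // instrumentation for this sketch only

  // Returns the flattened entry list, rebuilding it only when flagged dirty.
  const std::vector<int>& GetRunningEntries() {
    if (running_rsentries_changed) {
      cached_.clear();
      for (const auto& entries : running) {
        cached_.insert(cached_.end(), entries.begin(), entries.end());
      }
      running_rsentries_changed = false;
      ++recompute_count;
    }
    return cached_;
  }

 private:
  std::vector<int> cached_;  // plays the role of cached_running_rsentries_
};

// Usage: two reads in a row trigger a single rebuild.
inline void DemoCachedRead() {
  SketchEngineState state;
  state.running.push_back({1, 2});
  state.running.push_back({3});
  assert(state.GetRunningEntries().size() == 3);
  assert(state.GetRunningEntries().size() == 3);
  assert(state.recompute_count == 1);
}
```

The correctness of this pattern depends on every code path that modifies the running set also setting the flag; a missed update would return stale entries.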

TVM Object Registration

static constexpr const bool _type_has_method_sequal_reduce = false;
static constexpr const bool _type_has_method_shash_reduce = false;
static constexpr const bool _type_mutable = true;
TVM_FFI_DECLARE_OBJECT_INFO_FINAL("mlc.serve.EngineState", EngineStateObj, Object);

The object is registered as mutable (_type_mutable = true) and does not support structural equality or hashing, which is appropriate for a stateful runtime object.

Class: EngineState

class EngineState : public ObjectRef {
 public:
  explicit EngineState();
  TVM_FFI_DEFINE_OBJECT_REF_METHODS_NOTNULLABLE(EngineState, ObjectRef, EngineStateObj);
};

The managed reference type for EngineStateObj. Defined as non-nullable, meaning an EngineState reference always points to a valid object.

Design Notes

  • The dual-queue architecture (running_queue and waiting_queue) enables the engine to manage request scheduling with preemption support: requests can be moved from running back to waiting when memory pressure requires it.
  • The running_rsentries_changed flag and cached_running_rsentries_ form a simple caching mechanism to avoid recomputing the flat list of running request state entries on every access.
  • The spec_draft_length being stored in the engine state (rather than the config) allows adaptive speculative decoding, where the number of draft tokens generated can be tuned dynamically based on acceptance rates.
  • The ActionPostProcessWorkspace is explicitly allocated as part of the state to avoid per-step allocation overhead in the hot path of token generation.
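The movement of requests between the two queues can be sketched as follows. The policy shown is an assumption made for illustration (admit the oldest waiting request, preempt the most recently admitted one and place it at the front of the waiting queue); SketchQueues, AdmitOne, and PreemptLast are hypothetical names, and strings stand in for Request objects.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the queue-related part of the engine state.
struct SketchQueues {
  std::vector<std::string> running_queue;
  std::vector<std::string> waiting_queue;
  bool running_rsentries_changed = true;

  // Moves the oldest waiting request into the running queue.
  bool AdmitOne() {
    if (waiting_queue.empty()) return false;
    running_queue.push_back(waiting_queue.front());
    waiting_queue.erase(waiting_queue.begin());
    running_rsentries_changed = true;  // invalidate any cached entry list
    return true;
  }

  // Moves the most recently admitted request back to the front of waiting,
  // as a scheduler might do under memory pressure.
  bool PreemptLast() {
    if (running_queue.empty()) return false;
    waiting_queue.insert(waiting_queue.begin(), running_queue.back());
    running_queue.pop_back();
    running_rsentries_changed = true;
    return true;
  }
};

// Usage: admit two requests, then preempt the newer one.
inline void DemoPreemption() {
  SketchQueues q;
  q.waiting_queue = {"a", "b"};
  q.AdmitOne();
  q.AdmitOne();
  q.PreemptLast();
  assert(q.running_queue.size() == 1 && q.running_queue[0] == "a");
  assert(q.waiting_queue.size() == 1 && q.waiting_queue[0] == "b");
}
```

Both transitions set running_rsentries_changed, which is how the queue design and the cached entry list in the second note fit together.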
