Implementation:Mlc ai Mlc llm Engine State

Overview

File: cpp/serve/engine_state.h

Purpose: Defines the core runtime state of the MLC LLM serving engine. The EngineStateObj class holds all mutable state required during inference -- including the running and waiting request queues, request states, internal ID management, runtime metrics, prefix cache, and speculative decoding configuration. This state object is passed to all engine actions and updated throughout the serving pipeline.

Namespace: mlc::llm::serve

Type Alias

typedef TypedFunction<void(Array<RequestStreamOutput>)> FRequestStreamCallback;

Defines the callback function type for streaming request outputs. The callback receives an array of RequestStreamOutput objects and is invoked during action post-processing to deliver generated tokens to the caller.
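The callback contract described above can be illustrated with a minimal sketch. It uses `std::function` and `std::vector` as stand-ins for TVM's `TypedFunction` and `Array`, and the names `StreamOutputSketch`, `StreamCallbackSketch`, `TokenCollector`, and `DeliverSketch` are hypothetical, introduced only for this example; the fields of the real `RequestStreamOutput` are not shown in this page.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for RequestStreamOutput (real fields are an assumption).
struct StreamOutputSketch {
  std::string request_id;
  std::vector<int> delta_token_ids;
};

// Stand-in for FRequestStreamCallback: one call receives a batch of outputs.
using StreamCallbackSketch = std::function<void(std::vector<StreamOutputSketch>)>;

// A caller-side consumer that accumulates every streamed token.
struct TokenCollector {
  std::vector<int> tokens;

  StreamCallbackSketch AsCallback() {
    return [this](std::vector<StreamOutputSketch> outputs) {
      for (const auto& out : outputs) {
        tokens.insert(tokens.end(), out.delta_token_ids.begin(),
                      out.delta_token_ids.end());
      }
    };
  }
};

// Engine-side delivery: invoke the callback once per step, skipping empty batches.
void DeliverSketch(const StreamCallbackSketch& callback,
                   std::vector<StreamOutputSketch> outputs) {
  assert(callback != nullptr);  // a callback must have been installed
  if (!outputs.empty()) callback(std::move(outputs));
}
```

Batching all outputs of a step into one call, as the array parameter suggests, keeps the number of callback invocations independent of the number of requests that produced tokens in that step.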

Struct: EngineInternalIDManager

struct EngineInternalIDManager {
  std::vector<int64_t> available_ids;
  int64_t id_cnt = 0;

  int64_t GetNewId();
  void RecycleId(int64_t id);
  void Reset();
};

Manages the internal integer IDs assigned to requests within the engine. This is separate from the user-facing string request IDs.

Method	Description
`GetNewId()`	Returns an unused ID. Reuses a recycled ID from `available_ids` when one is available; otherwise hands out a fresh ID from the `id_cnt` counter and advances the counter.
`RecycleId(int64_t id)`	Returns an ID to the pool of available IDs for reuse.
`Reset()`	Clears all available IDs and resets the counter to zero.

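Recycling IDs as requests complete keeps the internal ID space compact: `id_cnt` grows only when no recycled ID is available, so it is bounded by the peak number of IDs in use at once rather than by the total number of requests served during a long-running session.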
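A minimal sketch of how such an ID pool might behave is given below. The struct name `IdManagerSketch` is hypothetical, and the method bodies are an assumption written to match the behavior described in the table; in particular, reusing the most recently recycled ID first is a choice made for this sketch, not something the declarations above guarantee.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of an ID pool with recycling (bodies are assumptions, see lead-in).
struct IdManagerSketch {
  std::vector<int64_t> available_ids;  // IDs handed back by finished requests
  int64_t id_cnt = 0;                  // next never-used ID

  int64_t GetNewId() {
    if (!available_ids.empty()) {
      // Prefer a recycled ID so the counter does not grow.
      int64_t id = available_ids.back();
      available_ids.pop_back();
      return id;
    }
    return id_cnt++;  // no recycled ID: hand out a fresh one
  }

  void RecycleId(int64_t id) {
    assert(id >= 0 && id < id_cnt);  // only previously issued IDs can come back
    available_ids.push_back(id);
  }

  void Reset() {
    available_ids.clear();
    id_cnt = 0;
  }
};
```

With this sketch, issuing two IDs, recycling the first, and issuing two more yields the recycled ID before a fresh one, so the counter advances only when the pool is empty.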

Struct: ActionPostProcessWorkspace

struct ActionPostProcessWorkspace {
  std::vector<RequestStateEntry> finished_rsentries;
  Array<RequestStreamOutput> callback_delta_outputs;
};

Pre-allocated workspace used during action post-processing to avoid repeated memory allocation and deallocation. Contains:

finished_rsentries: Temporary storage for request state entries that have completed generation.
callback_delta_outputs: Temporary storage for stream outputs to be sent via the callback.
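The benefit of keeping these buffers alive across steps can be sketched as follows. `WorkspaceSketch` and `PostProcessStepSketch` are hypothetical names, and plain `std::vector` containers of `int` and `std::string` stand in for the real `RequestStateEntry` and `Array<RequestStreamOutput>` types; how the engine actually fills the workspace is not shown in this page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in workspace: the buffers live as long as the engine state does.
struct WorkspaceSketch {
  std::vector<int> finished_entries;       // stand-in for finished_rsentries
  std::vector<std::string> delta_outputs;  // stand-in for callback_delta_outputs
};

// One simulated post-processing step. clear() drops the elements but keeps the
// vector's capacity, so later steps can refill the buffers without reallocating
// (as long as they do not exceed the capacity reached earlier).
void PostProcessStepSketch(WorkspaceSketch* ws, int num_finished) {
  assert(ws != nullptr);
  ws->finished_entries.clear();
  ws->delta_outputs.clear();
  for (int i = 0; i < num_finished; ++i) {
    ws->finished_entries.push_back(i);
    ws->delta_outputs.push_back("delta-" + std::to_string(i));
  }
}
```

After a first step that handles eight entries, a second step that handles four reuses the storage already reserved, which is the effect the pre-allocated workspace is meant to achieve.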

Class: EngineStateObj

Inherits from tvm::runtime::Object and serves as the central mutable state container for the serving engine.

Public Members

Member	Type	Description
`running_queue`	`std::vector<Request>`	Requests currently being processed (actively generating tokens)
`waiting_queue`	`std::vector<Request>`	Requests queued but not yet started processing
`request_states`	`std::unordered_map<String, RequestState>`	Map from request string ID to its full state
`id_manager`	`EngineInternalIDManager`	Internal ID allocation manager
`metrics`	`EngineMetrics`	Runtime performance metrics
`prefix_cache`	`PrefixCache`	Prefix cache for sharing KV cache across requests with common prefixes
`running_rsentries_changed`	`bool`	Flag indicating if the running request state entry list has been modified (default: `true`)
`spec_draft_length`	`int`	Current speculative decoding draft length; may change dynamically in auto-spec mode. Value 0 means undefined.
`disaggregation`	`bool`	Flag indicating disaggregated inference mode
`request_stream_callback_`	`FRequestStreamCallback`	The callback function for streaming output tokens
`postproc_workspace`	`ActionPostProcessWorkspace`	Pre-allocated workspace for action post-processing

Public Methods

Method	Description
`Reset()`	Resets the entire engine state and clears all metrics.
`GetRequestState(Request request)`	Retrieves the `RequestState` for a given request.
`GetRunningRequestStateEntries()`	Returns a const reference to the cached list of running request state entries. Uses the `running_rsentries_changed` flag to avoid redundant recomputation.

Private Members

std::vector<RequestStateEntry> cached_running_rsentries_;

A cached vector of running request state entries. This is recomputed only when running_rsentries_changed is true, providing an optimization for repeated access during a single engine step.
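The caching behavior can be pictured with a small dirty-flag sketch. The struct `RunningEntryCacheSketch`, its `AddRunning` helper, and the `rebuild_count` counter are hypothetical and exist only to make the behavior observable; integers stand in for `RequestStateEntry`, and the assumption that the getter itself clears the flag after rebuilding is not confirmed by this page.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct RunningEntryCacheSketch {
  // Each running request owns one or more state entries (ints as stand-ins).
  std::vector<std::vector<int>> running_requests;
  bool running_rsentries_changed = true;  // starts dirty, as in the member table
  int rebuild_count = 0;                  // instrumentation for this sketch only

  // Returns the flattened entry list, rebuilding it only when marked dirty.
  const std::vector<int>& GetRunningEntries() {
    if (running_rsentries_changed) {
      cached_.clear();
      for (const auto& req : running_requests) {
        cached_.insert(cached_.end(), req.begin(), req.end());
      }
      running_rsentries_changed = false;  // assumption: the getter clears the flag
      ++rebuild_count;
    }
    assert(!running_rsentries_changed);
    return cached_;
  }

  // Any mutation of the running set must mark the cache dirty.
  void AddRunning(std::vector<int> entries) {
    running_requests.push_back(std::move(entries));
    running_rsentries_changed = true;
  }

 private:
  std::vector<int> cached_;
};
```

The pattern only stays correct if every code path that changes the running set also sets the flag, which is presumably why the flag is a public member that engine actions can write to.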

TVM Object Registration

static constexpr const bool _type_has_method_sequal_reduce = false;
static constexpr const bool _type_has_method_shash_reduce = false;
static constexpr const bool _type_mutable = true;
TVM_FFI_DECLARE_OBJECT_INFO_FINAL("mlc.serve.EngineState", EngineStateObj, Object);

The object is registered as mutable (_type_mutable = true) and does not support structural equality or hashing, which is appropriate for a stateful runtime object.

Class: EngineState

class EngineState : public ObjectRef {
 public:
  explicit EngineState();
  TVM_FFI_DEFINE_OBJECT_REF_METHODS_NOTNULLABLE(EngineState, ObjectRef, EngineStateObj);
};

The managed reference type for EngineStateObj. Defined as non-nullable, meaning an EngineState reference always points to a valid object.

Design Notes

The dual-queue architecture (running_queue and waiting_queue) enables the engine to manage request scheduling with preemption support -- requests can be moved from running back to waiting when memory pressure requires it.
The running_rsentries_changed flag and cached_running_rsentries_ form a simple caching mechanism to avoid recomputing the flat list of running request state entries on every access.
The spec_draft_length being stored in the engine state (rather than the config) allows adaptive speculative decoding, where the number of draft tokens generated can be tuned dynamically based on acceptance rates.
The ActionPostProcessWorkspace is explicitly allocated as part of the state to avoid per-step allocation overhead in the hot path of token generation.
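
The preemption path mentioned in the first note can be sketched as a move between the two queues. `QueuesSketch` and `PreemptLastRunningSketch` are hypothetical names, strings stand in for `Request` objects, and the policy shown (preempt the most recently admitted request and place it at the front of the waiting queue) is an assumption for illustration; the engine's actual preemption logic lives in its actions and is not described in this page.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct QueuesSketch {
  std::vector<std::string> running_queue;  // stand-ins for Request objects
  std::vector<std::string> waiting_queue;
  bool running_rsentries_changed = true;
};

// Moves the last running request back to the head of the waiting queue.
// Returns false when there is nothing to preempt.
bool PreemptLastRunningSketch(QueuesSketch* q) {
  assert(q != nullptr);
  if (q->running_queue.empty()) return false;
  std::string victim = std::move(q->running_queue.back());
  q->running_queue.pop_back();
  // Head of the waiting queue, so the preempted request is resumed first.
  q->waiting_queue.insert(q->waiting_queue.begin(), std::move(victim));
  // The running set changed, so any cached entry list must be rebuilt.
  q->running_rsentries_changed = true;
  return true;
}
```

Note how the sketch ties two of the design notes together: changing `running_queue` is exactly the kind of mutation that has to set `running_rsentries_changed`.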
