Implementation:Mlc ai Mlc llm Engine State
Overview
File: cpp/serve/engine_state.h
Purpose: Defines the core runtime state of the MLC LLM serving engine. The EngineStateObj class holds all mutable state required during inference -- including the running and waiting request queues, request states, internal ID management, runtime metrics, prefix cache, and speculative decoding configuration. This state object is passed to all engine actions and updated throughout the serving pipeline.
Namespace: mlc::llm::serve
Type Alias
typedef TypedFunction<void(Array<RequestStreamOutput>)> FRequestStreamCallback;
Defines the callback function type for streaming request outputs. The callback receives an array of RequestStreamOutput objects and is invoked during action post-processing to deliver generated tokens to the caller.
Struct: EngineInternalIDManager
struct EngineInternalIDManager {
std::vector<int64_t> available_ids;
int64_t id_cnt = 0;
int64_t GetNewId();
void RecycleId(int64_t id);
void Reset();
};
Manages the internal integer IDs assigned to requests within the engine. This is separate from the user-facing string request IDs.
| Method | Description |
|---|---|
| GetNewId() | Returns an unused ID. First attempts to reuse a recycled ID from available_ids; otherwise increments id_cnt and returns a new ID. |
| RecycleId(int64_t id) | Returns an ID to the pool of available IDs for reuse. |
| Reset() | Clears all available IDs and resets the counter to zero. |
The ID recycling mechanism ensures efficient reuse of internal IDs as requests complete, preventing unbounded ID growth during long-running serving sessions.
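The recycling behavior described in the table can be sketched as follows. This is a minimal illustration, not the code from engine_state.h: the method bodies are assumptions derived from the descriptions above, including the choice to start IDs at 0, to hand back the most recently recycled ID first, and the range check in RecycleId.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch only: member names follow the struct shown above, but the method
// bodies are assumptions based on the documented behavior.
struct EngineInternalIDManager {
  std::vector<int64_t> available_ids;  // IDs returned by finished requests
  int64_t id_cnt = 0;                  // number of distinct IDs handed out so far

  int64_t GetNewId() {
    // Prefer a recycled ID (assumption: last recycled is reused first).
    if (!available_ids.empty()) {
      int64_t id = available_ids.back();
      available_ids.pop_back();
      return id;
    }
    // Otherwise hand out a fresh ID and advance the counter.
    return id_cnt++;
  }

  void RecycleId(int64_t id) {
    // Sanity check added for this sketch: only previously issued IDs.
    assert(id >= 0 && id < id_cnt);
    available_ids.push_back(id);
  }

  void Reset() {
    available_ids.clear();
    id_cnt = 0;
  }
};
```

With this sketch, id_cnt grows only when no recycled ID is available, so it is bounded by the peak number of IDs in use at the same time rather than by the total number of requests served.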
Struct: ActionPostProcessWorkspace
struct ActionPostProcessWorkspace {
std::vector<RequestStateEntry> finished_rsentries;
Array<RequestStreamOutput> callback_delta_outputs;
};
Pre-allocated workspace used during action post-processing to avoid repeated memory allocation and deallocation. Contains:
- finished_rsentries: Temporary storage for request state entries that have completed generation.
- callback_delta_outputs: Temporary storage for stream outputs to be sent via the callback.
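The benefit of keeping such buffers alive between steps can be illustrated with a small sketch. It uses std::vector<std::string> as a hypothetical stand-in for both members (the real workspace holds RequestStateEntry objects and a TVM Array), and the BeginStep helper is an assumption introduced here for illustration, not a method of the actual struct.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ActionPostProcessWorkspace.
struct WorkspaceSketch {
  std::vector<std::string> finished_rsentries;
  std::vector<std::string> callback_delta_outputs;

  // Hypothetical helper: empty both buffers at the start of a step.
  // std::vector::clear() destroys the elements but leaves capacity()
  // unchanged, so refilling the buffers does not need a new allocation
  // until the previous high-water mark is exceeded.
  void BeginStep() {
    finished_rsentries.clear();
    callback_delta_outputs.clear();
  }
};
```

Because the workspace lives as long as the engine state, the buffer storage is allocated once and then reused across steps.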
Class: EngineStateObj
Inherits from tvm::runtime::Object and serves as the central mutable state container for the serving engine.
Public Members
| Member | Type | Description |
|---|---|---|
| running_queue | std::vector<Request> | Requests currently being processed (actively generating tokens) |
| waiting_queue | std::vector<Request> | Requests queued but not yet started processing |
| request_states | std::unordered_map<String, RequestState> | Map from request string ID to its full state |
| id_manager | EngineInternalIDManager | Internal ID allocation manager |
| metrics | EngineMetrics | Runtime performance metrics |
| prefix_cache | PrefixCache | Prefix cache for sharing KV cache across requests with common prefixes |
| running_rsentries_changed | bool | Flag indicating if the running request state entry list has been modified (default: true) |
| spec_draft_length | int | Current speculative decoding draft length; may change dynamically in auto-spec mode. Value 0 means undefined. |
| disaggregation | bool | Flag indicating disaggregated inference mode |
| request_stream_callback_ | FRequestStreamCallback | The callback function for streaming output tokens |
| postproc_workspace | ActionPostProcessWorkspace | Pre-allocated workspace for action post-processing |
Public Methods
| Method | Description |
|---|---|
| Reset() | Resets the entire engine state and clears all metrics. |
| GetRequestState(Request request) | Retrieves the RequestState for a given request. |
| GetRunningRequestStateEntries() | Returns a const reference to the cached list of running request state entries. Uses the running_rsentries_changed flag to avoid redundant recomputation. |
Private Members
std::vector<RequestStateEntry> cached_running_rsentries_;
A cached vector of running request state entries. This is recomputed only when running_rsentries_changed is true, providing an optimization for repeated access during a single engine step.
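The flag-guarded cache can be sketched as below. This is a simplified illustration under stated assumptions: Request and RequestStateEntry are hypothetical stand-in structs, each request carries its entries directly (the real engine looks them up through request_states), and rebuild_count is a counter added only to make the caching observable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real TVM-managed types.
struct RequestStateEntry {
  std::string request_id;
  int entry_index;
};
struct Request {
  std::string id;
  std::vector<RequestStateEntry> entries;
};

class EngineStateSketch {
 public:
  std::vector<Request> running_queue;
  bool running_rsentries_changed = true;  // starts dirty, as in the member table
  int rebuild_count = 0;                  // illustration only

  const std::vector<RequestStateEntry>& GetRunningRequestStateEntries() {
    if (running_rsentries_changed) {
      // Rebuild the flat list from every running request.
      cached_running_rsentries_.clear();
      for (const Request& request : running_queue) {
        for (const RequestStateEntry& entry : request.entries) {
          cached_running_rsentries_.push_back(entry);
        }
      }
      running_rsentries_changed = false;
      ++rebuild_count;
    }
    return cached_running_rsentries_;
  }

 private:
  std::vector<RequestStateEntry> cached_running_rsentries_;
};
```

In this sketch, any code that modifies running_queue must set running_rsentries_changed to true itself; otherwise the getter keeps returning the stale list.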
TVM Object Registration
static constexpr const bool _type_has_method_sequal_reduce = false;
static constexpr const bool _type_has_method_shash_reduce = false;
static constexpr const bool _type_mutable = true;
TVM_FFI_DECLARE_OBJECT_INFO_FINAL("mlc.serve.EngineState", EngineStateObj, Object);
The object is registered as mutable (_type_mutable = true) and does not support structural equality or hashing, which is appropriate for a stateful runtime object.
Class: EngineState
class EngineState : public ObjectRef {
public:
explicit EngineState();
TVM_FFI_DEFINE_OBJECT_REF_METHODS_NOTNULLABLE(EngineState, ObjectRef, EngineStateObj);
};
The managed reference type for EngineStateObj. Defined as non-nullable, meaning an EngineState reference always points to a valid object.
Design Notes
- The dual-queue architecture (running_queue and waiting_queue) enables the engine to manage request scheduling with preemption support -- requests can be moved from running back to waiting when memory pressure requires it.
- The running_rsentries_changed flag and cached_running_rsentries_ form a simple caching mechanism to avoid recomputing the flat list of running request state entries on every access.
- The spec_draft_length being stored in the engine state (rather than the config) allows adaptive speculative decoding, where the number of draft tokens generated can be tuned dynamically based on acceptance rates.
- The ActionPostProcessWorkspace is explicitly allocated as part of the state to avoid per-step allocation overhead in the hot path of token generation.
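
The interaction between the two queues and the cache flag during preemption can be sketched as follows. Everything here is hypothetical: requests are reduced to plain strings, PreemptLastRunningRequest is a helper name invented for this illustration, and the policy (preempt the most recently started request and put it at the front of the waiting queue) is an assumption rather than a description of the engine's actual scheduler.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified state: requests are plain strings here.
struct QueueStateSketch {
  std::vector<std::string> running_queue;
  std::vector<std::string> waiting_queue;
  bool running_rsentries_changed = true;
};

// Hypothetical helper: move the last running request to the front of the
// waiting queue so that it is the first candidate to be resumed.
// Returns false when there is nothing to preempt.
inline bool PreemptLastRunningRequest(QueueStateSketch* state) {
  if (state->running_queue.empty()) return false;
  std::string victim = state->running_queue.back();
  state->running_queue.pop_back();
  state->waiting_queue.insert(state->waiting_queue.begin(), victim);
  // The set of running entries changed, so the cached list must be rebuilt.
  state->running_rsentries_changed = true;
  return true;
}
```

The point of the sketch is that every change to running_queue has to mark the cached entry list as dirty, which is why the flag lives next to the queues in the same state object.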