Implementation:Mlc ai Mlc llm Engine Interface

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	LLM Serving, Engine Architecture, Request Management
Last Updated	2026-02-09 19:00 GMT

Overview

The Engine Interface defines the abstract base class for the MLC LLM serving engine. It establishes the contract for request-based text generation with either a single LLM or multiple LLMs (the latter for speculative decoding). The engine manages each request's lifecycle from submission through generation to completion and returns results through a request stream callback.

Description

This header file (cpp/serve/engine.h) declares the Engine class as a pure abstract interface organized into four categories of operations:

Engine Management:

Create -- Static factory method that constructs an engine from a JSON config string, a device specification, a stream callback function, and an optional trace recorder. Returns a Result<EngineCreationOutput> containing the engine instance, the completed engine config, and the default generation config, or an error if creation fails.
Reset -- Clears all running data and resets metrics.
Empty -- Returns whether the engine has no requests left to process.
GetRequestStreamCallback / SetRequestStreamCallback -- Accessor and mutator for the stream callback function.
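The management operations above can be illustrated with a minimal, self-contained C++ sketch. It is not MLC LLM code: ToyManagedEngine, ToyStreamOutput, and ToyStreamCallback are hypothetical stand-ins, and the callback is assumed to receive a batch of outputs per call. It shows Empty() reporting whether any request is waiting or running, Reset() clearing request state and a toy metric, and the callback being swapped at runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for RequestStreamOutput: a request ID plus new text.
struct ToyStreamOutput {
  std::string request_id;
  std::string delta_text;
};

// Like the engine's stream callback, this one receives a batch of outputs.
using ToyStreamCallback = std::function<void(const std::vector<ToyStreamOutput>&)>;

// Toy engine illustrating only the management operations (not MLC LLM code).
class ToyManagedEngine {
 public:
  explicit ToyManagedEngine(ToyStreamCallback cb) : callback_(std::move(cb)) {}

  // Empty(): true when no request is waiting or running.
  bool Empty() const { return waiting_.empty() && running_.empty(); }

  // Reset(): drop all request state and zero the metrics.
  void Reset() {
    waiting_.clear();
    running_.clear();
    num_requests_added_ = 0;
  }

  // Accessor and mutator for the stream callback.
  ToyStreamCallback GetRequestStreamCallback() const { return callback_; }
  void SetRequestStreamCallback(ToyStreamCallback cb) { callback_ = std::move(cb); }

  // Minimal submission so the state above can be exercised.
  void AddRequest(const std::string& id) {
    waiting_.push_back(id);
    ++num_requests_added_;
  }

  // A single toy metric, cleared by Reset().
  std::size_t num_requests_added() const { return num_requests_added_; }

 private:
  ToyStreamCallback callback_;
  std::vector<std::string> waiting_;
  std::vector<std::string> running_;  // stays empty: this sketch has no Step()
  std::size_t num_requests_added_ = 0;
};
```

Returning the callback by value from the accessor keeps the engine's copy intact if the caller later replaces it through the mutator.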

Request Management:

AddRequest -- Submits a new Request to the engine's queue.
AbortRequest -- Cancels a specific request by its ID string.
AbortAllRequests -- Cancels all pending and running requests.
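A minimal sketch of the abort semantics, under stated assumptions: ToyAbortEngine and ToyFinish are hypothetical, an aborted request is assumed to be reported through the callback with finish reason "abort" (the default reason used by AbortRequestImpl), and aborting an unknown ID is assumed to be a no-op. StartFirst() is a hypothetical stand-in for prefill that moves a request from the waiting queue to the running set.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-request output carrying a finish reason.
struct ToyFinish {
  std::string request_id;
  std::string finish_reason;
};

class ToyAbortEngine {
 public:
  using Callback = std::function<void(const std::vector<ToyFinish>&)>;
  explicit ToyAbortEngine(Callback cb) : callback_(std::move(cb)) {}

  void AddRequest(const std::string& id) { waiting_.push_back(id); }

  // Stand-in for prefill: move the oldest waiting request to the running set.
  void StartFirst() {
    if (waiting_.empty()) return;
    running_.push_back(waiting_.front());
    waiting_.erase(waiting_.begin());
  }

  // Cancel one request by ID; unknown IDs are ignored (an assumption).
  void AbortRequest(const std::string& id) {
    if (Erase(waiting_, id) || Erase(running_, id)) {
      callback_({ToyFinish{id, "abort"}});
    }
  }

  // Cancel every waiting and running request, reporting each one.
  void AbortAllRequests() {
    std::vector<ToyFinish> finished;
    for (const auto* queue : {&waiting_, &running_}) {
      for (const auto& id : *queue) finished.push_back({id, "abort"});
    }
    waiting_.clear();
    running_.clear();
    if (!finished.empty()) callback_(finished);
  }

  bool Empty() const { return waiting_.empty() && running_.empty(); }

 private:
  // Remove `id` from `queue`; returns whether it was present.
  static bool Erase(std::vector<std::string>& queue, const std::string& id) {
    auto it = std::find(queue.begin(), queue.end(), id);
    if (it == queue.end()) return false;
    queue.erase(it);
    return true;
  }

  Callback callback_;
  std::vector<std::string> waiting_;
  std::vector<std::string> running_;
};
```

Reporting aborts through the same callback as normal completions means a client needs only one code path to learn that a request has ended.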

Engine Action:

Step -- The core function driving the engine's main loop. Each invocation may perform prefill for new requests, decode for running requests, or other actions. After certain actions (e.g., decode), the engine checks for finished requests and delivers results via the callback.
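To make the Step() loop concrete, here is a simplified C++ sketch. It is an assumption-laden toy, not the engine's actual scheduling policy: ToyStepEngine, ToyRequest, and ToyOutput are hypothetical. Each call performs one action (prefill of all waiting requests if any exist, otherwise one decode token per running request), then retires requests that reached their hypothetical max_tokens limit and delivers progress through the callback.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct ToyRequest {
  std::string id;
  std::size_t max_tokens;     // hypothetical stopping criterion
  std::size_t generated = 0;  // tokens produced so far
};

struct ToyOutput {
  std::string request_id;
  std::size_t num_tokens;
  bool finished;
};

class ToyStepEngine {
 public:
  using Callback = std::function<void(const std::vector<ToyOutput>&)>;
  explicit ToyStepEngine(Callback cb) : callback_(std::move(cb)) {}

  void AddRequest(ToyRequest r) { waiting_.push_back(std::move(r)); }
  bool Empty() const { return waiting_.empty() && running_.empty(); }

  // One scheduling step: prefill if anything is waiting, otherwise decode.
  void Step() {
    if (!waiting_.empty()) {
      // "Prefill": admit every waiting request and produce its first token.
      for (auto& r : waiting_) {
        r.generated = 1;
        running_.push_back(std::move(r));
      }
      waiting_.clear();
    } else {
      // "Decode": every running request produces one more token.
      for (auto& r : running_) ++r.generated;
    }
    // Report progress for all running requests and retire finished ones.
    std::vector<ToyOutput> outputs;
    std::vector<ToyRequest> still_running;
    for (auto& r : running_) {
      bool done = r.generated >= r.max_tokens;
      outputs.push_back({r.id, r.generated, done});
      if (!done) still_running.push_back(std::move(r));
    }
    running_ = std::move(still_running);
    if (!outputs.empty()) callback_(outputs);
  }

 private:
  Callback callback_;
  std::vector<ToyRequest> waiting_;
  std::vector<ToyRequest> running_;
};
```

With one request limited to 1 token and another to 3, the loop `while (!engine.Empty()) engine.Step();` runs three steps: one prefill step, then two decode steps for the longer request.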

Debug/Profile:

JSONMetrics -- Returns internal engine metrics as a JSON string.
DebugCallFuncOnAllAllWorker -- Invokes a named global function on all workers (debug only).
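As a rough illustration of what a JSONMetrics()-style method returns, the sketch below serializes a few counters into a flat JSON object using only the standard library. The struct ToyEngineMetrics and its field names are hypothetical; the real engine defines its own metric set and format.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical metric fields; the real engine tracks its own set.
struct ToyEngineMetrics {
  std::int64_t num_prefill_tokens = 0;
  std::int64_t num_decode_tokens = 0;
  std::int64_t num_finished_requests = 0;

  // Serialize to a flat JSON object string.
  std::string ToJSON() const {
    std::ostringstream os;
    os << "{\"num_prefill_tokens\":" << num_prefill_tokens
       << ",\"num_decode_tokens\":" << num_decode_tokens
       << ",\"num_finished_requests\":" << num_finished_requests << "}";
    return os.str();
  }
};
```

Returning a JSON string rather than a structured object keeps the interface language-neutral, so a caller on the other side of an FFI boundary can parse it with any JSON library.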

The header also defines:

EngineCreationOutput: A struct bundling the created engine (as std::unique_ptr<Engine>), the completed EngineConfig, and the default GenerationConfig.
AbortRequestImpl: A free function implementing request abortion logic that can be called from engine action implementations.
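The Create() / EngineCreationOutput pattern (a factory returning either an owned engine plus a completed config, or an error) can be sketched as follows. Everything here is hypothetical: ToyResult is a minimal std::variant-based stand-in for the Result type, and CreateToyEngine "completes" the config by replacing an unset (zero) max_num_sequence with 4 and rejecting negative values.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <variant>

// Minimal Result<T>: holds either a value or an error message.
template <typename T>
class ToyResult {
 public:
  static ToyResult Ok(T value) { return ToyResult(std::move(value)); }
  static ToyResult Error(std::string msg) { return ToyResult(ErrorTag{std::move(msg)}); }
  bool IsOk() const { return std::holds_alternative<T>(data_); }
  T Unwrap() { return std::move(std::get<T>(data_)); }  // precondition: IsOk()
  const std::string& UnwrapErr() const { return std::get<ErrorTag>(data_).msg; }

 private:
  struct ErrorTag {
    std::string msg;
  };
  explicit ToyResult(T value) : data_(std::move(value)) {}
  explicit ToyResult(ErrorTag err) : data_(std::move(err)) {}
  std::variant<T, ErrorTag> data_;
};

// Abstract interface plus one concrete implementation, as with Engine.
class ToyEngine {
 public:
  virtual ~ToyEngine() = default;
  virtual bool Empty() = 0;
};

class ToyEngineImpl : public ToyEngine {
 public:
  bool Empty() override { return true; }
};

struct ToyEngineConfig {
  int max_num_sequence = 0;  // 0 means "let the factory choose"
};

// Mirrors the shape of EngineCreationOutput: owned engine plus completed config.
struct ToyCreationOutput {
  std::unique_ptr<ToyEngine> reloaded_engine;
  ToyEngineConfig completed_engine_config;
};

// Hypothetical factory: rejects negative values, fills in an unset (zero) field.
ToyResult<ToyCreationOutput> CreateToyEngine(ToyEngineConfig cfg) {
  if (cfg.max_num_sequence < 0) {
    return ToyResult<ToyCreationOutput>::Error("max_num_sequence must be >= 0");
  }
  if (cfg.max_num_sequence == 0) cfg.max_num_sequence = 4;
  return ToyResult<ToyCreationOutput>::Ok(
      ToyCreationOutput{std::make_unique<ToyEngineImpl>(), cfg});
}
```

Returning the completed config alongside the engine lets the caller see which defaults the factory inferred, while std::unique_ptr makes the caller the sole owner of the engine.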

Usage

The engine is the central orchestrator of the MLC LLM serving system:

An Engine is created via Engine::Create() with a JSON config specifying the models and serving parameters, together with the target device and a stream callback.
Clients submit requests via AddRequest().
A main loop repeatedly calls Step(), which drives prefill, decode, and other actions.
Results flow back through the FRequestStreamCallback as batches of RequestStreamOutput objects.
Requests can be aborted individually or collectively at any time.
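The client side of this flow (submit requests, drive Step() until Empty(), and reassemble streamed output per request) is sketched below. ToyStreamingEngine and ToyDelta are hypothetical; the toy "model" simply emits a canned reply one character per step, and each callback invocation is assumed to carry a batch of deltas.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical streamed chunk: newly generated text plus a finished flag.
struct ToyDelta {
  std::string request_id;
  std::string delta_text;
  bool finished;
};

// Toy engine whose "model" emits a canned reply one character per Step().
class ToyStreamingEngine {
 public:
  using Callback = std::function<void(const std::vector<ToyDelta>&)>;
  explicit ToyStreamingEngine(Callback cb) : callback_(std::move(cb)) {}

  // Empty replies are skipped, since this sketch emits one character per step.
  void AddRequest(const std::string& id, std::string reply) {
    if (reply.empty()) return;
    active_.push_back({id, std::move(reply), 0});
  }

  bool Empty() const { return active_.empty(); }

  // Emit the next character of every active request as one batch.
  void Step() {
    std::vector<ToyDelta> batch;
    std::vector<Active> remaining;
    for (auto& a : active_) {
      std::string piece(1, a.reply[a.pos++]);
      bool done = a.pos == a.reply.size();
      batch.push_back({a.id, piece, done});
      if (!done) remaining.push_back(std::move(a));
    }
    active_ = std::move(remaining);
    if (!batch.empty()) callback_(batch);
  }

 private:
  struct Active {
    std::string id;
    std::string reply;
    std::size_t pos;
  };
  Callback callback_;
  std::vector<Active> active_;
};
```

A client typically keeps per-request buffers keyed by request ID, for example a `std::map<std::string, std::string>` that the callback appends each `delta_text` to, and treats `finished` as the signal that a request's text is complete.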

Code Reference

Source Location

Property	Value
File	`cpp/serve/engine.h`
Namespace	`mlc::llm::serve`
Lines	126
Include Guard	`MLC_LLM_SERVE_ENGINE_H_`

Signature

namespace mlc {
namespace llm {
namespace serve {

class Engine;  // forward declaration; EngineCreationOutput refers to Engine

struct EngineCreationOutput {
  std::unique_ptr<Engine> reloaded_engine;
  EngineConfig completed_engine_config;
  GenerationConfig default_generation_cfg;
};

class Engine {
 public:
  virtual ~Engine() = default;

  // Engine Management
  static Result<EngineCreationOutput> Create(
      const std::string& engine_config_json_str,
      Device device,
      FRequestStreamCallback request_stream_callback,
      Optional<EventTraceRecorder> trace_recorder);

  virtual void Reset() = 0;
  virtual bool Empty() = 0;
  virtual FRequestStreamCallback GetRequestStreamCallback() = 0;
  virtual void SetRequestStreamCallback(FRequestStreamCallback request_stream_callback) = 0;

  // Request Management
  virtual void AddRequest(Request request) = 0;
  virtual void AbortRequest(const String& request_id) = 0;
  virtual void AbortAllRequests() = 0;

  // Engine Action
  virtual void Step() = 0;

  // Debug/Profile
  virtual String JSONMetrics() = 0;
  virtual void DebugCallFuncOnAllAllWorker(const String& func_name,
                                           Optional<String> func_args) = 0;
};

void AbortRequestImpl(EngineState estate, const Array<Model>& models,
                       const String& request_id, String finish_reason = "abort");

}  // namespace serve
}  // namespace llm
}  // namespace mlc

Import

#include "serve/engine.h"

Dependencies:

data.h for serving data types and RequestStreamOutput
engine_state.h for EngineState and EngineConfig
event_trace_recorder.h for EventTraceRecorder
request.h for the Request class
request_state.h for request state tracking

I/O Contract

Engine::Create

Direction	Name	Type	Description
Input	engine_config_json_str	`const std::string&`	JSON string specifying the models, KV cache configuration, and other serving parameters (the device is passed separately)
Input	device	`Device`	The device to run models on (e.g., GPU)
Input	request_stream_callback	`FRequestStreamCallback`	Callback function for streaming generation results
Input	trace_recorder	`Optional<EventTraceRecorder>`	Optional event trace recorder for profiling
Output	(return)	`Result<EngineCreationOutput>`	The engine, completed config, and default generation config, or an error

Engine::AddRequest

Direction	Name	Type	Description
Input	request	`Request`	A request object containing input data, generation config, and request ID

Engine::Step

Direction	Name	Type	Description
Side Effect	(callback)	`FRequestStreamCallback`	May invoke the stream callback with `RequestStreamOutput` for finished or in-progress requests

AbortRequestImpl

Direction	Name	Type	Description
Input	estate	`EngineState`	The current engine state to modify
Input	models	`const Array<Model>&`	Models whose KV caches may need cleanup
Input	request_id	`const String&`	ID of the request to abort
Input	finish_reason	`String`	Reason for abortion (default: "abort")

Usage Examples

Creating and running an engine:

#include "serve/engine.h"

// Define the stream callback; each invocation receives a batch of outputs
auto callback = [](Array<RequestStreamOutput> outputs) {
  for (const RequestStreamOutput& output : outputs) {
    // Process incremental generation results
    LOG(INFO) << "Got output for request: " << output->request_id;
  }
};

// Create the engine on the first CUDA GPU
std::string config_json = R"({"model": "dist/llama-2-7b-q4f16_1", ...})";
Device device{kDLCUDA, 0};
Result<EngineCreationOutput> result = Engine::Create(
    config_json, device, callback, Optional<EventTraceRecorder>());

if (result.IsOk()) {
  auto output = result.Unwrap();
  auto& engine = output.reloaded_engine;

  // Add a request
  Request request = /* construct request */;
  engine->AddRequest(request);

  // Main serving loop
  while (!engine->Empty()) {
    engine->Step();
  }
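} else {
  // Creation failed; report the error message
  LOG(FATAL) << "Engine creation failed: " << result.UnwrapErr();
}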

Aborting a request:

// Abort a specific request
engine->AbortRequest("req-12345");

// Abort all requests (e.g., during shutdown)
engine->AbortAllRequests();

Related Pages

Mlc_ai_Mlc_llm_Engine_Action - The action abstraction that implements Step behavior
Mlc_ai_Mlc_llm_Serve_Data_Header - Data types processed by the engine
Mlc_ai_Mlc_llm_Serve_Data - Data implementation used in the serving pipeline
Mlc_ai_Mlc_llm_OpenAI_API_Protocol_Header - API protocol for requests and responses
Mlc_ai_Mlc_llm_Draft_Token_Workspace - Workspace manager used in speculative decoding mode
Mlc_ai_Mlc_llm_Model_Metadata_Header - Model metadata consulted during engine creation
