Implementation:Mlc ai Mlc llm Threaded Engine Header
Overview
The ThreadedEngine header defines the abstract interface for the threaded serving engine in MLC LLM. Located at cpp/serve/threaded_engine.h, this file declares the ThreadedEngine class, which provides a thread-safe wrapper around the core LLM inference engine. The threaded engine runs a background request processing loop on a dedicated thread, allowing external threads to safely submit and abort requests while inference proceeds concurrently.
Purpose
The primary purpose of this header is to define a clean, virtual interface for multithreaded LLM serving. The ThreadedEngine class decouples request submission from request processing by running the engine loop in a background thread. This design enables:
- Concurrent request handling: External threads can call AddRequest and AbortRequest without blocking the inference pipeline.
- Lifecycle management: The engine supports initialization, reloading with new configurations, unloading, and resetting.
- Stream callback support: A separate background loop handles streaming responses back to callers.
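The sketch below illustrates this decoupling in a small, self-contained program. It does not use MLC LLM or TVM types. ToyEngine, its string-valued requests, and RunDemo are hypothetical names. They show only the pattern: one dedicated thread runs the loop, and another thread submits work and then asks the loop to exit. The sketch finishes already-queued requests before exiting. That is a choice made for the example, not documented behavior of the real engine.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a threaded engine; requests are plain strings.
class ToyEngine {
 public:
  // Runs in the calling thread and returns only after ExitBackgroundLoop().
  void RunBackgroundLoop() {
    std::unique_lock<std::mutex> lock(mu_);
    while (true) {
      // Sleep until there is work to do or an exit has been requested.
      cv_.wait(lock, [this] { return exit_ || !pending_.empty(); });
      if (pending_.empty()) break;  // only reachable when exit_ is set
      std::string req = std::move(pending_.front());
      pending_.pop_front();
      lock.unlock();
      // Placeholder for inference; it runs without holding the lock.
      std::string out = "done:" + req;
      lock.lock();
      finished_.push_back(std::move(out));
    }
    assert(pending_.empty());  // this sketch drains the queue before exiting
  }

  // Thread-safe: may be called from any thread.
  void AddRequest(std::string request) {
    {
      std::lock_guard<std::mutex> guard(mu_);
      pending_.push_back(std::move(request));
    }
    cv_.notify_one();
  }

  // Thread-safe: asks the loop to stop once its queue is empty.
  void ExitBackgroundLoop() {
    {
      std::lock_guard<std::mutex> guard(mu_);
      exit_ = true;
    }
    cv_.notify_one();
  }

  std::vector<std::string> Finished() {
    std::lock_guard<std::mutex> guard(mu_);
    return finished_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::string> pending_;
  std::vector<std::string> finished_;
  bool exit_ = false;
};

// Hypothetical driver: one loop thread; submission and shutdown come from the caller.
std::vector<std::string> RunDemo() {
  ToyEngine engine;
  std::thread loop([&engine] { engine.RunBackgroundLoop(); });
  engine.AddRequest("a");
  engine.AddRequest("b");
  engine.ExitBackgroundLoop();
  loop.join();
  return engine.Finished();
}
```

The key design point is that the loop does its work outside the lock. AddRequest only has to wait for the short critical section that touches the queue, never for the full duration of a processing step.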
Class Declaration
The ThreadedEngine class resides in namespace mlc::llm::serve and is declared as an abstract base class with pure virtual methods.
```cpp
class ThreadedEngine {
 public:
  // Factory: the way to obtain a concrete threaded engine.
  static std::unique_ptr<ThreadedEngine> Create();

  virtual ~ThreadedEngine() = default;

  // Initialization and lifecycle.
  virtual void InitThreadedEngine(Device device, Optional<Function> request_stream_callback,
                                  Optional<EventTraceRecorder> trace_recorder) = 0;
  virtual void Reload(String engine_config_json_str) = 0;
  virtual void Unload() = 0;
  virtual void Reset() = 0;

  // Background loops; each is meant to run on its own dedicated thread.
  virtual void RunBackgroundLoop() = 0;
  virtual void RunBackgroundStreamBackLoop() = 0;
  virtual void ExitBackgroundLoop() = 0;

  // Request management; safe to call from threads other than the loop thread.
  virtual void AddRequest(Request request) = 0;
  virtual void AbortRequest(const String& request_id) = 0;

  // Query and debug helpers.
  virtual GenerationConfig GetDefaultGenerationConfig() const = 0;
  virtual EngineConfig GetCompleteEngineConfig() const = 0;
  virtual void DebugCallFuncOnAllAllWorker(const String& func_name, Optional<String> func_args) = 0;
};
```
Key Methods
Factory Method
| Method | Description |
|---|---|
| Create() | Static factory method that returns a std::unique_ptr<ThreadedEngine>. This is the sole entry point for constructing a threaded engine instance. |
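ThreadedEngine is abstract, so callers only ever hold a pointer to the interface. The concrete class can therefore stay private to a source file, hidden behind Create(). The sketch below shows this general factory idiom using hypothetical names (Greeter, GreeterImpl). It is not the MLC LLM implementation class, whose name this page does not give.

```cpp
#include <cassert>
#include <memory>
#include <string>

// "Header" part: an abstract interface plus a static factory.
class Greeter {
 public:
  static std::unique_ptr<Greeter> Create();
  virtual ~Greeter() = default;  // virtual, so deleting through the base pointer is safe
  virtual std::string Greet(const std::string& name) const = 0;
};

// "Source file" part: the concrete type has internal linkage and is invisible to callers.
namespace {
class GreeterImpl final : public Greeter {
 public:
  std::string Greet(const std::string& name) const override { return "hello, " + name; }
};
}  // namespace

std::unique_ptr<Greeter> Greeter::Create() { return std::make_unique<GreeterImpl>(); }
```

Returning std::unique_ptr makes ownership explicit. It also lets the implementation class change without recompiling code that includes only the header.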
Initialization and Lifecycle
| Method | Description |
|---|---|
| InitThreadedEngine | Initializes the engine on the specified Device, optionally attaching a stream callback function and an event trace recorder. |
| Reload | Reloads the engine with a new configuration, provided as a JSON string, so the same ThreadedEngine object can be reconfigured without being re-created. |
| Unload | Unloads the background engine, releasing its associated resources. |
| Reset | Resets the engine to its initial state, clearing all pending requests and internal state. |
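The lifecycle in the table can be pictured as a small state machine. The following is a hypothetical sketch, not MLC LLM code. ToyLifecycle and its State enum are illustrative names. So is the rule that Reset keeps the loaded configuration while Unload drops it, which is an interpretation of the descriptions above. The JSON string is stored verbatim rather than parsed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical lifecycle model; not the real engine's state handling.
class ToyLifecycle {
 public:
  enum class State { kInitialized, kLoaded };

  // Loads (or reloads) a configuration, replacing any previous one.
  void Reload(const std::string& engine_config_json) {
    config_ = engine_config_json;
    pending_.clear();  // assumption of this sketch: a reload starts with no queued work
    state_ = State::kLoaded;
  }

  // Drops the loaded configuration and any queued work.
  void Unload() {
    config_.clear();
    pending_.clear();
    state_ = State::kInitialized;
  }

  // Clears pending work but keeps the currently loaded configuration.
  void Reset() { pending_.clear(); }

  void AddRequest(const std::string& request_id) {
    assert(state_ == State::kLoaded && "requests need a loaded engine");
    pending_.push_back(request_id);
  }

  State state() const { return state_; }
  const std::string& config() const { return config_; }
  std::size_t num_pending() const { return pending_.size(); }

 private:
  State state_ = State::kInitialized;
  std::string config_;
  std::vector<std::string> pending_;
};
```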
Background Loop Control
| Method | Description |
|---|---|
| RunBackgroundLoop | Runs the main background request-processing loop in the calling thread. The call does not return until ExitBackgroundLoop is invoked, so it should be run on a dedicated thread. |
| RunBackgroundStreamBackLoop | Runs the stream-back loop, which dispatches streaming results back to callers. It is likewise intended to run on its own thread. |
| ExitBackgroundLoop | Signals the background processing loop to exit. It is designed to be invoked from a thread other than the engine-driving thread. |
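A separate stream-back loop means a slow consumer of results does not stall request processing. Below is a minimal sketch of that producer/consumer hand-off under simplifying assumptions. StreamBack and Push are hypothetical names, outputs are plain strings, and the callback is a std::function (the real engine takes a TVM Function and its own output types). Exiting is modeled as setting a flag under the lock and waking the waiting thread. That is why ExitBackgroundLoop can safely be called from another thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stream-back component; not the MLC LLM implementation.
class StreamBack {
 public:
  explicit StreamBack(std::function<void(const std::string&)> callback)
      : callback_(std::move(callback)) {}

  // Called by the processing loop: enqueue and return immediately.
  void Push(std::string output) {
    {
      std::lock_guard<std::mutex> guard(mu_);
      outputs_.push_back(std::move(output));
    }
    cv_.notify_one();
  }

  // Runs on its own thread and invokes the callback outside the lock.
  void RunBackgroundStreamBackLoop() {
    std::unique_lock<std::mutex> lock(mu_);
    while (true) {
      cv_.wait(lock, [this] { return exit_ || !outputs_.empty(); });
      if (outputs_.empty()) break;  // only reachable when exit_ is set
      std::string out = std::move(outputs_.front());
      outputs_.pop_front();
      lock.unlock();
      callback_(out);  // a slow callback blocks only this thread
      lock.lock();
    }
  }

  // Safe to call from any thread.
  void ExitBackgroundLoop() {
    {
      std::lock_guard<std::mutex> guard(mu_);
      exit_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::function<void(const std::string&)> callback_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::string> outputs_;
  bool exit_ = false;
};
```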
Request Management
| Method | Description |
|---|---|
| AddRequest | Adds a new Request object to the engine for processing. Thread-safe. |
| AbortRequest | Aborts an existing request identified by its request ID string. Thread-safe. |
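In the simplest case, thread-safe AddRequest and AbortRequest come down to guarding a shared request container with a lock. This sketch assumes requests are identified by string IDs and handles only one case: aborting a request that has not started yet, by removing it from the waiting queue. RequestQueue, PopNext, and the Request struct are all hypothetical and far simpler than the engine's real Request type. The bool returned by this AbortRequest is also an addition of the sketch; the real method returns void.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified request record.
struct Request {
  std::string id;
  std::string prompt;
};

// Hypothetical lock-guarded waiting queue shared by client threads and the loop.
class RequestQueue {
 public:
  // Thread-safe: called by client threads.
  void AddRequest(Request request) {
    std::lock_guard<std::mutex> guard(mu_);
    waiting_.push_back(std::move(request));
  }

  // Thread-safe: removes a waiting request with the given ID.
  // Returns true if a request was removed.
  bool AbortRequest(const std::string& request_id) {
    std::lock_guard<std::mutex> guard(mu_);
    auto it = std::find_if(waiting_.begin(), waiting_.end(),
                           [&](const Request& r) { return r.id == request_id; });
    if (it == waiting_.end()) return false;
    waiting_.erase(it);
    return true;
  }

  // Called by the processing loop to take the next request to schedule.
  std::optional<Request> PopNext() {
    std::lock_guard<std::mutex> guard(mu_);
    if (waiting_.empty()) return std::nullopt;
    Request next = std::move(waiting_.front());
    waiting_.pop_front();
    return next;
  }

 private:
  std::mutex mu_;
  std::deque<Request> waiting_;
};
```

A real engine must also abort requests that are already running. That path needs cooperation from the processing loop, which this sketch does not model.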
Query and Debug
| Method | Description |
|---|---|
| GetDefaultGenerationConfig | Returns the default GenerationConfig used by the engine. |
| GetCompleteEngineConfig | Returns the complete EngineConfig reflecting the current engine state. |
| DebugCallFuncOnAllAllWorker | Invokes a named global function on all workers, optionally passing func_args. Intended for debugging purposes only. |
Dependencies
The header includes the following:
- picojson.h: JSON parsing library used for configuration handling.
- data.h: Data structures used by the serving engine.
- engine.h: The base engine interface that the threaded engine wraps.
- request.h: Definition of the Request type.
The class uses TVM runtime types via using namespace tvm::runtime, including Device, Optional, Function, and String.
File Location
- Source file: cpp/serve/threaded_engine.h
- Namespace: mlc::llm::serve
- Header guard: MLC_LLM_SERVE_THREADED_ENGINE_H_