Implementation:Mlc ai Mlc llm Threaded Engine Header


Overview

The ThreadedEngine header defines the abstract interface for the threaded serving engine in MLC LLM. Located at cpp/serve/threaded_engine.h, this file declares the ThreadedEngine class, which provides a thread-safe wrapper around the core LLM inference engine. The threaded engine runs a background request processing loop on a dedicated thread, allowing external threads to safely submit and abort requests while inference proceeds concurrently.

Purpose

This header defines an abstract interface, made up entirely of pure virtual methods, for multithreaded LLM serving. The ThreadedEngine class decouples request submission from request processing by running the engine loop in a background thread. This design enables:

  • Concurrent request handling: External threads can call AddRequest and AbortRequest without blocking the inference pipeline.
  • Lifecycle management: The engine supports initialization, reloading with new configurations, unloading, and resetting.
  • Stream callback support: A separate background loop handles streaming responses back to callers.
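
The decoupling described above can be sketched with a mutex-protected queue and a condition variable. This is a minimal illustration under stated assumptions, not the MLC LLM implementation: MiniThreadedEngine, RunMiniEngineDemo, and the use of plain strings as requests are hypothetical, and the choice to drain pending requests before exiting is a simplification made so the demo is deterministic.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a threaded engine; not MLC LLM code.
class MiniThreadedEngine {
 public:
  // Safe to call from any thread while the background loop is running.
  void AddRequest(std::string request_id) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_.push(std::move(request_id));
    }
    cv_.notify_one();
  }

  // Blocks the calling thread until exit is requested and the queue is empty.
  void RunBackgroundLoop() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (true) {
      cv_.wait(lock, [this] { return exit_ || !pending_.empty(); });
      if (pending_.empty()) return;  // exit requested, nothing left to do
      std::string id = std::move(pending_.front());
      pending_.pop();
      lock.unlock();
      // A real engine would run inference here, outside the lock,
      // so that AddRequest callers are never blocked by it.
      lock.lock();
      processed_.push_back(std::move(id));
    }
  }

  // Intended to be called from a thread other than the one running the loop.
  void ExitBackgroundLoop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      exit_ = true;
    }
    cv_.notify_one();
  }

  std::vector<std::string> Processed() {
    std::lock_guard<std::mutex> lock(mutex_);
    return processed_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::string> pending_;
  std::vector<std::string> processed_;
  bool exit_ = false;
};

// Submits three requests from the calling thread while a worker thread
// runs the loop, then returns how many requests were processed.
std::size_t RunMiniEngineDemo() {
  MiniThreadedEngine engine;
  std::thread worker([&engine] { engine.RunBackgroundLoop(); });
  engine.AddRequest("req-0");
  engine.AddRequest("req-1");
  engine.AddRequest("req-2");
  engine.ExitBackgroundLoop();
  worker.join();
  return engine.Processed().size();
}
```

The lock is released while a request is being handled, which is the property that lets submitting threads proceed without waiting on inference.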

Class Declaration

The ThreadedEngine class resides in namespace mlc::llm::serve and is declared as an abstract base class with pure virtual methods.

class ThreadedEngine {
 public:
  static std::unique_ptr<ThreadedEngine> Create();
  virtual ~ThreadedEngine() = default;

  virtual void InitThreadedEngine(Device device, Optional<Function> request_stream_callback,
                                  Optional<EventTraceRecorder> trace_recorder) = 0;
  virtual void Reload(String engine_config_json_str) = 0;
  virtual void Unload() = 0;
  virtual void Reset() = 0;
  virtual void RunBackgroundLoop() = 0;
  virtual void RunBackgroundStreamBackLoop() = 0;
  virtual void ExitBackgroundLoop() = 0;
  virtual void AddRequest(Request request) = 0;
  virtual void AbortRequest(const String& request_id) = 0;
  virtual GenerationConfig GetDefaultGenerationConfig() const = 0;
  virtual EngineConfig GetCompleteEngineConfig() const = 0;
  virtual void DebugCallFuncOnAllAllWorker(const String& func_name, Optional<String> func_args) = 0;
};

Key Methods

Factory Method

Method | Description
Create() | Static factory method that returns a std::unique_ptr<ThreadedEngine>. This is the sole entry point for constructing a threaded engine instance.
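
The factory arrangement can be illustrated in isolation: an abstract class exposes a static Create() and the concrete type stays hidden in the implementation file. The sketch below assumes hypothetical names (MiniEngine, MiniEngineImpl, Describe) and says nothing about how the real implementation class is named or structured.

```cpp
#include <memory>
#include <string>
#include <type_traits>

// Hypothetical abstract interface with a static factory, mirroring the
// shape of ThreadedEngine::Create(). Not MLC LLM code.
class MiniEngine {
 public:
  static std::unique_ptr<MiniEngine> Create();
  virtual ~MiniEngine() = default;
  virtual std::string Describe() const = 0;
};

// The interface cannot be instantiated directly.
static_assert(std::is_abstract<MiniEngine>::value,
              "MiniEngine must be abstract");

namespace {
// Concrete type kept in an anonymous namespace, as it would be in a .cc
// file, so callers depend only on the interface.
class MiniEngineImpl final : public MiniEngine {
 public:
  std::string Describe() const override { return "mini-engine-impl"; }
};
}  // namespace

std::unique_ptr<MiniEngine> MiniEngine::Create() {
  return std::make_unique<MiniEngineImpl>();
}
```

A caller would write auto engine = MiniEngine::Create(); and use only the virtual methods. Returning std::unique_ptr makes ownership explicit, and the virtual destructor ensures the hidden implementation is destroyed correctly through the base pointer.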

Initialization and Lifecycle

Method | Description
InitThreadedEngine | Initializes the engine on the specified Device, optionally attaching a stream callback function and an event trace recorder.
Reload | Reloads the engine with a new configuration provided as a JSON string, enabling hot-reconfiguration.
Unload | Unloads the background engine, releasing associated resources.
Reset | Resets the engine to its initial state, clearing all pending requests and internal state.

Background Loop Control

Method | Description
RunBackgroundLoop | Starts the main background request processing loop. This is intended to run on a dedicated thread.
RunBackgroundStreamBackLoop | Starts the stream callback loop, which dispatches streaming results back to callers.
ExitBackgroundLoop | Signals the background processing loop to exit. Designed to be invoked from a thread other than the engine-driving thread.
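
One way to picture how the three loop-control methods fit together is a request loop and a stream-back loop, each on its own thread, with a third thread signalling exit. The sketch below is an assumption-laden illustration: TwoLoopEngine, RunTwoLoopDemo, the "done:" result strings, and the policy of dropping queued work on exit are all hypothetical and are not taken from the MLC LLM sources.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical two-loop engine; method names mirror the header,
// the behaviour is assumed for illustration only.
class TwoLoopEngine {
 public:
  explicit TwoLoopEngine(std::function<void(const std::string&)> callback)
      : callback_(std::move(callback)) {}

  void AddRequest(std::string id) { Push(inbox_, std::move(id)); }

  // Request loop: turns each request into a result for the other loop.
  void RunBackgroundLoop() {
    std::string id;
    while (Pop(inbox_, &id)) Push(outbox_, "done:" + id);
  }

  // Stream-back loop: hands results to the caller-supplied callback.
  void RunBackgroundStreamBackLoop() {
    std::string result;
    while (Pop(outbox_, &result)) callback_(result);
  }

  // Wakes both loops and makes them return.
  void ExitBackgroundLoop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      exit_ = true;
    }
    cv_.notify_all();
  }

 private:
  using Queue = std::queue<std::string>;

  void Push(Queue& q, std::string value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      q.push(std::move(value));
    }
    cv_.notify_all();
  }

  // Blocks until an item is available; returns false once exit is requested.
  bool Pop(Queue& q, std::string* out) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [&] { return exit_ || !q.empty(); });
    if (exit_) return false;
    *out = std::move(q.front());
    q.pop();
    return true;
  }

  std::function<void(const std::string&)> callback_;
  std::mutex mutex_;
  std::condition_variable cv_;
  Queue inbox_;
  Queue outbox_;
  bool exit_ = false;
};

// Runs both loops on dedicated threads, waits for two streamed results,
// then signals exit from the calling thread and joins.
std::vector<std::string> RunTwoLoopDemo() {
  std::mutex m;
  std::condition_variable cv;
  std::vector<std::string> received;
  TwoLoopEngine engine([&](const std::string& result) {
    {
      std::lock_guard<std::mutex> lock(m);
      received.push_back(result);
    }
    cv.notify_one();
  });
  std::thread request_thread([&] { engine.RunBackgroundLoop(); });
  std::thread stream_thread([&] { engine.RunBackgroundStreamBackLoop(); });
  engine.AddRequest("a");
  engine.AddRequest("b");
  {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return received.size() == 2; });
  }
  engine.ExitBackgroundLoop();
  request_thread.join();
  stream_thread.join();
  return received;
}
```

Because this sketch discards queued work once exit is requested, the demo waits for both callbacks before calling ExitBackgroundLoop. Whether the real engine drains or drops pending work on exit is not stated on this page.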

Request Management

Method | Description
AddRequest | Adds a new Request object to the engine for processing. Thread-safe.
AbortRequest | Aborts an existing request identified by its request ID string. Thread-safe.
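
Thread-safe add and abort by request ID can be sketched with a set guarded by a mutex. The names RequestTable, ActiveCount, and AddThenAbortConcurrently are hypothetical, and unlike the real methods, which return void, the sketch returns bool so the outcome of each call is observable.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

// Hypothetical bookkeeping for active request IDs; not MLC LLM code.
class RequestTable {
 public:
  // Returns false if a request with this ID is already active.
  bool AddRequest(const std::string& request_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    return active_.insert(request_id).second;
  }

  // Returns false if no active request has this ID.
  bool AbortRequest(const std::string& request_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    return active_.erase(request_id) > 0;
  }

  std::size_t ActiveCount() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return active_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_set<std::string> active_;
};

// Several threads each add and then abort their own requests;
// the table should be empty once all threads have finished.
std::size_t AddThenAbortConcurrently(int num_threads, int per_thread) {
  RequestTable table;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&table, t, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        const std::string id =
            "t" + std::to_string(t) + "-" + std::to_string(i);
        table.AddRequest(id);
        table.AbortRequest(id);
      }
    });
  }
  for (std::thread& th : threads) th.join();
  return table.ActiveCount();
}
```

Aborting an unknown ID is treated here as a harmless no-op that reports false, which keeps a late abort from racing with a request that has already finished.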

Query and Debug

Method | Description
GetDefaultGenerationConfig | Returns the default GenerationConfig used by the engine.
GetCompleteEngineConfig | Returns the complete EngineConfig reflecting the current engine state.
DebugCallFuncOnAllAllWorker | Invokes a named global function on all workers. Intended for debugging purposes only.

Dependencies

The header includes the following:

  • picojson.h -- JSON parsing library used for configuration handling.
  • data.h -- Data structures used by the serving engine.
  • engine.h -- The base engine interface that the threaded engine wraps.
  • request.h -- Definition of the Request type.

The header brings TVM runtime types into scope with a using namespace tvm::runtime directive; the interface relies on Device, Optional, Function, and String.

File Location

  • Source file: cpp/serve/threaded_engine.h
  • Namespace: mlc::llm::serve
  • Header guard: MLC_LLM_SERVE_THREADED_ENGINE_H_
