Principle: mlc-ai/mlc-llm Native Application Building
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Mobile_Deployment |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Native application building is the practice of integrating compiled LLM inference engines into mobile applications through platform-specific SDK bindings that expose a high-level, idiomatic API in the target platform's primary programming language.
Description
Once model libraries have been compiled and packaged, they must be integrated into actual mobile applications. This requires a bridge between the low-level C++ inference engine (based on TVM's runtime) and the high-level application code written in the platform's native language (Swift/Objective-C for iOS, Kotlin/Java for Android).
MLC-LLM addresses this through a layered binding architecture:
Layer 1: C++ JSON FFI Engine. At the core is a C++ engine that accepts JSON-formatted requests and returns JSON-formatted responses, following the OpenAI Chat Completions API protocol. This JSON-based interface serves as the universal contract between the native runtime and platform bindings.
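To make that contract concrete, the sketch below shows the general shape of an OpenAI-style streaming request and a partial response chunk as plain JSON strings. The field names (`messages`, `stream`, `choices`, `delta`, `finish_reason`) follow the public OpenAI Chat Completions format; the exact field set the C++ engine accepts is not reproduced here, so treat these payloads as illustrative rather than authoritative:

```java
public class JsonContractExample {
    public static void main(String[] args) {
        // Illustrative request in the OpenAI Chat Completions shape that
        // the JSON FFI engine accepts; "stream": true asks for chunked output.
        String request = """
            {
              "messages": [
                {"role": "user", "content": "What is on-device inference?"}
              ],
              "stream": true
            }
            """;

        // Illustrative streaming response chunk: each chunk carries a partial
        // "delta", and an id ties it back to the originating request.
        String responseChunk = """
            {
              "id": "req-0",
              "choices": [
                {"index": 0, "delta": {"content": "On-device"}, "finish_reason": null}
              ]
            }
            """;

        // Sanity-check the two shapes against the contract described above.
        System.out.println(request.contains("\"stream\": true"));
        System.out.println(responseChunk.contains("\"delta\""));
    }
}
```

Because both sides of the FFI boundary only ever exchange strings like these, neither the platform bindings nor the C++ engine needs to marshal language-specific object types across the boundary.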
Layer 2: Platform-Specific FFI Wrapper. Each platform has a thin wrapper that bridges the C++ engine to the platform's native runtime:
- On iOS, the `JSONFFIEngine` is an Objective-C class that wraps the C++ engine using Objective-C's C++ interoperability, exposing methods like `initBackgroundEngine:`, `reload:`, and `chatCompletion:requestID:`.
- On Android, the `JSONFFIEngine` is a Java class that uses TVM's JNI bridge (tvm4j) to call into the C++ engine via TVM's Module and Function abstractions.
Layer 3: High-Level SDK. On top of the FFI wrapper, a high-level SDK provides an idiomatic API:
- On Android, the `MLCEngine` Kotlin class manages the engine lifecycle (initialization, background thread management) and exposes the OpenAI-compatible `chat.completions.create()` interface, using Kotlin coroutines and channels for asynchronous streaming.
- On iOS, the Swift layer (built on top of the Objective-C `JSONFFIEngine`) provides a similar high-level interface.
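To make the create-then-stream pattern concrete, here is a minimal Java sketch standing in for the Kotlin SDK: a stub `create` call returns a queue of partial chunks that the caller drains, analogous to consuming a Kotlin Channel of `ChatCompletionStreamResponse` objects. The `Chunk` type, the `create` signature, and the canned tokens are hypothetical stand-ins, not the real SDK API:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SdkShapeSketch {
    // Stand-in for one streaming response delta; the real SDK streams
    // ChatCompletionStreamResponse objects instead.
    record Chunk(String delta, boolean done) {}

    // Hypothetical stub for chat.completions.create(): streams a canned
    // completion chunk by chunk on a worker thread.
    static BlockingQueue<Chunk> create(List<String> messages) {
        BlockingQueue<Chunk> stream = new LinkedBlockingQueue<>();
        new Thread(() -> {
            for (String token : new String[] {"Hello", " from", " the", " engine"}) {
                stream.add(new Chunk(token, false));
            }
            stream.add(new Chunk("", true)); // finish marker ends the stream
        }).start();
        return stream;
    }

    public static void main(String[] args) throws InterruptedException {
        // Consume the stream as an app's UI layer would, appending deltas
        // as they arrive rather than waiting for the full completion.
        StringBuilder reply = new StringBuilder();
        BlockingQueue<Chunk> stream = create(List.of("user: hi"));
        for (Chunk c = stream.take(); !c.done(); c = stream.take()) {
            reply.append(c.delta());
        }
        System.out.println(reply);
    }
}
```

The point of the high-level layer is exactly this inversion: the application writes ordinary sequential consumption code while the SDK hides the FFI calls and thread handoffs behind the stream.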
Streaming Architecture. Both platforms use a background thread model where:
- A background loop thread continuously processes inference requests.
- A stream-back loop thread delivers partial results (tokens) to the application via callbacks.
- The application receives streaming responses through platform-native async primitives (Kotlin Channels, Swift async/await).
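A minimal, self-contained Java sketch of this two-thread model follows. Plain `BlockingQueue`s stand in for both the real engine and for Kotlin Channels, and the whitespace "tokenizer" is purely illustrative; what the sketch does show faithfully is the division of labor: a background loop produces tokens, a stream-back loop routes them by request ID, and the application thread only ever reads its own queue:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class StreamingLoopsSketch {
    // Per-request output queues, keyed by request ID -- the routing the
    // Android SDK performs with Kotlin Channels.
    static final Map<String, BlockingQueue<String>> outputs = new ConcurrentHashMap<>();
    // Inbound work for the background inference loop: (requestId, prompt).
    static final BlockingQueue<String[]> requests = new LinkedBlockingQueue<>();
    // Raw (requestId, token) pairs handed to the stream-back loop.
    static final BlockingQueue<String[]> emitted = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Background loop: "runs inference" by emitting one token per word.
        Thread background = new Thread(() -> {
            try {
                while (true) {
                    String[] req = requests.take();
                    for (String tok : req[1].split(" ")) {
                        emitted.put(new String[] {req[0], tok}); // partial result
                    }
                    emitted.put(new String[] {req[0], "<eos>"}); // end of stream
                }
            } catch (InterruptedException ignored) {}
        });
        // Stream-back loop: routes each token to its request's queue.
        Thread streamBack = new Thread(() -> {
            try {
                while (true) {
                    String[] out = emitted.take();
                    outputs.get(out[0]).put(out[1]); // route by request ID
                }
            } catch (InterruptedException ignored) {}
        });
        background.setDaemon(true);
        streamBack.setDaemon(true);
        background.start();
        streamBack.start();

        // The "application": register a queue, submit a request, read tokens.
        outputs.put("req-1", new LinkedBlockingQueue<>());
        requests.put(new String[] {"req-1", "tokens stream back in order"});
        StringBuilder got = new StringBuilder();
        for (String tok = outputs.get("req-1").take(); !tok.equals("<eos>");
             tok = outputs.get("req-1").take()) {
            got.append(got.length() == 0 ? "" : " ").append(tok);
        }
        System.out.println(got);
    }
}
```

Because the application thread only blocks on its own per-request queue, inference never stalls the UI thread, and several in-flight requests can share the same pair of loops without their tokens interleaving.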
Usage
Use native application building patterns when:
- Integrating MLC-LLM into a new iOS or Android application
- Building custom chat interfaces or AI-powered features that leverage on-device LLM inference
- Extending the SDK with additional functionality (e.g., function calling, custom system prompts)
- Understanding the architecture of existing MLC-LLM sample applications (MLCChat)
Theoretical Basis
The integration follows the OpenAI Chat Completions API protocol, adapted for on-device streaming:
Application Code
|
v
MLCEngine (Kotlin/Swift)
- Manages lifecycle (init, reload, reset, unload)
- Exposes chat.completions.create() API
- Handles async streaming via Channels/Callbacks
|
v
JSONFFIEngine (Java/ObjC)
- Thin FFI wrapper around C++ engine
- Methods: initBackgroundEngine, reload, chatCompletion, abort
- Manages background loop threads
|
v
C++ JSON FFI Engine (TVM Runtime)
- Accepts JSON requests (OpenAI protocol)
- Returns JSON streaming responses
- Executes compiled model kernels (Metal/OpenCL)
The request lifecycle:
- Initialization: The application creates an `MLCEngine` instance, which initializes the `JSONFFIEngine`, registers a stream callback, and starts two background threads (inference loop and stream-back loop).
- Model Loading: The application calls `reload(modelPath, modelLib)` with the model's local path and the system library name. The engine loads the compiled model library and weight parameters.
- Inference: The application sends a `ChatCompletionRequest` (serialized as JSON) via `chatCompletion(requestJSON, requestID)`. The background loop processes the request and generates tokens.
- Streaming: As tokens are generated, the stream-back loop delivers partial `ChatCompletionStreamResponse` objects to the registered callback. On Android, these are deserialized and routed to the appropriate Kotlin Channel based on the request ID.
- Cleanup: The application can call `reset()` to clear conversation state, `unload()` to release model resources, or destroy the engine entirely.
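The lifecycle above can be sketched as a call sequence against a hypothetical engine facade. The method names (`reload`, `chatCompletion`, `reset`, `unload`) come from this section; the model path, library name, and recording bodies are illustrative stand-ins only:

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {
    // Hypothetical facade mirroring the lifecycle calls described above;
    // the bodies just record call order for illustration.
    static class FakeEngine {
        final List<String> calls = new ArrayList<>();
        void reload(String modelPath, String modelLib)       { calls.add("reload"); }
        void chatCompletion(String requestJson, String reqId) { calls.add("chatCompletion"); }
        void reset()  { calls.add("reset"); }
        void unload() { calls.add("unload"); }
    }

    public static void main(String[] args) {
        FakeEngine engine = new FakeEngine();                // initialization
        engine.reload("/models/demo", "demo_lib");           // model loading (hypothetical names)
        engine.chatCompletion("{\"messages\":[]}", "req-1"); // inference
        engine.reset();                                      // clear conversation state
        engine.unload();                                     // release model resources
        System.out.println(String.join(" -> ", engine.calls));
    }
}
```

Note the ordering constraint implied by the section: `reload` must precede any `chatCompletion`, while `reset` and `unload` are independent cleanup paths the application chooses between.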
Key architectural choices:
| Design Choice | Rationale |
|---|---|
| JSON-based FFI | Provides a language-agnostic contract; easy to serialize/deserialize on both sides; follows the widely adopted OpenAI API format |
| Background thread model | Keeps the UI thread responsive during inference; allows streaming token delivery without blocking |
| Request ID tracking | Enables concurrent requests and correct routing of streaming responses to the originating caller |
| OpenAI protocol compatibility | Enables code reuse between server-side and on-device deployments; familiar API for developers |