
Principle:Mlc ai Mlc llm Native Application Building

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Mobile_Deployment
Last Updated 2026-02-09 00:00 GMT

Overview

Native application building is the practice of integrating compiled LLM inference engines into mobile applications through platform-specific SDK bindings that expose a high-level, idiomatic API in the target platform's primary programming language.

Description

Once model libraries have been compiled and packaged, they must be integrated into actual mobile applications. This requires a bridge between the low-level C++ inference engine (based on TVM's runtime) and the high-level application code written in the platform's native language (Swift/Objective-C for iOS, Kotlin/Java for Android).

MLC-LLM addresses this through a layered binding architecture:

Layer 1: C++ JSON FFI Engine. At the core is a C++ engine that accepts JSON-formatted requests and returns JSON-formatted responses, following the OpenAI Chat Completions API protocol. This JSON-based interface serves as the universal contract between the native runtime and platform bindings.
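As a rough illustration of that contract, the sketch below hand-builds a request string in the OpenAI Chat Completions shape (`messages`, `role`, `content`, `stream`). The builder itself is hypothetical and dependency-free; real bindings serialize with a JSON library before passing the string across the FFI boundary.

```java
import java.util.List;

// Minimal sketch of the JSON contract the FFI layer speaks: requests
// follow the OpenAI Chat Completions format and are serialized to a
// single JSON string before crossing the language boundary.
public class ChatRequest {
    // Build one message object. (No escaping of quotes inside content:
    // a real implementation would use a JSON library.)
    public static String message(String role, String content) {
        return "{\"role\":\"" + role + "\",\"content\":\"" + content + "\"}";
    }

    // Assemble the full request body, with streaming on or off.
    public static String build(List<String> messages, boolean stream) {
        return "{\"messages\":[" + String.join(",", messages)
             + "],\"stream\":" + stream + "}";
    }

    public static void main(String[] args) {
        String json = build(List.of(message("user", "Hello on-device LLM")), true);
        System.out.println(json);
    }
}
```

The engine's reply takes the same route in reverse: JSON chunks following the OpenAI streaming-response shape, deserialized by the platform SDK.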

Layer 2: Platform-Specific FFI Wrapper. Each platform has a thin wrapper that bridges the C++ engine to the platform's native runtime:

  • On iOS, the JSONFFIEngine is an Objective-C class that wraps the C++ engine using Objective-C's C++ interoperability, exposing methods like initBackgroundEngine:, reload:, and chatCompletion:requestID:.
  • On Android, the JSONFFIEngine is a Java class that uses TVM's JNI bridge (tvm4j) to call into the C++ engine via TVM's Module and Function abstractions.

Layer 3: High-Level SDK. On top of the FFI wrapper, a high-level SDK provides an idiomatic API:

  • On Android, the MLCEngine Kotlin class manages the engine lifecycle (initialization, background thread management) and exposes the OpenAI-compatible chat.completions.create() interface using Kotlin coroutines and channels for asynchronous streaming.
  • On iOS, the Swift layer (built on top of the Objective-C JSONFFIEngine) provides a similar high-level interface.
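The division of labor between Layers 2 and 3 can be sketched as two tiny classes. These are illustrative stand-ins, not the actual MLC-LLM API: the fake FFI layer only moves JSON strings and echoes a canned chunk, where the real `JSONFFIEngine` dispatches into the C++ engine; the mini engine shows how the high-level SDK hides strings behind typed calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class LayeredBindings {
    // Layer 2 stand-in: string-in, string-out, plus a stream callback.
    public static class FakeJsonFfi {
        private Consumer<String> callback;
        public void initBackgroundEngine(Consumer<String> streamCallback) {
            this.callback = streamCallback;
        }
        public void reload(String modelPath, String modelLib) {
            // Would load compiled model library and weights.
        }
        public void chatCompletion(String requestJson, String requestId) {
            // A real engine streams many chunks; one canned chunk suffices here.
            callback.accept("{\"id\":\"" + requestId + "\",\"delta\":\"ok\"}");
        }
    }

    // Layer 3 stand-in: an idiomatic API over the string-based wrapper.
    public static class MiniEngine {
        private final FakeJsonFfi ffi = new FakeJsonFfi();
        private final List<String> received = new ArrayList<>();
        public MiniEngine() { ffi.initBackgroundEngine(received::add); }
        public List<String> chat(String prompt, String requestId) {
            ffi.chatCompletion(
                "{\"messages\":[{\"role\":\"user\",\"content\":\"" + prompt + "\"}]}",
                requestId);
            return received;
        }
    }

    public static void main(String[] args) {
        System.out.println(new MiniEngine().chat("hi", "req-0"));
    }
}
```

The point of the split: application code only ever touches the Layer 3 surface, so the JSON plumbing and JNI/Objective-C details can change without breaking callers.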

Streaming Architecture. Both platforms use a background thread model where:

  1. A background loop thread continuously processes inference requests.
  2. A stream-back loop thread delivers partial results (tokens) to the application via callbacks.
  3. The application receives streaming responses through platform-native async primitives (Kotlin Channels, Swift async/await).
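The two-loop model above can be demonstrated with plain threads. This is a simplified simulation, not the engine's code: a blocking queue stands in for the engine's internal plumbing, and "tokens" are just pre-split strings handed to the application callback.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class StreamLoops {
    private static final String EOS = "<eos>";  // end-of-stream sentinel

    public static void streamTokens(List<String> tokens, Consumer<String> onToken)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // 1. Inference loop: produces one token per step.
        Thread inferenceLoop = new Thread(() -> {
            for (String t : tokens) queue.add(t);
            queue.add(EOS);
        });

        // 2. Stream-back loop: delivers tokens to the app callback.
        Thread streamBackLoop = new Thread(() -> {
            try {
                for (String t = queue.take(); !t.equals(EOS); t = queue.take()) {
                    onToken.accept(t);
                }
            } catch (InterruptedException ignored) { }
        });

        inferenceLoop.start();
        streamBackLoop.start();
        inferenceLoop.join();
        streamBackLoop.join();
    }

    public static void main(String[] args) throws InterruptedException {
        streamTokens(List.of("On", "-device", " LLM"), System.out::print);
        System.out.println();
    }
}
```

Because only the stream-back loop touches the callback, the UI thread (step 3 above) never blocks on token generation; it just consumes the platform's async primitive.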

Usage

Use native application building patterns when:

  • Integrating MLC-LLM into a new iOS or Android application
  • Building custom chat interfaces or AI-powered features that leverage on-device LLM inference
  • Extending the SDK with additional functionality (e.g., function calling, custom system prompts)
  • Understanding the architecture of existing MLC-LLM sample applications (MLCChat)

Theoretical Basis

The integration follows the OpenAI Chat Completions API protocol, adapted for on-device streaming:

Application Code
      |
      v
MLCEngine (Kotlin/Swift)
  - Manages lifecycle (init, reload, reset, unload)
  - Exposes chat.completions.create() API
  - Handles async streaming via Channels/Callbacks
      |
      v
JSONFFIEngine (Java/ObjC)
  - Thin FFI wrapper around C++ engine
  - Methods: initBackgroundEngine, reload, chatCompletion, abort
  - Manages background loop threads
      |
      v
C++ JSON FFI Engine (TVM Runtime)
  - Accepts JSON requests (OpenAI protocol)
  - Returns JSON streaming responses
  - Executes compiled model kernels (Metal/OpenCL)

The request lifecycle:

  1. Initialization: The application creates an MLCEngine instance, which initializes the JSONFFIEngine, registers a stream callback, and starts two background threads (inference loop and stream-back loop).
  2. Model Loading: The application calls reload(modelPath, modelLib) with the model's local path and the system library name. The engine loads the compiled model library and weight parameters.
  3. Inference: The application sends a ChatCompletionRequest (serialized as JSON) via chatCompletion(requestJSON, requestID). The background loop processes the request and generates tokens.
  4. Streaming: As tokens are generated, the stream-back loop delivers partial ChatCompletionStreamResponse objects to the registered callback. On Android, these are deserialized and routed to the appropriate Kotlin Channel based on the request ID.
  5. Cleanup: The application can call reset() to clear conversation state, unload() to release model resources, or destroy the engine entirely.

Key architectural choices:

  • JSON-based FFI: provides a language-agnostic contract that is easy to serialize and deserialize on both sides, and follows the widely adopted OpenAI API format.
  • Background thread model: keeps the UI thread responsive during inference and allows streaming token delivery without blocking.
  • Request ID tracking: enables concurrent requests and correct routing of streaming responses to the originating caller.
  • OpenAI protocol compatibility: enables code reuse between server-side and on-device deployments and offers developers a familiar API.
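The request-ID-tracking choice can be sketched as a small router: the single stream callback fans chunks out to per-request queues (the role Kotlin Channels play on Android), keyed by the request ID carried in each response. The class and method names here are illustrative, not the SDK's.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class ResponseRouter {
    private final Map<String, BlockingQueue<String>> channels = new ConcurrentHashMap<>();

    // Register a request and get the queue its chunks will arrive on.
    public BlockingQueue<String> open(String requestId) {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        channels.put(requestId, q);
        return q;
    }

    // Called once per streamed chunk; routes by the embedded request ID.
    public void onStreamResponse(String requestId, String chunk) {
        BlockingQueue<String> q = channels.get(requestId);
        if (q != null) q.add(chunk);
    }

    // Drop the mapping once the stream finishes or is aborted.
    public void close(String requestId) { channels.remove(requestId); }

    public static void main(String[] args) {
        ResponseRouter router = new ResponseRouter();
        BlockingQueue<String> a = router.open("req-a");
        BlockingQueue<String> b = router.open("req-b");
        router.onStreamResponse("req-a", "Hello");
        router.onStreamResponse("req-b", "World");
        System.out.println(a.poll() + " " + b.poll());
    }
}
```

Routing by ID is what makes concurrent requests safe: two in-flight completions never see each other's tokens even though they share one callback.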

Related Pages

Implemented By
