Workflow:Mlc ai Mlc llm Mobile Deployment

Knowledge Sources	MLC-LLM, MLC-LLM Docs, iOS Deployment, Android Deployment, Packaging Guide
Domains	LLMs, Mobile_Deployment, iOS, Android, Edge_Computing
Last Updated	2026-02-09 20:00 GMT

Overview

End-to-end process for packaging a compiled MLC-LLM model and deploying it as a native mobile application on iOS or Android devices with on-device inference.

Description

This workflow covers the complete mobile deployment pipeline for MLC-LLM: preparing model artifacts, packaging libraries and weights for the target mobile platform, and building and running the native app. The packaging system compiles model architectures into static libraries, bundles quantized weights, and generates the runtime configuration. On iOS, the output integrates with an Xcode project through the MLCSwift framework. On Android, it produces a JNI-bridged native library that Kotlin/Java applications built in Android Studio can use. Inference runs entirely on-device on the mobile GPU (Metal on iOS, OpenCL on Android).

Key outputs:

Compiled static libraries for the target mobile platform
Bundled model weights (optional, can also be downloaded at runtime)
Native mobile application with on-device LLM inference
OpenAI-compatible API accessible from Swift (iOS) or Kotlin (Android)

Usage

Execute this workflow when you need to deploy an LLM for on-device inference on mobile hardware. This is appropriate for privacy-sensitive applications where data must not leave the device, offline-capable chatbots, and mobile applications requiring low-latency inference without server dependency. The workflow supports both bundling weights into the app binary and downloading them at runtime.

Execution Steps

Step 1: Set up build environment

Install the dependencies required for the target mobile platform. For iOS, these are CMake, Xcode, Git LFS, and the Rust toolchain. For Android, these are Android Studio with the NDK (version 27.0.11718014 recommended), CMake, JDK 17 or later, and Rust with the aarch64-linux-android target, plus environment variables that point to the TVM source, the NDK, and the Android cross-compiler.

Key considerations:

iOS builds require macOS with Xcode installed
Android NDK version must match the project requirements (27.0.11718014 recommended)
Rust and Cargo are required for both platforms to build the tokenizer library
The MLC_LLM_SOURCE_DIR environment variable must point to the cloned MLC-LLM repository root
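
A quick pre-flight check can catch a missing tool or variable before a long build fails. The sketch below is illustrative only: MLC_LLM_SOURCE_DIR comes from the text above, while the other variable names (TVM_SOURCE_DIR, ANDROID_NDK, TVM_NDK_CC, JAVA_HOME) and the executable names are assumptions that should be checked against the MLC-LLM documentation for your version.

```python
import os
import shutil

# Assumed executable names for the tools listed in Step 1.
REQUIRED_TOOLS = {
    "ios": ["cmake", "git-lfs", "cargo", "xcodebuild"],
    "android": ["cmake", "cargo", "java"],
}

# MLC_LLM_SOURCE_DIR is named in this workflow; the remaining names are
# assumptions for the TVM source, NDK path, and cross-compiler settings.
REQUIRED_ENV = {
    "ios": ["MLC_LLM_SOURCE_DIR"],
    "android": [
        "MLC_LLM_SOURCE_DIR",
        "TVM_SOURCE_DIR",
        "ANDROID_NDK",
        "TVM_NDK_CC",
        "JAVA_HOME",
    ],
}


def missing_prerequisites(platform, env=None, which=shutil.which):
    """Return descriptions of tools and variables that appear to be missing."""
    env = os.environ if env is None else env
    problems = []
    for tool in REQUIRED_TOOLS[platform]:
        if which(tool) is None:
            problems.append(f"tool not on PATH: {tool}")
    for name in REQUIRED_ENV[platform]:
        if not env.get(name):
            problems.append(f"environment variable not set: {name}")
    return problems
```

Calling missing_prerequisites("android") with no other arguments inspects the current shell environment; an empty list means nothing in this (assumed) checklist is missing. It does not verify versions, so the NDK and JDK versions still need a manual check.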

Step 2: Configure model packaging

Create or modify the mlc-package-config.json file in the application directory. This configuration specifies the target device (iphone or android), the list of models to include, estimated VRAM requirements, whether to bundle weights into the app, and optional model configuration overrides such as context window size reduction for memory-constrained devices.

Key considerations:

Set estimated_vram_bytes to help the runtime manage memory across multiple models
Enable bundle_weight for models that should be packaged directly into the app binary
Override context_window_size to reduce memory usage on mobile hardware
Multiple models can be listed for a multi-model application
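
As a rough illustration of what such a configuration might contain, the sketch below assembles one in Python and serializes it to JSON. The keys estimated_vram_bytes, bundle_weight, and context_window_size are named above; the surrounding keys (device, model_list, model, model_id, overrides) follow the layout I understand the packaging guide to use and should be treated as assumptions. The model URL and ID are hypothetical placeholders.

```python
import json


def build_package_config(device, models):
    """Build a dict shaped like mlc-package-config.json (assumed layout)."""
    if device not in ("iphone", "android"):
        raise ValueError(f"unsupported device: {device!r}")
    model_list = []
    for m in models:
        entry = {
            "model": m["model"],
            "model_id": m["model_id"],
            "estimated_vram_bytes": int(m["estimated_vram_bytes"]),
            "bundle_weight": bool(m.get("bundle_weight", False)),
        }
        # Only emit overrides when a smaller context window is requested.
        if "context_window_size" in m:
            entry["overrides"] = {
                "context_window_size": int(m["context_window_size"])
            }
        model_list.append(entry)
    return {"device": device, "model_list": model_list}


config = build_package_config(
    "android",
    [
        {
            # Hypothetical model location and ID, for illustration only.
            "model": "HF://example-org/example-model-q4f16_1-MLC",
            "model_id": "example-model-q4f16_1",
            "estimated_vram_bytes": 3_000_000_000,
            "bundle_weight": True,
            "context_window_size": 2048,
        }
    ],
)
config_text = json.dumps(config, indent=2)
print(config_text)
```

Writing config_text to mlc-package-config.json in the application directory would give the packager its input. Adding more dictionaries to the list yields a multi-model configuration.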

Step 3: Package model libraries and weights

Run the MLC-LLM package command to compile model architectures into platform-specific static libraries, process and optionally bundle model weights, and generate the runtime configuration. The packager downloads pre-quantized weights if needed, performs JIT compilation targeting the mobile GPU, and produces the complete dist/ output directory structure.

Key considerations:

Packaging downloads and caches model weights automatically from HuggingFace
Use MLC_JIT_POLICY=REDO to force recompilation of model libraries
Output libraries are placed in dist/lib/ and bundled weights in dist/bundle/
The process generates an mlc-app-config.json for the runtime to discover available models
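
The packaging step can be scripted so that forced recompilation is an explicit switch. This is a minimal sketch that assumes the CLI is invoked as mlc_llm package from the application directory containing mlc-package-config.json; the helper names are hypothetical, and only MLC_JIT_POLICY=REDO is taken from the text above.

```python
import os
import subprocess


def package_invocation(force_recompile=False, base_env=None):
    """Return (argv, env) for the packaging command without running it."""
    env = dict(os.environ if base_env is None else base_env)
    if force_recompile:
        # Forces model libraries to be recompiled instead of reused.
        env["MLC_JIT_POLICY"] = "REDO"
    return ["mlc_llm", "package"], env


def run_package(app_dir, force_recompile=False):
    """Run the packaging command in app_dir; raises if it exits non-zero."""
    argv, env = package_invocation(force_recompile)
    return subprocess.run(argv, cwd=app_dir, env=env, check=True)
```

Keeping the command construction separate from execution makes it easy to log or inspect the exact environment before a long compile. run_package requires MLC-LLM to be installed and the Step 1 environment to be in place.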

Step 4: Build the native application

Open the platform-specific project in the appropriate IDE and build the application. For iOS, open the MLCChat.xcodeproj in Xcode, select a physical device target, and build. For Android, open the MLCChat project in Android Studio, connect a physical device, and run the build. The build system links the packaged static libraries and bundles the weight files into the application binary.

Key considerations:

Physical devices are required for both platforms; simulators and emulators are not supported for GPU inference
On iOS, code signing requires a valid Apple Developer account
On Android, USB debugging must be enabled on the device
The build automatically picks up libraries and weights from the dist/ directory
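
Because the IDE build depends on the dist/ directory, it can help to confirm the packaging output exists before opening Xcode or Android Studio. The sketch below checks for dist/lib/ and dist/bundle/ as described in Step 3; placing mlc-app-config.json inside dist/bundle/ is an assumption about the layout, and the function name is hypothetical.

```python
from pathlib import Path


def missing_dist_artifacts(dist_dir):
    """Return a list of expected packaging outputs that were not found."""
    dist = Path(dist_dir)
    problems = []

    lib_dir = dist / "lib"
    if not lib_dir.is_dir() or not any(lib_dir.iterdir()):
        problems.append("dist/lib/ is missing or empty")

    bundle_dir = dist / "bundle"
    if not bundle_dir.is_dir():
        problems.append("dist/bundle/ is missing")
    elif not (bundle_dir / "mlc-app-config.json").is_file():
        # Assumed location of the generated runtime configuration.
        problems.append("dist/bundle/mlc-app-config.json is missing")

    return problems
```

An empty result only shows that the files are present, not that the libraries were built for the right target, so a stale dist/ from another platform would still pass.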

Step 5: Deploy and test on device

Install the built application on the target device and verify model loading and inference. On Android, for models packaged with bundle_weight enabled, use the bundle_weight.py script to push the weight files to the device via ADB. Test the model by sending chat messages through the app UI and confirming that responses are generated correctly on the on-device GPU.

Key considerations:

First model load may involve weight download if not bundled (requires network connectivity)
On Android with Adreno GPUs, models using certain quantization suffixes may cause UI freezes during prefill
Monitor memory usage to ensure the model fits within device GPU memory limits
The MLCSwift framework (iOS) and MLCEngine Kotlin class (Android) provide programmatic access for custom apps
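
Before testing on a device, a coarse sanity check is to compare each model's estimated_vram_bytes against a memory budget chosen for the target hardware. The sketch below assumes the configuration layout with a model_list of entries carrying model_id and estimated_vram_bytes; the budget is a number you pick, not something measured from the device, and the function name is hypothetical.

```python
def models_over_budget(package_config, budget_bytes):
    """Return the model_ids whose estimated VRAM exceeds budget_bytes."""
    over = []
    for entry in package_config.get("model_list", []):
        estimate = entry.get("estimated_vram_bytes")
        # Entries without an estimate are skipped rather than guessed at.
        if estimate is not None and estimate > budget_bytes:
            over.append(entry.get("model_id", "<unknown>"))
    return over


# Hypothetical configuration and budget, for illustration only.
example_config = {
    "model_list": [
        {"model_id": "small-model", "estimated_vram_bytes": 2_000_000_000},
        {"model_id": "large-model", "estimated_vram_bytes": 6_000_000_000},
    ]
}
print(models_over_budget(example_config, budget_bytes=4_000_000_000))
```

This is only a static estimate; actual usage depends on context window size and runtime behavior, so on-device monitoring is still needed.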
