Principle: mlc-ai/mlc-llm On-Device Deployment

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Mobile_Deployment
Last Updated 2026-02-09 00:00 GMT

Overview

On-device deployment is the process of transferring compiled model artifacts (inference libraries and weight files) to a physical mobile device and configuring the device environment so that the application can load and execute the model for local inference.

Description

After model libraries have been compiled and weights have been packaged, the final step in the mobile LLM deployment pipeline is getting these artifacts onto a physical device for testing and production use. This involves installing the application binary, transferring potentially large model weight files (often several gigabytes for quantized LLMs), and placing them in the correct filesystem locations where the application can discover and load them at runtime.

On-device deployment differs significantly between iOS and Android:

iOS Deployment: On iOS, model weights that are bundled (via bundle_weight: true in the package configuration) are embedded directly into the application bundle during the Xcode build process. They become part of the IPA file and are installed automatically when the app is deployed to the device. Weights that are not bundled must be downloaded by the application at runtime from a remote URL (typically Hugging Face). The iOS deployment workflow is largely handled by Xcode and does not require manual file transfer.
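For concreteness, the bundling option mentioned above is set per model in the package configuration. The snippet below is an illustrative sketch of an mlc-package-config.json entry in MLC-LLM's format; the model URL is a placeholder, and the model_id matches the example layout used later on this page:

```json
{
  "device": "iphone",
  "model_list": [
    {
      "model": "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC",
      "model_id": "Llama-3.2-3B-Instruct-q4f16_1-MLC",
      "bundle_weight": true
    }
  ]
}
```

With `bundle_weight: true`, the weights are copied into the app bundle at Xcode build time; with `false`, the application must fetch them at runtime from the configured remote URL.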

Android Deployment: On Android, the deployment process is more explicit. The APK (application package) is installed via ADB (Android Debug Bridge), but model weights are typically too large to embed in the APK itself. Instead, weights are pushed to the device's external storage using ADB and then moved to the application's private data directory. This two-step process (push to /data/local/tmp/, then move to the app's files directory) is necessary because ADB push cannot write directly to app-private storage on modern Android versions.

Key challenges in on-device deployment:

  • Storage constraints: Quantized LLM weights can be 2-8 GB per model. Devices must have sufficient free storage, and the transfer process must handle these large files reliably.
  • Filesystem permissions: On Android, the application's data directory (/storage/emulated/0/Android/data/<package>/files/) has restricted access. Files must be pushed to a world-writable staging area and then moved.
  • Transfer speed: USB 2.0 connections limit ADB push speeds to approximately 30-40 MB/s, making multi-gigabyte transfers take several minutes. USB 3.0 connections significantly improve this.
  • Verification: After deployment, it is important to verify that the model loads correctly and inference produces expected results, catching issues like incomplete transfers or architecture mismatches.
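As a rough illustration of the transfer-speed constraint, the expected push time can be estimated from the file size and link speed (integer arithmetic; the 30-40 MB/s figure comes from the bullet above):

```shell
# Estimate ADB push time in whole seconds.
# size_mb: total weight size in MB; speed_mb_s: sustained transfer rate in MB/s.
estimate_transfer_seconds() {
  size_mb="$1"
  speed_mb_s="$2"
  echo $(( size_mb / speed_mb_s ))
}

# A 4 GB (4096 MB) quantized model over USB 2.0 at ~35 MB/s:
estimate_transfer_seconds 4096 35   # roughly two minutes
```

At USB 3.0 speeds (an order of magnitude faster), the same transfer drops to a few seconds, which is why a USB 3.0 cable and port are worth checking before iterating on multi-gigabyte models.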

Usage

Use on-device deployment techniques when:

  • Testing a newly compiled model on a physical Android or iOS device
  • Setting up a device for on-device inference demonstrations
  • Deploying model weight updates to test devices during iterative development
  • Building automated device testing pipelines for mobile LLM applications

Theoretical Basis

The on-device deployment process follows a sequential pipeline:

Step 1: Application Installation
  - Android: adb install <apk_path>
  - iOS: Xcode deployment or Apple Configurator

Step 2: Weight Transfer (Android-specific)
  For each model with bundled weights:
    a. Push weights to staging area:
       adb push <local_weight_dir> /data/local/tmp/<model_id>
    b. Create app data directory:
       adb shell mkdir -p /storage/emulated/0/Android/data/<app_package>/files/
    c. Move weights to app directory:
       adb shell mv /data/local/tmp/<model_id> <app_data_dir>/

Step 3: Verification
  - Launch the application
  - Load the model via the engine API
  - Run a test inference to verify correct output
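Steps 1 and 2 above can be collected into a small host-side script. This is a sketch: the APK path, weight directory, model id, and package name are placeholders passed as arguments, and a DRY_RUN flag is included so the command sequence can be inspected without a connected device. Step 3 (verification) happens inside the app and is not scriptable from the host alone.

```shell
#!/bin/sh
# Run a deployment command, or just print it when DRY_RUN is set.
run() {
  if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

# deploy_model APK_PATH LOCAL_WEIGHT_DIR MODEL_ID APP_PACKAGE
deploy_model() {
  apk="$1"; weight_dir="$2"; model_id="$3"; app_package="$4"
  app_data_dir="/storage/emulated/0/Android/data/${app_package}/files"

  # Step 1: install the application binary.
  run adb install "$apk"
  # Step 2a: push weights to the world-writable staging area.
  run adb push "$weight_dir" "/data/local/tmp/${model_id}"
  # Step 2b: ensure the app's files directory exists.
  run adb shell mkdir -p "${app_data_dir}/"
  # Step 2c: move weights into the app's private data directory.
  run adb shell mv "/data/local/tmp/${model_id}" "${app_data_dir}/"
}

# Dry-run example with placeholder arguments:
DRY_RUN=1 deploy_model app.apk dist/model Llama-3.2-3B-Instruct-q4f16_1-MLC ai.mlc.mlcchat
```

Without DRY_RUN set, the same call executes the four adb commands in order against the attached device.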

Weight delivery strategies compared:

  • Bundle in APK/IPA
      Pros: No network required; works offline immediately.
      Cons: Greatly increases app size; slow initial install; app store size limits.
  • ADB push (development only)
      Pros: Fast iteration during development; full control over file placement.
      Cons: Requires a USB connection; not suitable for end users.
  • Runtime download
      Pros: Small initial app size; users download only the models they need.
      Cons: Requires a network connection; first-launch delay; storage management complexity.

Android filesystem layout for MLC-LLM:

/storage/emulated/0/Android/data/ai.mlc.mlcchat/files/
  +-- Llama-3.2-3B-Instruct-q4f16_1-MLC/
  |     +-- ndarray-cache.json
  |     +-- params_shard_0.bin
  |     +-- params_shard_1.bin
  |     +-- ...
  +-- gemma-2-2b-it-q4f16_1-MLC/
        +-- ndarray-cache.json
        +-- params_shard_0.bin
        +-- ...

Each model directory contains the weight shards and a cache manifest (ndarray-cache.json or tensor-cache.json) that maps parameter names to their shard files and byte offsets.
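Based on the layout and manifest naming above, a quick host-side sanity check can confirm that a model directory is plausibly complete before (or after) pushing it. This is a sketch that only checks for the manifest and at least one shard file; it does not validate byte offsets or contents:

```shell
# Return success (0) if DIR looks like a complete MLC model directory:
# it must contain a cache manifest and at least one parameter shard.
check_model_dir() {
  dir="$1"
  { [ -f "$dir/ndarray-cache.json" ] || [ -f "$dir/tensor-cache.json" ]; } || return 1
  ls "$dir"/params_shard_*.bin >/dev/null 2>&1
}

# Usage: check_model_dir ./Llama-3.2-3B-Instruct-q4f16_1-MLC && echo "looks complete"
```

A deeper check would compare each shard's on-disk size against the byte counts recorded in the manifest, which catches the incomplete-transfer failures mentioned under Verification.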

Related Pages

Implemented By
