Principle: mlc-ai/mlc-llm On-Device Deployment

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Mobile_Deployment
Last Updated 2026-02-09 00:00 GMT

Overview

On-device deployment is the process of transferring compiled model artifacts (inference libraries and weight files) to a physical mobile device and configuring the device environment so that the application can load and execute the model for local inference.

Description

After model libraries have been compiled and weights have been packaged, the final step in the mobile LLM deployment pipeline is getting these artifacts onto a physical device for testing and production use. This involves installing the application binary, transferring potentially large model weight files (often several gigabytes for quantized LLMs), and placing them in the correct filesystem locations where the application can discover and load them at runtime.

On-device deployment differs significantly between iOS and Android:

iOS Deployment: On iOS, model weights that are bundled (via bundle_weight: true in the package configuration) are embedded directly into the application bundle during the Xcode build process. They become part of the IPA file and are installed automatically when the app is deployed to the device. Weights that are not bundled must be downloaded by the application at runtime from a remote URL (typically Hugging Face). The iOS deployment workflow is largely handled by Xcode and does not require manual file transfer.
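For concreteness, the bundling option mentioned above is set per model in the package configuration. The snippet below is an illustrative sketch of an mlc-package-config.json entry in MLC-LLM's format; the model URL is a placeholder, and the model_id matches the example layout used later on this page:

```json
{
  "device": "iphone",
  "model_list": [
    {
      "model": "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC",
      "model_id": "Llama-3.2-3B-Instruct-q4f16_1-MLC",
      "bundle_weight": true
    }
  ]
}
```

With `bundle_weight: true`, the weights are copied into the app bundle at Xcode build time; with `false`, the application must fetch them at runtime from the configured remote URL.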

Android Deployment: On Android, the deployment process is more explicit. The APK (application package) is installed via ADB (Android Debug Bridge), but model weights are typically too large to embed in the APK itself. Instead, weights are pushed to the device's external storage using ADB and then moved to the application's private data directory. This two-step process (push to /data/local/tmp/, then move to the app's files directory) is necessary because ADB push cannot write directly to app-private storage on modern Android versions.

Key challenges in on-device deployment:

  • Storage constraints: Quantized LLM weights can be 2-8 GB per model. Devices must have sufficient free storage, and the transfer process must handle these large files reliably.
  • Filesystem permissions: On Android, the application's data directory (/storage/emulated/0/Android/data/<package>/files/) has restricted access. Files must be pushed to a world-writable staging area and then moved.
  • Transfer speed: USB 2.0 connections limit ADB push speeds to approximately 30-40 MB/s, making multi-gigabyte transfers take several minutes. USB 3.0 connections significantly improve this.
  • Verification: After deployment, it is important to verify that the model loads correctly and inference produces expected results, catching issues like incomplete transfers or architecture mismatches.
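As a rough illustration of the transfer-speed constraint, the expected push time can be estimated from the file size and link speed (integer arithmetic; the 30-40 MB/s figure comes from the bullet above):

```shell
# Estimate ADB push time in whole seconds.
# size_mb: total weight size in MB; speed_mb_s: sustained transfer rate in MB/s.
estimate_transfer_seconds() {
  size_mb="$1"
  speed_mb_s="$2"
  echo $(( size_mb / speed_mb_s ))
}

# A 4 GB (4096 MB) quantized model over USB 2.0 at ~35 MB/s:
estimate_transfer_seconds 4096 35   # roughly two minutes
```

At USB 3.0 speeds (an order of magnitude faster), the same transfer drops to a few seconds, which is why a USB 3.0 cable and port are worth checking before iterating on multi-gigabyte models.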

Usage

Use on-device deployment techniques when:

  • Testing a newly compiled model on a physical Android or iOS device
  • Setting up a device for on-device inference demonstrations
  • Deploying model weight updates to test devices during iterative development
  • Building automated device testing pipelines for mobile LLM applications

Theoretical Basis

The on-device deployment process follows a sequential pipeline:

Step 1: Application Installation
  - Android: adb install <apk_path>
  - iOS: Xcode deployment or Apple Configurator

Step 2: Weight Transfer (Android-specific)
  For each model with bundled weights:
    a. Push weights to staging area:
       adb push <local_weight_dir> /data/local/tmp/<model_id>
    b. Create app data directory:
       adb shell mkdir -p /storage/emulated/0/Android/data/<app_package>/files/
    c. Move weights to app directory:
       adb shell mv /data/local/tmp/<model_id> <app_data_dir>/

Step 3: Verification
  - Launch the application
  - Load the model via the engine API
  - Run a test inference to verify correct output
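Steps 1 and 2 above can be collected into a small host-side script. This is a sketch: the APK path, weight directory, model id, and package name are placeholders passed as arguments, and a DRY_RUN flag is included so the command sequence can be inspected without a connected device. Step 3 (verification) happens inside the app and is not scriptable from the host alone.

```shell
#!/bin/sh
# Run a deployment command, or just print it when DRY_RUN is set.
run() {
  if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

# deploy_model APK_PATH LOCAL_WEIGHT_DIR MODEL_ID APP_PACKAGE
deploy_model() {
  apk="$1"; weight_dir="$2"; model_id="$3"; app_package="$4"
  app_data_dir="/storage/emulated/0/Android/data/${app_package}/files"

  # Step 1: install the application binary.
  run adb install "$apk"
  # Step 2a: push weights to the world-writable staging area.
  run adb push "$weight_dir" "/data/local/tmp/${model_id}"
  # Step 2b: ensure the app's files directory exists.
  run adb shell mkdir -p "${app_data_dir}/"
  # Step 2c: move weights into the app's private data directory.
  run adb shell mv "/data/local/tmp/${model_id}" "${app_data_dir}/"
}

# Dry-run example with placeholder arguments:
DRY_RUN=1 deploy_model app.apk dist/model Llama-3.2-3B-Instruct-q4f16_1-MLC ai.mlc.mlcchat
```

Without DRY_RUN set, the same call executes the four adb commands in order against the attached device.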

Weight delivery strategies compared:

  • Bundle in APK/IPA
      Pros: No network required; works offline immediately.
      Cons: Greatly increases app size; slow initial install; app store size limits.
  • ADB push (development only)
      Pros: Fast iteration during development; full control over file placement.
      Cons: Requires a USB connection; not suitable for end users.
  • Runtime download
      Pros: Small initial app size; users download only the models they need.
      Cons: Requires a network connection; first-launch delay; storage management complexity.

Android filesystem layout for MLC-LLM:

/storage/emulated/0/Android/data/ai.mlc.mlcchat/files/
  +-- Llama-3.2-3B-Instruct-q4f16_1-MLC/
  |     +-- ndarray-cache.json
  |     +-- params_shard_0.bin
  |     +-- params_shard_1.bin
  |     +-- ...
  +-- gemma-2-2b-it-q4f16_1-MLC/
        +-- ndarray-cache.json
        +-- params_shard_0.bin
        +-- ...

Each model directory contains the weight shards and a cache manifest (ndarray-cache.json or tensor-cache.json) that maps parameter names to their shard files and byte offsets.
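Based on the layout and manifest naming above, a quick host-side sanity check can confirm that a model directory is plausibly complete before (or after) pushing it. This is a sketch that only checks for the manifest and at least one shard file; it does not validate byte offsets or contents:

```shell
# Return success (0) if DIR looks like a complete MLC model directory:
# it must contain a cache manifest and at least one parameter shard.
check_model_dir() {
  dir="$1"
  { [ -f "$dir/ndarray-cache.json" ] || [ -f "$dir/tensor-cache.json" ]; } || return 1
  ls "$dir"/params_shard_*.bin >/dev/null 2>&1
}

# Usage: check_model_dir ./Llama-3.2-3B-Instruct-q4f16_1-MLC && echo "looks complete"
```

A deeper check would compare each shard's on-disk size against the byte counts recorded in the manifest, which catches the incomplete-transfer failures mentioned under Verification.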

Related Pages

Implemented By
