Principle:Mlc ai Mlc llm Model Packaging Configuration
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Mobile_Deployment |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Model packaging configuration is a declarative approach to specifying which language models should be compiled, bundled, and deployed to a target mobile platform, along with their resource budgets and optimization parameters.
Description
Mobile devices have strict constraints on memory, storage, and compute resources. Unlike server-side deployment where models can be loaded dynamically and memory is relatively abundant, mobile deployment requires careful upfront planning of which models will be available, how much VRAM each model is expected to consume, and whether model weights should be bundled directly into the application binary or downloaded at runtime.
A model packaging configuration serves as the single source of truth for the mobile packaging pipeline. It declares:
- Target device -- Whether the package targets iOS (`iphone`) or Android (`android`), which determines the compilation backend (Metal vs. OpenCL), library format (static vs. shared), and weight handling strategy.
- Model list -- An array of model entries, each specifying the model source (typically a Hugging Face repository), a human-readable model identifier, the estimated VRAM consumption in bytes, and optional compilation overrides.
- Compilation overrides -- Per-model adjustments to default compilation parameters such as `prefill_chunk_size`, `context_window_size`, and `sliding_window_size` that tune the model for the memory and latency characteristics of mobile hardware.
- Weight bundling -- A boolean flag controlling whether model weights are embedded directly into the application package (increasing app size but enabling offline use) or downloaded on first launch (smaller initial download but requires network connectivity).
This declarative approach separates the what (which models, with what settings) from the how (cross-compilation, library linking, APK/IPA assembly), enabling developers to modify their mobile model lineup without touching build scripts or application code.
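As an illustration of the declared fields, the following Python sketch builds a single-model Android configuration and serializes it to JSON. The model path is the example used in the schema table on this page; the `model_id`, the VRAM estimate, and the override values are assumptions chosen for illustration, not measured or recommended settings.

```python
import json

# Illustrative values only: the model_id, VRAM estimate, and override values
# below are assumptions for this sketch, not measured or recommended settings.
package_config = {
    "device": "android",
    "model_list": [
        {
            "model": "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC",
            "model_id": "Llama-3.2-3B-Instruct-q4f16_1-MLC",
            "estimated_vram_bytes": 3_000_000_000,
            "overrides": {
                "prefill_chunk_size": 128,
                "context_window_size": 2048,
            },
            # False: weights are downloaded on first launch instead of bundled.
            "bundle_weight": False,
        }
    ],
}

# Serialize to the JSON form that a packaging pipeline would read.
config_json = json.dumps(package_config, indent=2)
print(config_json)
```

Because the configuration is plain data, switching the target platform or swapping a model is a one-line change to this structure rather than a change to build scripts.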
Usage
Use model packaging configuration when:
- Defining which models an iOS or Android application should support
- Tuning model compilation parameters (chunk sizes, context windows) for specific device memory profiles
- Deciding between bundling model weights in-app versus downloading them on first launch
- Managing multiple deployment configurations (e.g., a lightweight build with one small model versus a full-featured build with several models)
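The last use case, maintaining several deployment configurations, can be sketched as generating each configuration from one shared catalog of model entries. Everything named here is hypothetical: the `MODEL_CATALOG` contents, the `example-org` model paths, the VRAM figures, and the `build_config` helper are placeholders for illustration and are not part of any real tool.

```python
import copy

# Hypothetical catalog; paths, ids, and VRAM figures are placeholders.
MODEL_CATALOG = {
    "small": {
        "model": "HF://example-org/small-model-q4f16_1-MLC",
        "model_id": "small-model",
        "estimated_vram_bytes": 1_500_000_000,
        "overrides": {"prefill_chunk_size": 128, "context_window_size": 2048},
    },
    "large": {
        "model": "HF://example-org/large-model-q4f16_1-MLC",
        "model_id": "large-model",
        "estimated_vram_bytes": 4_500_000_000,
        "overrides": {"prefill_chunk_size": 128, "context_window_size": 1024},
    },
}


def build_config(device, model_keys, bundle_weight=False):
    """Assemble a packaging config for one device from catalog entries."""
    if device not in ("iphone", "android"):
        raise ValueError(f"unsupported device: {device!r}")
    model_list = []
    for key in model_keys:
        # Deep-copy so per-build changes never leak back into the catalog.
        entry = copy.deepcopy(MODEL_CATALOG[key])
        entry["bundle_weight"] = bundle_weight
        model_list.append(entry)
    return {"device": device, "model_list": model_list}


# Lightweight build: one small model, weights bundled for offline use.
lite = build_config("iphone", ["small"], bundle_weight=True)
# Full-featured build: both models, weights downloaded on first launch.
full = build_config("iphone", ["small", "large"])
```

Keeping the per-model settings in one place means the two builds differ only in which entries they select and whether weights are bundled.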
Theoretical Basis
The configuration schema follows a declarative pattern common in mobile development (analogous to Gradle build files or Xcode project settings):
{
  "device": "<iphone|android>",
  "model_list": [
    {
      "model": "<HF://org/model-name or local path>",
      "model_id": "<unique string identifier>",
      "estimated_vram_bytes": <integer>,
      "overrides": {
        "prefill_chunk_size": <integer>,
        "context_window_size": <integer>,
        "sliding_window_size": <integer>
      },
      "bundle_weight": <true|false>
    }
  ]
}
Schema semantics:
| Field | Type | Description |
|---|---|---|
| `device` | string | Target platform. Determines the compilation backend and library format. |
| `model_list` | array | List of model entries to include in the package. |
| `model` | string | Model source. Typically a Hugging Face path (e.g., `HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC`) or a local directory. |
| `model_id` | string | Unique identifier used to reference the model at runtime and for weight directory naming. |
| `estimated_vram_bytes` | integer | Expected peak VRAM usage in bytes. Used by the runtime to manage model loading and eviction. |
| `overrides` | object | Optional compilation parameter overrides that adjust memory/latency tradeoffs. |
| `bundle_weight` | boolean | When `true`, model weights are copied into the application bundle. Defaults to `false`. |
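The field types in the table can be checked before a build starts. The sketch below is a minimal validator written for this page, not the validation performed by any real packaging tool. It assumes that `model_list` must be non-empty, that `model`, `model_id`, and `estimated_vram_bytes` are required, and that `model_id` values must not repeat; it type-checks only the three override keys named above and ignores any others. The function name `validate_package_config` is hypothetical.

```python
ALLOWED_DEVICES = {"iphone", "android"}
KNOWN_OVERRIDES = ("prefill_chunk_size", "context_window_size", "sliding_window_size")


def _is_positive_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_package_config(config):
    """Return a list of problems found; an empty list means the config passed."""
    errors = []
    if config.get("device") not in ALLOWED_DEVICES:
        errors.append("device must be 'iphone' or 'android'")
    model_list = config.get("model_list")
    if not isinstance(model_list, list) or not model_list:
        errors.append("model_list must be a non-empty array")
        return errors
    seen_ids = set()
    for i, entry in enumerate(model_list):
        where = f"model_list[{i}]"
        if not isinstance(entry, dict):
            errors.append(f"{where} must be an object")
            continue
        for key in ("model", "model_id"):
            value = entry.get(key)
            if not isinstance(value, str) or not value:
                errors.append(f"{where}.{key} must be a non-empty string")
        model_id = entry.get("model_id")
        if isinstance(model_id, str):
            if model_id in seen_ids:
                errors.append(f"{where}.model_id duplicates {model_id!r}")
            seen_ids.add(model_id)
        if not _is_positive_int(entry.get("estimated_vram_bytes")):
            errors.append(f"{where}.estimated_vram_bytes must be a positive integer")
        overrides = entry.get("overrides", {})  # optional field
        if not isinstance(overrides, dict):
            errors.append(f"{where}.overrides must be an object")
        else:
            for key in KNOWN_OVERRIDES:
                if key in overrides and not _is_positive_int(overrides[key]):
                    errors.append(f"{where}.overrides.{key} must be a positive integer")
        # bundle_weight defaults to false when omitted.
        if not isinstance(entry.get("bundle_weight", False), bool):
            errors.append(f"{where}.bundle_weight must be a boolean")
    return errors
```

Returning a list of problems rather than raising on the first one lets a developer fix every issue in a configuration in a single pass.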
Design tradeoffs:
- Bundled weights increase the application size significantly (often by several gigabytes for quantized LLMs) but guarantee offline availability and eliminate first-launch download latency.
- Smaller `prefill_chunk_size` reduces peak memory usage during the prefill phase at the cost of increased prefill latency, which is a practical tradeoff for memory-constrained mobile devices.
- Reduced `context_window_size` limits the maximum conversation length but proportionally reduces the KV cache memory footprint, which is often the binding constraint on mobile VRAM.
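The proportional relationship between `context_window_size` and KV cache size can be made concrete with a back-of-the-envelope estimate. The sketch below uses the standard dense-attention approximation (one key and one value vector per layer, per KV head, per token) and ignores runtime overheads such as paging or sliding-window attention. The model shape (28 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumed example, and `kv_cache_bytes` is a hypothetical helper, not a function of any library.

```python
def kv_cache_bytes(context_window_size, num_layers, num_kv_heads, head_dim,
                   bytes_per_element=2):
    """Approximate KV cache size for one sequence with dense attention."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_window_size * bytes_per_element)


# Assumed model shape for illustration; substitute the real values
# from the model's own configuration.
shape = dict(num_layers=28, num_kv_heads=8, head_dim=128)

for ctx in (4096, 2048, 1024):
    size = kv_cache_bytes(ctx, **shape)
    print(f"context_window_size={ctx}: {size / 2**20:.0f} MiB")
```

Under these assumptions, halving the context window halves the estimate, which is why this override is usually the first one to adjust when a model does not fit the device's memory budget.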