Environment:Mlc ai Mlc llm Metal macOS iOS Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Mobile, GPU_Acceleration |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Apple Metal GPU environment for macOS desktop and iOS on-device LLM inference, using the Xcode toolchain for compilation and static library linking.
Description
This environment enables LLM inference on Apple Silicon Macs (M1/M2/M3/M4) and iOS devices through the Metal GPU backend. On macOS, models are compiled to `.dylib` shared libraries. On iOS, models are compiled to `.tar` static archives that are linked into Xcode projects through the MLCSwift framework. The `iphone:generic` target preset declares a thread warp size of 1, a maximum of 256 threads per block, and a maximum of 32 KB (32768 bytes) of shared memory per block; these limits affect kernel scheduling. FlashInfer is not available on Metal, so the environment uses the TIR-based PagedKVCache instead.
Usage
Use this environment when deploying LLMs to macOS desktops or iOS devices (iPhone/iPad). It is required for the Mobile Deployment workflow and for running MLCEngine on Apple platforms.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | macOS 13+ (Ventura) / iOS 16+ | Apple Silicon (M1+) recommended for macOS |
| Hardware | Apple GPU (Metal-capable) | M1/M2/M3/M4 for macOS; A14+ for iOS |
| Toolchain | Xcode 14+ | Required for iOS builds and Metal shader compilation |
| Disk | 5GB+ | For model weights and compiled libraries |
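The version floors in the table can be checked before starting a build. The following is a minimal sketch and not part of MLC LLM: the names `MINIMUMS`, `parse_version`, and `meets_minimum` are hypothetical, and the thresholds are simply copied from the table above.

```python
from __future__ import annotations

# Hypothetical helper, not part of MLC LLM: compares dotted version strings
# against the minimums listed in the System Requirements table.
MINIMUMS = {
    "macos": (13,),  # macOS 13+ (Ventura)
    "ios": (16,),    # iOS 16+
    "xcode": (14,),  # Xcode 14+
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as "14.3.1" into (14, 3, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(component: str, version: str) -> bool:
    """Return True if `version` is at or above the documented floor."""
    return parse_version(version) >= MINIMUMS[component]
```

On a Mac, the standard-library call `platform.mac_ver()[0]` returns the macOS version string that could be passed as `version`.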
Dependencies
System Packages
- `xcode` (Apple Xcode with Metal SDK)
- `cmake` < 4.0
- `git`
iOS Build Dependencies
- MLCSwift framework (included in repository at `ios/MLCSwift/`)
- `prepare_libs.sh` script (builds static libraries for device or simulator)
Python Packages
- `apache-tvm-ffi` (TVM FFI bindings)
- `torch` (for weight conversion)
- `transformers`
- `safetensors`
Credentials
No API keys or tokens are required. An Apple Developer account is needed only for deployment to physical iOS devices.
Quick Install
```shell
# For macOS compilation
pip install mlc-llm

# For iOS: build static libraries
cd ios && ./prepare_libs.sh

# For iOS simulator target
cd ios && ./prepare_libs.sh --simulator
```
Code Evidence
Metal target preset from `auto_target.py:394-407`:
```python
"iphone:generic": {
    "target": {
        "kind": "metal",
        "max_threads_per_block": 256,
        "max_shared_memory_per_block": 32768,
        "thread_warp_size": 1,
        "libs": ["iphoneos"],
        "host": {
            "kind": "llvm",
            "mtriple": "arm64-apple-darwin",
        },
    },
    "build": _build_iphone,
},
```
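To make the preset's limits concrete, the arithmetic below shows how many elements of a given dtype fit in the 32768-byte shared-memory budget and how that budget divides across a full 256-thread block. This is an illustrative sketch only: the helper names are hypothetical, and the dtype sizes are the standard byte widths rather than values read from MLC LLM.

```python
# Limits copied from the "iphone:generic" preset shown above.
MAX_THREADS_PER_BLOCK = 256
MAX_SHARED_MEMORY_PER_BLOCK = 32768  # bytes

# Standard element sizes in bytes (an assumption for illustration).
DTYPE_BYTES = {"float16": 2, "float32": 4}


def max_shared_elements(dtype: str) -> int:
    """Number of `dtype` elements that fit in one block's shared memory."""
    return MAX_SHARED_MEMORY_PER_BLOCK // DTYPE_BYTES[dtype]


def shared_elements_per_thread(dtype: str, threads: int = MAX_THREADS_PER_BLOCK) -> int:
    """Shared-memory elements available per thread for a given block size."""
    if not 0 < threads <= MAX_THREADS_PER_BLOCK:
        raise ValueError(f"threads must be in 1..{MAX_THREADS_PER_BLOCK}")
    return max_shared_elements(dtype) // threads
```

For example, a full 256-thread block working in `float16` has room for 16384 shared elements, or 64 per thread.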
Metal KV cache capacity limit from `config.cc:746-751`:
```cpp
if (device.device_type == DLDeviceType::kDLMetal) {
  // NOTE: Metal runtime has severe performance issues with large buffers.
  // To work around the issue, we limit the KV cache capacity to 32768.
  model_max_total_sequence_length =
      std::min(model_max_total_sequence_length, static_cast<int64_t>(32768));
}
```
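The effect of this clamp can be restated in a few lines of Python. This is a sketch of the same `min` logic, not the engine's actual code path: the function name, the constant name, and the use of a plain string for the device type are all assumptions made for illustration.

```python
# Cap taken from the C++ snippet above.
METAL_MAX_KV_CACHE_TOKENS = 32768


def effective_max_total_sequence_length(requested: int, device_type: str) -> int:
    """Mirror the Metal-only clamp: other devices keep the requested value."""
    if device_type == "metal":
        return min(requested, METAL_MAX_KV_CACHE_TOKENS)
    return requested
```

A model that advertises a 131072-token total sequence length is therefore limited to 32768 on Metal, while a model that requests 8192 is unaffected.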
iOS build function from `auto_target.py:161-182`:
```python
def _build_iphone():
    @register_global_func("tvm_callback_metal_compile", override=True)
    def compile_metal(src, target):
        if target.libs:
            return xcode.compile_metal(src, sdk=target.libs[0])
        return xcode.compile_metal(src)

    def build(mod, args, pipeline=None):
        output = args.output
        mod = _add_system_lib_prefix(mod, args.system_lib_prefix, is_system_lib=True)
        assert output.suffix == ".tar"
        relax.build(mod, target=args.target, relax_pipeline=pipeline,
                    system_lib=True).export_library(str(output), fcompile=tar.tar)

    return build
```
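The build function asserts that the iOS output path ends in `.tar`, and the Description states that macOS builds produce `.dylib` files. A caller could check the output path up front and get a readable error instead of a bare assertion failure. The sketch below assumes hypothetical names (`EXPECTED_SUFFIX`, `check_output_suffix`) and hypothetical platform labels (`"macos"`, `"ios"`); it is not an MLC LLM API.

```python
from pathlib import Path

# Suffixes taken from this page; the platform labels are illustrative only.
EXPECTED_SUFFIX = {"macos": ".dylib", "ios": ".tar"}


def check_output_suffix(platform: str, output: str) -> Path:
    """Return the output path if its suffix matches the platform, else raise."""
    path = Path(output)
    expected = EXPECTED_SUFFIX[platform]
    if path.suffix != expected:
        raise ValueError(
            f"{platform} builds expect a '{expected}' output, got '{path.name}'"
        )
    return path
```

Note that `Path.suffix` only looks at the final extension, so a name such as `model.tar.gz` would be rejected for iOS.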
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
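| Metal shader compilation failure | Xcode SDK mismatch | Ensure Xcode and its Command Line Tools are installed (`xcode-select --install`), and check with `xcode-select -p` that the active developer directory is the intended Xcode installation |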
| KV cache limited to 32768 tokens | Metal large buffer performance workaround | This is intentional. For larger context windows, use a non-Metal backend such as CUDA on other hardware |
| `--system-lib-prefix is not specified` | Missing prefix for static library build | Pass `--system-lib-prefix` flag or let auto-detection handle it |
Compatibility Notes
- KV Cache Limit: The Metal runtime has severe performance issues with large buffers, so MLC-LLM automatically caps KV cache capacity at 32768 tokens on Metal devices.
- FlashInfer: Not available on Metal. The TIR-based PagedKVCache is used instead.
- Thread Warp Size: The `iphone:generic` Metal target preset sets `thread_warp_size` to 1 (CUDA targets use 32), which affects kernel scheduling.
- Optimization Flags: cuBLAS, CUTLASS, CUDA graphs, and FlashInfer are all disabled on Metal. Only TIR-based kernels are used.
- iOS Simulator: Use `prepare_libs.sh --simulator` for x86_64 simulator builds.
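The optimization-flag note above can be pictured as a simple override step: whatever the caller requests, the CUDA-specific features end up disabled when the target kind is Metal. The sketch below is a hypothetical model of that behavior; the class, field, and function names are invented for illustration and are not taken from MLC LLM.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OptimizationFlags:
    """Hypothetical flag set covering the features named in the notes above."""
    flashinfer: bool = True
    cublas: bool = True
    cutlass: bool = True
    cuda_graph: bool = True


def flags_for_target(requested: OptimizationFlags, target_kind: str) -> OptimizationFlags:
    """Disable every CUDA-specific feature when building for Metal."""
    if target_kind == "metal":
        return replace(
            requested, flashinfer=False, cublas=False, cutlass=False, cuda_graph=False
        )
    return requested
```

With all four flags off, only the TIR-based kernels remain, which matches the behavior described for Metal.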