Environment:Mlc ai Mlc llm Metal macOS iOS Environment

Knowledge Sources	MLC-LLM Apple Metal
Domains	Infrastructure, Mobile, GPU_Acceleration
Last Updated	2026-02-09 19:00 GMT

Overview

Apple Metal GPU environment for on-device LLM inference on macOS desktops and iOS devices, using the Xcode toolchain for compilation and static library linking.

Description

This environment enables LLM inference on Apple Silicon Macs (M1/M2/M3/M4) and iOS devices via the Metal GPU backend. For macOS, models are compiled to `.dylib` shared libraries. For iOS, models are compiled to `.tar` static archives that are linked into Xcode projects via the MLCSwift framework. The `iphone:generic` target preset declares a thread warp size of 1, at most 256 threads per block, and 32 KB (32768 bytes) of shared memory per block, all of which constrain kernel scheduling. Because FlashInfer is not available on Metal, the environment uses the TIR-based PagedKVCache.
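The platform-to-artifact mapping described above can be sketched as a small lookup. The helper names `expected_library_suffix` and `check_output_path` and the platform keys are hypothetical, chosen for illustration; they are not part of MLC-LLM's API.

```python
from pathlib import Path

# Hypothetical mapping taken from the description above:
# macOS models compile to shared libraries, iOS models to static archives.
ARTIFACT_SUFFIX = {
    "macos": ".dylib",
    "ios": ".tar",
}


def expected_library_suffix(platform: str) -> str:
    """Return the model-library suffix expected for an Apple platform."""
    try:
        return ARTIFACT_SUFFIX[platform.lower()]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform!r}") from None


def check_output_path(output: str, platform: str) -> Path:
    """Validate that an output path carries the suffix the platform expects."""
    path = Path(output)
    suffix = expected_library_suffix(platform)
    if path.suffix != suffix:
        raise ValueError(
            f"{platform} builds expect a {suffix} output, got {path.name}"
        )
    return path
```

Checking the suffix before starting a build catches a mismatched output name early rather than at the end of a long compilation.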

Usage

Use this environment when deploying LLMs to macOS desktops or iOS devices (iPhone/iPad). It is required for the Mobile Deployment workflow and for running MLCEngine on Apple platforms.

System Requirements

Category	Requirement	Notes
OS	macOS 13+ (Ventura) / iOS 16+	Apple Silicon (M1+) recommended for macOS
Hardware	Apple GPU (Metal-capable)	M1/M2/M3/M4 for macOS; A14+ for iOS
Toolchain	Xcode 14+	Required for iOS builds and Metal shader compilation
Disk	5GB+	For model weights and compiled libraries

Dependencies

System Packages

`xcode` (Apple Xcode with Metal SDK)
`cmake` < 4.0
`git`
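A preflight check for the system packages listed above might look like the following sketch. The function names are hypothetical, and the parser assumes the usual `cmake version X.Y.Z` first line printed by `cmake --version`.

```python
import re
import shutil


def parse_cmake_version(version_output: str) -> tuple:
    """Extract (major, minor, patch) from the output of `cmake --version`."""
    match = re.search(r"cmake version (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a version in cmake output")
    return tuple(int(part) for part in match.groups())


def cmake_is_supported(version: tuple) -> bool:
    """The dependency list above requires cmake < 4.0."""
    return version < (4, 0, 0)


def missing_tools(tools=("xcode-select", "cmake", "git")) -> list:
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

To use it, pass the captured output of `cmake --version` to `parse_cmake_version`, and call `missing_tools()` to list anything absent from `PATH`.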

iOS Build Dependencies

MLCSwift framework (included in repository at `ios/MLCSwift/`)
`prepare_libs.sh` script (builds static libraries for device or simulator)

Python Packages

`apache-tvm-ffi` (TVM FFI bindings)
`torch` (for weight conversion)
`transformers`
`safetensors`

Credentials

No API tokens or other credentials are required to build. An Apple Developer account is needed to deploy to a physical iOS device.

Quick Install

# For macOS compilation
pip install mlc-llm

# For iOS: build static libraries
cd ios && ./prepare_libs.sh

# For iOS simulator target
cd ios && ./prepare_libs.sh --simulator

Code Evidence

iOS Metal target preset (`iphone:generic`) from `auto_target.py:394-407`:

"iphone:generic": {
    "target": {
        "kind": "metal",
        "max_threads_per_block": 256,
        "max_shared_memory_per_block": 32768,
        "thread_warp_size": 1,
        "libs": ["iphoneos"],
        "host": {
            "kind": "llvm",
            "mtriple": "arm64-apple-darwin",
        },
    },
    "build": _build_iphone,
},
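To illustrate how the preset's limits constrain a kernel, the sketch below checks a launch configuration against the two numeric bounds shown above. `LaunchConfig` and `fits_preset` are hypothetical names for illustration only; TVM applies these limits internally during scheduling rather than through a helper like this.

```python
from dataclasses import dataclass

# Limits copied from the "iphone:generic" preset above.
MAX_THREADS_PER_BLOCK = 256
MAX_SHARED_MEMORY_PER_BLOCK = 32768  # bytes


@dataclass(frozen=True)
class LaunchConfig:
    """Hypothetical description of one kernel launch."""

    threads_per_block: int
    shared_elements: int    # elements staged in shared memory
    bytes_per_element: int  # e.g. 2 for float16, 4 for float32

    @property
    def shared_bytes(self) -> int:
        return self.shared_elements * self.bytes_per_element


def fits_preset(cfg: LaunchConfig) -> bool:
    """Return True if the launch stays within both preset limits."""
    return (
        0 < cfg.threads_per_block <= MAX_THREADS_PER_BLOCK
        and cfg.shared_bytes <= MAX_SHARED_MEMORY_PER_BLOCK
    )
```

For example, 16384 float16 elements occupy exactly 32768 bytes and fit, while the same number of float32 elements needs 65536 bytes and does not.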

Metal KV cache capacity limit from `config.cc:746-751`:

if (device.device_type == DLDeviceType::kDLMetal) {
    // NOTE: Metal runtime has severe performance issues with large buffers.
    // To work around the issue, we limit the KV cache capacity to 32768.
    model_max_total_sequence_length =
        std::min(model_max_total_sequence_length, static_cast<int64_t>(32768));
}
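The effect of the cap above can be mirrored in a few lines of Python. This is only an illustration of the `std::min` clamp in the excerpt; the function name and the string device kinds are assumptions, not MLC-LLM API.

```python
# Cap taken from the config.cc excerpt above.
METAL_MAX_TOTAL_SEQUENCE_LENGTH = 32768


def clamp_total_sequence_length(requested: int, device_kind: str) -> int:
    """Apply the Metal-only KV cache capacity cap; other devices pass through."""
    if device_kind == "metal":
        return min(requested, METAL_MAX_TOTAL_SEQUENCE_LENGTH)
    return requested
```

A request for 131072 tokens on Metal is therefore reduced to 32768, while a request for 8192 tokens is left unchanged.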

iOS build function from `auto_target.py:161-182`:

def _build_iphone():
    @register_global_func("tvm_callback_metal_compile", override=True)
    def compile_metal(src, target):
        if target.libs:
            return xcode.compile_metal(src, sdk=target.libs[0])
        return xcode.compile_metal(src)

    def build(mod, args, pipeline=None):
        output = args.output
        mod = _add_system_lib_prefix(mod, args.system_lib_prefix, is_system_lib=True)
        assert output.suffix == ".tar"
        relax.build(mod, target=args.target, relax_pipeline=pipeline,
                    system_lib=True).export_library(str(output), fcompile=tar.tar)

    return build

Common Errors

Error Message	Cause	Solution
Metal shader compilation failure	Xcode SDK mismatch	Ensure Xcode Command Line Tools are installed: `xcode-select --install`
KV cache limited to 32768 tokens	Metal large buffer performance workaround	This is intentional; use CUDA for larger context windows
`--system-lib-prefix is not specified`	Missing prefix for static library build	Pass `--system-lib-prefix` flag or let auto-detection handle it

Compatibility Notes

KV Cache Limit: Metal runtime has severe performance issues with large buffers. MLC-LLM automatically caps KV cache capacity at 32768 tokens on Metal devices.
FlashInfer: Not available on Metal. The TIR-based PagedKVCache is used instead.
Thread Warp Size: The `iphone:generic` preset sets `thread_warp_size` to 1 (vs. 32 on CUDA), which affects kernel scheduling.
Optimization Flags: cuBLAS, CUTLASS, CUDA graphs, and FlashInfer are all disabled on Metal. Only TIR-based kernels are used.
iOS Simulator: Use `prepare_libs.sh --simulator` for x86_64 simulator builds.
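The optimization-flag note above amounts to forcing the CUDA-specific options off when the target is Metal. A minimal sketch of that rule follows; `OptFlags`, its field names, and `resolve_flags` are hypothetical stand-ins and may not match MLC-LLM's actual flag names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OptFlags:
    """Hypothetical set of backend optimization switches."""

    flashinfer: bool = True
    cublas_gemm: bool = True
    cutlass: bool = True
    cudagraph: bool = True


def resolve_flags(requested: OptFlags, target_kind: str) -> OptFlags:
    """Disable CUDA-only optimizations on Metal, leaving TIR kernels only."""
    if target_kind == "metal":
        return replace(
            requested,
            flashinfer=False,
            cublas_gemm=False,
            cutlass=False,
            cudagraph=False,
        )
    return requested
```

Resolving flags in one place keeps the rest of a build script free of per-backend conditionals.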
