Implementation: mlc-ai/mlc-llm MLCEngine Mobile Bindings
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Mobile_Deployment |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tools for integrating compiled LLM inference engines into native mobile applications via platform-specific SDK bindings provided by MLC-LLM.
Description
MLC-LLM provides two platform-specific engine implementations that bridge the C++ inference runtime with native mobile application code:
Android (Kotlin): The MLCEngine class is the primary entry point for Android applications. It wraps the JSONFFIEngine Java class (which communicates with the C++ engine via TVM's JNI bridge) and provides a high-level, coroutine-based API following the OpenAI Chat Completions protocol. Upon initialization, it starts two background worker threads: one for the inference loop and one for the stream-back loop. The chat.completions.create() method accepts a ChatCompletionRequest and returns a Kotlin ReceiveChannel of streaming ChatCompletionStreamResponse objects.
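The two-loop design can be illustrated with a minimal, self-contained sketch. This is not MLC-LLM source code: plain Java threads and a BlockingQueue stand in for the C++ engine, the JNI bridge, and the Kotlin channel, purely to show how an inference loop and a stream-back loop cooperate.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the engine-internal stream of generated chunks
        BlockingQueue<String> streamBack = new LinkedBlockingQueue<>();

        // "Inference loop": produces streamed chunks for one request,
        // then a sentinel marking the end of the stream.
        Thread inferenceLoop = new Thread(() -> {
            for (String token : new String[] {"Hello", ", ", "world", "[DONE]"}) {
                streamBack.add(token);
            }
        });

        // "Stream-back loop": forwards chunks to the application side;
        // in MLCEngine this is where results reach the ReceiveChannel.
        StringBuilder received = new StringBuilder();
        Thread streamBackLoop = new Thread(() -> {
            try {
                String chunk;
                while (!(chunk = streamBack.take()).equals("[DONE]")) {
                    received.append(chunk);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        inferenceLoop.start();
        streamBackLoop.start();
        inferenceLoop.join();
        streamBackLoop.join();
        System.out.println(received); // Hello, world
    }
}
```

Keeping generation and delivery on separate workers is what lets the UI thread consume tokens as they arrive without blocking on inference.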
iOS (Objective-C): The JSONFFIEngine class is an Objective-C interface that exposes the C++ JSON FFI engine to Swift code. It provides methods for engine lifecycle management (initBackgroundEngine:, reload:, unload, reset), inference (chatCompletion:requestID:), request cancellation (abort:), and background loop management (runBackgroundLoop, runBackgroundStreamBackLoop, exitBackgroundLoop). Streaming results are delivered via a callback block registered during initBackgroundEngine:.
Usage
Use these APIs when:
- Building a new Android application with on-device LLM chat capabilities
- Building a new iOS application with on-device LLM inference
- Implementing custom UI flows around streaming text generation
- Integrating MLC-LLM into an existing mobile application as an AI feature
Code Reference
Source Location
- Repository: MLC-LLM
- File (Android MLCEngine): android/mlc4j/src/main/java/ai/mlc/mlcllm/MLCEngine.kt (Lines 24-74)
- File (Android JSONFFIEngine): android/mlc4j/src/main/java/ai/mlc/mlcllm/JSONFFIEngine.java (Lines 9-87)
- File (iOS JSONFFIEngine): ios/MLCSwift/Sources/ObjC/include/LLMEngine.h (Lines 12-32)
Signature (Android - MLCEngine Kotlin)
class MLCEngine {
    val chat: Chat
    fun reload(modelPath: String, modelLib: String)
    fun reset()
    fun unload()
}

class Chat(
    private val jsonFFIEngine: JSONFFIEngine,
    private val state: EngineState
) {
    val completions: Completions
}

class Completions(
    private val jsonFFIEngine: JSONFFIEngine,
    private val state: EngineState
) {
    suspend fun create(
        request: ChatCompletionRequest
    ): ReceiveChannel<ChatCompletionStreamResponse>

    suspend fun create(
        messages: List<ChatCompletionMessage>,
        model: String? = null,
        frequency_penalty: Float? = null,
        presence_penalty: Float? = null,
        logprobs: Boolean = false,
        top_logprobs: Int = 0,
        logit_bias: Map<Int, Float>? = null,
        max_tokens: Int? = null,
        n: Int = 1,
        seed: Int? = null,
        stop: List<String>? = null,
        stream: Boolean = true,
        stream_options: StreamOptions? = null,
        temperature: Float? = null,
        top_p: Float? = null,
        tools: List<ChatTool>? = null,
        user: String? = null,
        response_format: ResponseFormat? = null
    ): ReceiveChannel<ChatCompletionStreamResponse>
}
Signature (Android - JSONFFIEngine Java)
public class JSONFFIEngine {
    public JSONFFIEngine()
    public void initBackgroundEngine(KotlinFunction callback)
    public void reload(String engineConfigJSONStr)
    public void chatCompletion(String requestJSONStr, String requestId)
    public void runBackgroundLoop()
    public void runBackgroundStreamBackLoop()
    public void exitBackgroundLoop()
    public void unload()
    public void reset()

    public interface KotlinFunction {
        void invoke(String arg);
    }
}
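The KotlinFunction interface is the path by which streamed JSON chunks reach application code: the callback is registered once via initBackgroundEngine and then invoked per chunk. A minimal sketch of that wiring, where MockEngine is a hypothetical stand-in (the real JSONFFIEngine needs the native TVM runtime to run):

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    // Mirrors JSONFFIEngine.KotlinFunction: one String argument, no return.
    interface KotlinFunction {
        void invoke(String arg);
    }

    // Hypothetical stand-in for JSONFFIEngine; the real class forwards
    // chunks produced by the C++ engine across the JNI bridge.
    static class MockEngine {
        private KotlinFunction callback;

        void initBackgroundEngine(KotlinFunction cb) {
            this.callback = cb;
        }

        void chatCompletion(String requestJSON, String requestId) {
            // The real engine streams chunks asynchronously from its
            // stream-back loop; two fake chunks stand in here.
            callback.invoke("{\"id\": \"" + requestId + "\", \"chunk\": 1}");
            callback.invoke("{\"id\": \"" + requestId + "\", \"chunk\": 2}");
        }
    }

    public static void main(String[] args) {
        List<String> chunks = new ArrayList<>();
        MockEngine engine = new MockEngine();
        engine.initBackgroundEngine(chunks::add); // register the callback once
        engine.chatCompletion("{\"messages\": []}", "req-1");
        System.out.println(chunks.size()); // 2
    }
}
```

MLCEngine's Kotlin layer performs essentially this registration internally, turning each callback invocation into a send on the ReceiveChannel.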
Signature (iOS - JSONFFIEngine Objective-C)
@interface JSONFFIEngine : NSObject
- (void)initBackgroundEngine:(void (^)(NSString*))streamCallback;
- (void)reload:(NSString*)engineConfig;
- (void)unload;
- (void)reset;
- (void)chatCompletion:(NSString*)requestJSON requestID:(NSString*)requestID;
- (void)abort:(NSString*)requestID;
- (void)runBackgroundLoop;
- (void)runBackgroundStreamBackLoop;
- (void)exitBackgroundLoop;
@end
I/O Contract
Inputs (MLCEngine.reload)
| Name | Type | Required | Description |
|---|---|---|---|
| modelPath | String | Yes | Local file path to the model weight directory on the device |
| modelLib | String | Yes | System library name identifying the compiled model library (e.g., "Llama-3.2-3B-Instruct-q4f16_1-MLC") |
Inputs (Completions.create)
| Name | Type | Required | Description |
|---|---|---|---|
| messages | List<ChatCompletionMessage> | Yes | List of conversation messages following the OpenAI Chat Completions format |
| model | String? | No | Model identifier (optional when a model is already loaded) |
| temperature | Float? | No | Sampling temperature (higher values produce more random output) |
| top_p | Float? | No | Nucleus sampling parameter |
| max_tokens | Int? | No | Maximum number of tokens to generate |
| stream | Boolean | No | Must be true (only streaming mode is supported on mobile) |
| stop | List<String>? | No | Stop sequences that halt generation |
| tools | List<ChatTool>? | No | Tool definitions for function calling |
Inputs (iOS chatCompletion)
| Name | Type | Required | Description |
|---|---|---|---|
| requestJSON | NSString* | Yes | JSON-serialized chat completion request following the OpenAI protocol |
| requestID | NSString* | Yes | Unique request identifier for routing streaming responses |
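Since requestJSON is an OpenAI-style payload serialized to a string, it can be assembled in any host language. A rough sketch of building one (hand-rolled string concatenation for illustration only; a real app should use a JSON library to handle escaping):

```java
public class Main {
    // Build a minimal OpenAI-style chat completion request payload.
    // Illustrative only: no escaping of user input is performed here.
    static String buildRequest(String userContent) {
        return "{\"messages\": [{\"role\": \"user\", \"content\": \""
                + userContent + "\"}], \"stream\": true}";
    }

    public static void main(String[] args) {
        String requestJSON = buildRequest("Hello!");
        // This string would be handed to the engine, e.g. on iOS:
        // [engine chatCompletion:requestJSON requestID:@"req-1"];
        System.out.println(requestJSON);
    }
}
```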
Outputs
| Name | Type | Description |
|---|---|---|
| ReceiveChannel<ChatCompletionStreamResponse> (Android) | Kotlin Channel | Asynchronous stream of partial chat completion responses, each containing a delta with generated token(s) |
| Stream callback (iOS) | Block (void (^)(NSString*)) | Callback invoked with JSON-serialized streaming response chunks |
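Each streamed chunk follows the OpenAI Chat Completions chunk format, so the generated text lives at choices[i].delta.content. A rough sketch of pulling it out of a chunk string (regex-based for illustration only; real code should deserialize the JSON with a proper parser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Extract delta.content from a streamed chunk. Regex-based for
    // illustration; it will not handle escaped quotes or nesting.
    static String extractDeltaContent(String chunkJSON) {
        Matcher m = Pattern.compile("\"content\"\\s*:\\s*\"([^\"]*)\"")
                           .matcher(chunkJSON);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String chunk = "{\"choices\": [{\"delta\": "
                + "{\"role\": \"assistant\", \"content\": \"Hi\"}}]}";
        System.out.println(extractDeltaContent(chunk)); // Hi
    }
}
```

On Android this extraction is already done for you by the typed ChatCompletionStreamResponse objects; on iOS the callback receives the raw JSON string shown here.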
Usage Examples
Android: Load and Chat
import ai.mlc.mlcllm.MLCEngine
import ai.mlc.mlcllm.OpenAIProtocol.*
import kotlinx.coroutines.runBlocking

// Create the engine (starts background threads automatically)
val engine = MLCEngine()

// Load a model
engine.reload(
    modelPath = "/data/local/tmp/Llama-3.2-3B-Instruct-q4f16_1-MLC",
    modelLib = "Llama-3.2-3B-Instruct-q4f16_1-MLC"
)

// Send a chat completion request
runBlocking {
    val channel = engine.chat.completions.create(
        messages = listOf(
            ChatCompletionMessage(
                role = "user",
                content = "What is machine learning?"
            )
        ),
        temperature = 0.7f,
        max_tokens = 256
    )

    // Consume the streaming response
    for (response in channel) {
        response.choices.forEach { choice ->
            choice.delta?.content?.let { token ->
                print(token)
            }
        }
    }
}

// Clean up
engine.unload()
iOS: Initialize and Reload
// Create the FFI engine
JSONFFIEngine *engine = [[JSONFFIEngine alloc] init];

// Initialize with a stream callback
[engine initBackgroundEngine:^(NSString *response) {
    // Handle streaming JSON response chunks
    NSLog(@"Stream response: %@", response);
}];

// Start the background loops (on separate threads)
dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
    [engine runBackgroundLoop];
});
dispatch_async(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
    [engine runBackgroundStreamBackLoop];
});

// Load a model
NSString *config = @"{\"model\": \"/path/to/model\", "
                    "\"model_lib\": \"system://model-lib-name\", "
                    "\"mode\": \"interactive\"}";
[engine reload:config];

// Send a chat completion request
NSString *request = @"{\"messages\": [{\"role\": \"user\", "
                     "\"content\": \"Hello!\"}], \"stream\": true}";
[engine chatCompletion:request requestID:@"unique-request-id"];

// Cancel an in-flight request if needed
[engine abort:@"unique-request-id"];

// Shut down when done
[engine unload];
[engine exitBackgroundLoop];