Implementation: mlc-ai/mlc-llm MLCEngine Mobile Bindings
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Mobile_Deployment |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tools for integrating compiled LLM inference engines into native mobile applications via platform-specific SDK bindings provided by MLC-LLM.
Description
MLC-LLM provides two platform-specific engine implementations that bridge the C++ inference runtime with native mobile application code:
Android (Kotlin): The MLCEngine class is the primary entry point for Android applications. It wraps the JSONFFIEngine Java class (which communicates with the C++ engine via TVM's JNI bridge) and provides a high-level, coroutine-based API following the OpenAI Chat Completions protocol. Upon initialization, it starts two background worker threads: one for the inference loop and one for the stream-back loop. The chat.completions.create() method accepts a ChatCompletionRequest and returns a Kotlin ReceiveChannel of streaming ChatCompletionStreamResponse objects.
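The two-loop design can be illustrated with a minimal, self-contained sketch. This is not MLC-LLM source code: plain Java threads and a BlockingQueue stand in for the C++ engine, the JNI bridge, and the Kotlin channel, purely to show how an inference loop and a stream-back loop cooperate.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the engine-internal stream of generated chunks
        BlockingQueue<String> streamBack = new LinkedBlockingQueue<>();

        // "Inference loop": produces streamed chunks for one request,
        // then a sentinel marking the end of the stream.
        Thread inferenceLoop = new Thread(() -> {
            for (String token : new String[] {"Hello", ", ", "world", "[DONE]"}) {
                streamBack.add(token);
            }
        });

        // "Stream-back loop": forwards chunks to the application side;
        // in MLCEngine this is where results reach the ReceiveChannel.
        StringBuilder received = new StringBuilder();
        Thread streamBackLoop = new Thread(() -> {
            try {
                String chunk;
                while (!(chunk = streamBack.take()).equals("[DONE]")) {
                    received.append(chunk);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        inferenceLoop.start();
        streamBackLoop.start();
        inferenceLoop.join();
        streamBackLoop.join();
        System.out.println(received); // Hello, world
    }
}
```

Keeping generation and delivery on separate workers is what lets the UI thread consume tokens as they arrive without blocking on inference.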
iOS (Objective-C): The JSONFFIEngine class is an Objective-C interface that exposes the C++ JSON FFI engine to Swift code. It provides methods for engine lifecycle management (initBackgroundEngine:, reload:, unload, reset), inference (chatCompletion:requestID:), request cancellation (abort:), and background loop management (runBackgroundLoop, runBackgroundStreamBackLoop, exitBackgroundLoop). Streaming results are delivered via a callback block registered during initBackgroundEngine:.
Usage
Use these APIs when:
- Building a new Android application with on-device LLM chat capabilities
- Building a new iOS application with on-device LLM inference
- Implementing custom UI flows around streaming text generation
- Integrating MLC-LLM into an existing mobile application as an AI feature
Code Reference
Source Location
- Repository: MLC-LLM
- File (Android MLCEngine): android/mlc4j/src/main/java/ai/mlc/mlcllm/MLCEngine.kt (Lines 24-74)
- File (Android JSONFFIEngine): android/mlc4j/src/main/java/ai/mlc/mlcllm/JSONFFIEngine.java (Lines 9-87)
- File (iOS JSONFFIEngine): ios/MLCSwift/Sources/ObjC/include/LLMEngine.h (Lines 12-32)
Signature (Android - MLCEngine Kotlin)
class MLCEngine {
    val chat: Chat
    fun reload(modelPath: String, modelLib: String)
    fun reset()
    fun unload()
}

class Chat(
    private val jsonFFIEngine: JSONFFIEngine,
    private val state: EngineState
) {
    val completions: Completions
}

class Completions(
    private val jsonFFIEngine: JSONFFIEngine,
    private val state: EngineState
) {
    suspend fun create(
        request: ChatCompletionRequest
    ): ReceiveChannel<ChatCompletionStreamResponse>

    suspend fun create(
        messages: List<ChatCompletionMessage>,
        model: String? = null,
        frequency_penalty: Float? = null,
        presence_penalty: Float? = null,
        logprobs: Boolean = false,
        top_logprobs: Int = 0,
        logit_bias: Map<Int, Float>? = null,
        max_tokens: Int? = null,
        n: Int = 1,
        seed: Int? = null,
        stop: List<String>? = null,
        stream: Boolean = true,
        stream_options: StreamOptions? = null,
        temperature: Float? = null,
        top_p: Float? = null,
        tools: List<ChatTool>? = null,
        user: String? = null,
        response_format: ResponseFormat? = null
    ): ReceiveChannel<ChatCompletionStreamResponse>
}
Signature (Android - JSONFFIEngine Java)
public class JSONFFIEngine {
    public JSONFFIEngine()
    public void initBackgroundEngine(KotlinFunction callback)
    public void reload(String engineConfigJSONStr)
    public void chatCompletion(String requestJSONStr, String requestId)
    public void runBackgroundLoop()
    public void runBackgroundStreamBackLoop()
    public void exitBackgroundLoop()
    public void unload()
    public void reset()

    public interface KotlinFunction {
        void invoke(String arg);
    }
}
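The KotlinFunction interface is the path by which streamed JSON chunks reach application code: the callback is registered once via initBackgroundEngine and then invoked per chunk. A minimal sketch of that wiring, where MockEngine is a hypothetical stand-in (the real JSONFFIEngine needs the native TVM runtime to run):

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    // Mirrors JSONFFIEngine.KotlinFunction: one String argument, no return.
    interface KotlinFunction {
        void invoke(String arg);
    }

    // Hypothetical stand-in for JSONFFIEngine; the real class forwards
    // chunks produced by the C++ engine across the JNI bridge.
    static class MockEngine {
        private KotlinFunction callback;

        void initBackgroundEngine(KotlinFunction cb) {
            this.callback = cb;
        }

        void chatCompletion(String requestJSON, String requestId) {
            // The real engine streams chunks asynchronously from its
            // stream-back loop; two fake chunks stand in here.
            callback.invoke("{\"id\": \"" + requestId + "\", \"chunk\": 1}");
            callback.invoke("{\"id\": \"" + requestId + "\", \"chunk\": 2}");
        }
    }

    public static void main(String[] args) {
        List<String> chunks = new ArrayList<>();
        MockEngine engine = new MockEngine();
        engine.initBackgroundEngine(chunks::add); // register the callback once
        engine.chatCompletion("{\"messages\": []}", "req-1");
        System.out.println(chunks.size()); // 2
    }
}
```

MLCEngine's Kotlin layer performs essentially this registration internally, turning each callback invocation into a send on the ReceiveChannel.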
Signature (iOS - JSONFFIEngine Objective-C)
@interface JSONFFIEngine : NSObject
- (void)initBackgroundEngine:(void (^)(NSString*))streamCallback;
- (void)reload:(NSString*)engineConfig;
- (void)unload;
- (void)reset;
- (void)chatCompletion:(NSString*)requestJSON requestID:(NSString*)requestID;
- (void)abort:(NSString*)requestID;
- (void)runBackgroundLoop;
- (void)runBackgroundStreamBackLoop;
- (void)exitBackgroundLoop;
@end
I/O Contract
Inputs (MLCEngine.reload)
| Name | Type | Required | Description |
|---|---|---|---|
| modelPath | String | Yes | Local file path to the model weight directory on the device |
| modelLib | String | Yes | System library name identifying the compiled model library (e.g., "Llama-3.2-3B-Instruct-q4f16_1-MLC") |
Inputs (Completions.create)
| Name | Type | Required | Description |
|---|---|---|---|
| messages | List<ChatCompletionMessage> | Yes | List of conversation messages following the OpenAI Chat Completions format |
| model | String? | No | Model identifier (optional when a model is already loaded) |
| temperature | Float? | No | Sampling temperature (higher values produce more random output) |
| top_p | Float? | No | Nucleus sampling parameter |
| max_tokens | Int? | No | Maximum number of tokens to generate |
| stream | Boolean | No | Must be true (only streaming mode is supported on mobile) |
| stop | List<String>? | No | Stop sequences that halt generation |
| tools | List<ChatTool>? | No | Tool definitions for function calling |
Inputs (iOS chatCompletion)
| Name | Type | Required | Description |
|---|---|---|---|
| requestJSON | NSString* | Yes | JSON-serialized chat completion request following the OpenAI protocol |
| requestID | NSString* | Yes | Unique request identifier for routing streaming responses |
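Since requestJSON is an OpenAI-style payload serialized to a string, it can be assembled in any host language. A rough sketch of building one (hand-rolled string concatenation for illustration only; a real app should use a JSON library to handle escaping):

```java
public class Main {
    // Build a minimal OpenAI-style chat completion request payload.
    // Illustrative only: no escaping of user input is performed here.
    static String buildRequest(String userContent) {
        return "{\"messages\": [{\"role\": \"user\", \"content\": \""
                + userContent + "\"}], \"stream\": true}";
    }

    public static void main(String[] args) {
        String requestJSON = buildRequest("Hello!");
        // This string would be handed to the engine, e.g. on iOS:
        // [engine chatCompletion:requestJSON requestID:@"req-1"];
        System.out.println(requestJSON);
    }
}
```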
Outputs
| Name | Type | Description |
|---|---|---|
| ReceiveChannel<ChatCompletionStreamResponse> (Android) | Kotlin Channel | Asynchronous stream of partial chat completion responses, each containing a delta with generated token(s) |
| Stream callback (iOS) | Block (void (^)(NSString*)) | Callback invoked with JSON-serialized streaming response chunks |
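Each streamed chunk follows the OpenAI Chat Completions chunk format, so the generated text lives at choices[i].delta.content. A rough sketch of pulling it out of a chunk string (regex-based for illustration only; real code should deserialize the JSON with a proper parser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Extract delta.content from a streamed chunk. Regex-based for
    // illustration; it will not handle escaped quotes or nesting.
    static String extractDeltaContent(String chunkJSON) {
        Matcher m = Pattern.compile("\"content\"\\s*:\\s*\"([^\"]*)\"")
                           .matcher(chunkJSON);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String chunk = "{\"choices\": [{\"delta\": "
                + "{\"role\": \"assistant\", \"content\": \"Hi\"}}]}";
        System.out.println(extractDeltaContent(chunk)); // Hi
    }
}
```

On Android this extraction is already done for you by the typed ChatCompletionStreamResponse objects; on iOS the callback receives the raw JSON string shown here.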
Usage Examples
Android: Load and Chat
import ai.mlc.mlcllm.MLCEngine
import ai.mlc.mlcllm.OpenAIProtocol.*
import kotlinx.coroutines.runBlocking

// Create the engine (starts background threads automatically)
val engine = MLCEngine()

// Load a model
engine.reload(
    modelPath = "/data/local/tmp/Llama-3.2-3B-Instruct-q4f16_1-MLC",
    modelLib = "Llama-3.2-3B-Instruct-q4f16_1-MLC"
)

// Send a chat completion request
runBlocking {
    val channel = engine.chat.completions.create(
        messages = listOf(
            ChatCompletionMessage(
                role = "user",
                content = "What is machine learning?"
            )
        ),
        temperature = 0.7f,
        max_tokens = 256
    )

    // Consume the streaming response
    for (response in channel) {
        response.choices.forEach { choice ->
            choice.delta?.content?.let { token ->
                print(token)
            }
        }
    }
}

// Clean up
engine.unload()
iOS: Initialize and Reload
// Create the FFI engine
JSONFFIEngine *engine = [[JSONFFIEngine alloc] init];

// Initialize with a stream callback
[engine initBackgroundEngine:^(NSString *response) {
    // Handle streaming JSON response chunks
    NSLog(@"Stream response: %@", response);
}];

// Start the background loops (on separate threads)
dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
    [engine runBackgroundLoop];
});
dispatch_async(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
    [engine runBackgroundStreamBackLoop];
});

// Load a model
NSString *config = @"{\"model\": \"/path/to/model\", "
                    "\"model_lib\": \"system://model-lib-name\", "
                    "\"mode\": \"interactive\"}";
[engine reload:config];

// Send a chat completion request
NSString *request = @"{\"messages\": [{\"role\": \"user\", "
                     "\"content\": \"Hello!\"}], \"stream\": true}";
[engine chatCompletion:request requestID:@"unique-request-id"];

// Cancel an in-flight request if needed
[engine abort:@"unique-request-id"];

// Shut down when done
[engine unload];
[engine exitBackgroundLoop];