
Implementation:Mlc ai Mlc llm MLCEngine Mobile

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Mobile_Deployment
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tools for integrating compiled LLM inference engines into native mobile applications via platform-specific SDK bindings provided by MLC-LLM.

Description

MLC-LLM provides two platform-specific engine implementations that bridge the C++ inference runtime with native mobile application code:

Android (Kotlin): The MLCEngine class is the primary entry point for Android applications. It wraps the JSONFFIEngine Java class (which communicates with the C++ engine via TVM's JNI bridge) and provides a high-level, coroutine-based API following the OpenAI Chat Completions protocol. Upon initialization, it starts two background worker threads: one for the inference loop and one for the stream-back loop. The chat.completions.create() method accepts a ChatCompletionRequest and returns a Kotlin ReceiveChannel of streaming ChatCompletionStreamResponse objects.

iOS (Objective-C): The JSONFFIEngine class is an Objective-C interface that exposes the C++ JSON FFI engine to Swift code. It provides methods for engine lifecycle management (initBackgroundEngine:, reload:, unload, reset), inference (chatCompletion:requestID:), request cancellation (abort:), and background loop management (runBackgroundLoop, runBackgroundStreamBackLoop, exitBackgroundLoop). Streaming results are delivered via a callback block registered during initBackgroundEngine:.

Usage

Use these APIs when:

  • Building a new Android application with on-device LLM chat capabilities
  • Building a new iOS application with on-device LLM inference
  • Implementing custom UI flows around streaming text generation
  • Integrating MLC-LLM into an existing mobile application as an AI feature

Code Reference

Source Location

  • Repository: MLC-LLM
  • File (Android MLCEngine): android/mlc4j/src/main/java/ai/mlc/mlcllm/MLCEngine.kt (Lines 24-74)
  • File (Android JSONFFIEngine): android/mlc4j/src/main/java/ai/mlc/mlcllm/JSONFFIEngine.java (Lines 9-87)
  • File (iOS JSONFFIEngine): ios/MLCSwift/Sources/ObjC/include/LLMEngine.h (Lines 12-32)

Signature (Android - MLCEngine Kotlin)

class MLCEngine {
    val chat: Chat

    fun reload(modelPath: String, modelLib: String)
    fun reset()
    fun unload()
}

class Chat(
    private val jsonFFIEngine: JSONFFIEngine,
    private val state: EngineState
) {
    val completions: Completions
}

class Completions(
    private val jsonFFIEngine: JSONFFIEngine,
    private val state: EngineState
) {
    suspend fun create(
        request: ChatCompletionRequest
    ): ReceiveChannel<ChatCompletionStreamResponse>

    suspend fun create(
        messages: List<ChatCompletionMessage>,
        model: String? = null,
        frequency_penalty: Float? = null,
        presence_penalty: Float? = null,
        logprobs: Boolean = false,
        top_logprobs: Int = 0,
        logit_bias: Map<Int, Float>? = null,
        max_tokens: Int? = null,
        n: Int = 1,
        seed: Int? = null,
        stop: List<String>? = null,
        stream: Boolean = true,
        stream_options: StreamOptions? = null,
        temperature: Float? = null,
        top_p: Float? = null,
        tools: List<ChatTool>? = null,
        user: String? = null,
        response_format: ResponseFormat? = null
    ): ReceiveChannel<ChatCompletionStreamResponse>
}

Signature (Android - JSONFFIEngine Java)

public class JSONFFIEngine {
    public JSONFFIEngine()
    public void initBackgroundEngine(KotlinFunction callback)
    public void reload(String engineConfigJSONStr)
    public void chatCompletion(String requestJSONStr, String requestId)
    public void runBackgroundLoop()
    public void runBackgroundStreamBackLoop()
    public void exitBackgroundLoop()
    public void unload()
    public void reset()

    public interface KotlinFunction {
        void invoke(String arg);
    }
}
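The request/stream-back flow behind these methods can be sketched with a simplified stand-in. This mock is illustrative only: MockFFIEngine and its queueing behavior are not part of MLC-LLM; it only mirrors the pattern of the signatures above, where requests are enqueued, a background loop drains them, and results flow back through the callback registered via initBackgroundEngine.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Mock illustrating the JSONFFIEngine threading pattern: chatCompletion()
// enqueues work, runBackgroundLoop() drains it on a worker thread, and
// results are delivered through the registered callback.
public class MockFFIEngine {
    interface KotlinFunction { void invoke(String arg); }

    private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();
    private volatile KotlinFunction callback;
    private volatile boolean running = true;

    void initBackgroundEngine(KotlinFunction cb) { this.callback = cb; }

    void chatCompletion(String requestJson, String requestId) {
        requests.add(requestId + ":" + requestJson);
    }

    void runBackgroundLoop() {
        try {
            while (running) {
                String req = requests.take();
                // A real engine would run inference here; we echo a chunk back.
                callback.invoke("{\"id\":\"" + req.split(":")[0] + "\"}");
            }
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    void exitBackgroundLoop() { running = false; /* real code also interrupts */ }

    public static void main(String[] args) throws Exception {
        MockFFIEngine engine = new MockFFIEngine();
        CountDownLatch done = new CountDownLatch(1);
        StringBuilder out = new StringBuilder();
        engine.initBackgroundEngine(s -> { out.append(s); done.countDown(); });
        Thread loop = new Thread(engine::runBackgroundLoop);
        loop.start();
        engine.chatCompletion("{\"stream\":true}", "req-1");
        done.await();
        loop.interrupt();
        System.out.println(out); // prints {"id":"req-1"}
    }
}
```

The real engine runs two such worker threads (inference and stream-back); the mock collapses them into one to keep the callback hand-off visible.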

Signature (iOS - JSONFFIEngine Objective-C)

@interface JSONFFIEngine : NSObject

- (void)initBackgroundEngine:(void (^)(NSString*))streamCallback;
- (void)reload:(NSString*)engineConfig;
- (void)unload;
- (void)reset;
- (void)chatCompletion:(NSString*)requestJSON requestID:(NSString*)requestID;
- (void)abort:(NSString*)requestID;
- (void)runBackgroundLoop;
- (void)runBackgroundStreamBackLoop;
- (void)exitBackgroundLoop;

@end

I/O Contract

Inputs (MLCEngine.reload)

Name Type Required Description
modelPath String Yes Local file path to the model weight directory on the device
modelLib String Yes System library name identifying the compiled model library (e.g., "Llama-3.2-3B-Instruct-q4f16_1-MLC")

Inputs (Completions.create)

Name Type Required Description
messages List<ChatCompletionMessage> Yes List of conversation messages following the OpenAI Chat Completions format
model String? No Model identifier (optional when a model is already loaded)
temperature Float? No Sampling temperature (higher values produce more random output)
top_p Float? No Nucleus sampling parameter
max_tokens Int? No Maximum number of tokens to generate
stream Boolean No Defaults to true and must remain true (only streaming mode is supported on mobile)
stop List<String>? No Stop sequences that halt generation
tools List<ChatTool>? No Tool definitions for function calling

Inputs (iOS chatCompletion)

Name Type Required Description
requestJSON NSString* Yes JSON-serialized chat completion request following the OpenAI protocol
requestID NSString* Yes Unique request identifier for routing streaming responses
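For illustration, a minimal payload of the kind serialized into requestJSON might look like the following (the field values are hypothetical; the schema follows the OpenAI Chat Completions protocol described above):

```json
{
  "messages": [
    { "role": "user", "content": "Hello!" }
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 256
}
```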

Outputs

  • Android: ReceiveChannel<ChatCompletionStreamResponse> (a Kotlin Channel) delivering an asynchronous stream of partial chat completion responses, each containing a delta with the generated token(s)
  • iOS: a stream callback block of type void (^)(NSString*), invoked with JSON-serialized streaming response chunks
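On either platform, the application ultimately pulls the generated text out of each chunk's delta. As a sketch, the following extracts delta content from a sample chunk in the OpenAI streaming format; the CHUNK value is a hypothetical example, and a real application should use a proper JSON parser rather than this naive regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamChunkDemo {
    // Hypothetical sample chunk in the OpenAI streaming format.
    static final String CHUNK =
        "{\"id\":\"req-1\",\"choices\":[{\"index\":0,"
        + "\"delta\":{\"role\":\"assistant\",\"content\":\"Hello\"}}]}";

    // Naive extraction of delta.content for illustration only.
    static String deltaContent(String chunkJson) {
        Matcher m = Pattern.compile("\"content\"\\s*:\\s*\"([^\"]*)\"")
                           .matcher(chunkJson);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(deltaContent(CHUNK)); // prints Hello
    }
}
```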

Usage Examples

Android: Load and Chat

import ai.mlc.mlcllm.MLCEngine
import ai.mlc.mlcllm.OpenAIProtocol.*
import kotlinx.coroutines.runBlocking

// Create the engine (starts background threads automatically)
val engine = MLCEngine()

// Load a model
engine.reload(
    modelPath = "/data/local/tmp/Llama-3.2-3B-Instruct-q4f16_1-MLC",
    modelLib = "Llama-3.2-3B-Instruct-q4f16_1-MLC"
)

// Send a chat completion request
runBlocking {
    val channel = engine.chat.completions.create(
        messages = listOf(
            ChatCompletionMessage(
                role = "user",
                content = "What is machine learning?"
            )
        ),
        temperature = 0.7f,
        max_tokens = 256
    )

    // Consume the streaming response
    for (response in channel) {
        response.choices.forEach { choice ->
            choice.delta?.content?.let { token ->
                print(token)
            }
        }
    }
}

// Clean up
engine.unload()

iOS: Initialize, Reload, and Chat

// Create the FFI engine
JSONFFIEngine *engine = [[JSONFFIEngine alloc] init];

// Initialize with a stream callback
[engine initBackgroundEngine:^(NSString *response) {
    // Handle streaming JSON response chunks
    NSLog(@"Stream response: %@", response);
}];

// Start the background loops (on separate threads)
dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
    [engine runBackgroundLoop];
});
dispatch_async(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
    [engine runBackgroundStreamBackLoop];
});

// Load a model
NSString *config = @"{\"model\": \"/path/to/model\", "
                    "\"model_lib\": \"system://model-lib-name\", "
                    "\"mode\": \"interactive\"}";
[engine reload:config];

// Send a chat completion request
NSString *request = @"{\"messages\": [{\"role\": \"user\", "
                     "\"content\": \"Hello!\"}], \"stream\": true}";
[engine chatCompletion:request requestID:@"unique-request-id"];

// Cancel an in-flight request by its ID if needed
[engine abort:@"unique-request-id"];

// Clean up when finished
[engine unload];
[engine exitBackgroundLoop];

Related Pages

Implements Principle

Environment and Heuristic Links
