Implementation:Ggml org Llama cpp Android InferenceEngineImpl
| Knowledge Sources | |
|---|---|
| Domains | Android, Inference |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Singleton JNI wrapper implementation of the `InferenceEngine` interface that manages the full lifecycle of a llama.cpp model instance on Android.
Description
The class has a private constructor; `getInstance(Context)` creates the singleton using thread-safe double-checked locking. It declares `@FastNative` external JNI methods (`init`, `load`, `prepare`, `systemInfo`, `benchModel`, `processSystemPrompt`, `processUserPrompt`, `generateTokens`, `cleanUp`, `destroy`) that map to native C++ functions in `ai_chat.cpp`. State transitions are tracked in a `MutableStateFlow<State>`, and native operations run on a dedicated single-threaded coroutine dispatcher for thread safety. Token generation is exposed as a Kotlin `Flow<String>`.
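The double-checked locking pattern named above can be sketched as follows. This is a minimal illustration, not the code from `InferenceEngineImpl.kt`: `DemoEngine` is a hypothetical stand-in, and it takes the library directory as a plain `String` instead of an Android `Context` so that it runs on any JVM.

```kotlin
// Hypothetical stand-in for the real engine; only the singleton mechanics are shown.
class DemoEngine private constructor(val nativeLibDir: String) {
    companion object {
        // @Volatile makes the published reference visible to all threads.
        @Volatile
        private var instance: DemoEngine? = null

        fun getInstance(nativeLibDir: String): DemoEngine =
            // First check: lock-free fast path once the instance exists.
            instance ?: synchronized(this) {
                // Second check: another thread may have created the instance
                // while this one was waiting for the lock.
                instance ?: DemoEngine(nativeLibDir).also { instance = it }
            }
    }
}

fun main() {
    val a = DemoEngine.getInstance("/data/app/lib")
    val b = DemoEngine.getInstance("/ignored/second/path")
    println(a === b)        // true: both calls return the same object
    println(b.nativeLibDir) // /data/app/lib: arguments of later calls are ignored
}
```

One consequence of this design is that only the first call's argument has any effect, which is why passing an application-wide `Context` is the natural choice for the real class.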
Usage
Use this class as the primary entry point for LLM inference on Android. Obtain the singleton via `getInstance(context)`, then load models, send prompts, and collect generated tokens as Kotlin Flow streams within coroutine scopes.
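To make the threading model concrete, here is a rough sketch of how a single-threaded dispatcher, a `MutableStateFlow`, and a token `Flow` can fit together. Everything in it is an assumption for illustration: `FakeEngine` and `DemoState` are hypothetical, canned tokens stand in for the native calls, and the real class may structure this differently. It requires the `kotlinx-coroutines-core` library.

```kotlin
import java.util.concurrent.Executors
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical states; the real State type is defined by InferenceEngine.
enum class DemoState { Idle, Loading, Ready, Generating }

// Hypothetical fake engine: canned tokens replace the native generation call.
class FakeEngine(private val cannedTokens: List<String>) {
    // One thread for all engine work, so "native" calls never overlap.
    private val dispatcher = Executors.newSingleThreadExecutor().asCoroutineDispatcher()

    private val _state = MutableStateFlow(DemoState.Idle)
    val state: StateFlow<DemoState> = _state

    suspend fun loadModel(modelPath: String) {
        withContext(dispatcher) {
            _state.value = DemoState.Loading
            // A real implementation would load modelPath through JNI here.
            _state.value = DemoState.Ready
        }
    }

    fun sendUserPrompt(prompt: String): Flow<String> = flow {
        _state.value = DemoState.Generating
        try {
            for (token in cannedTokens) emit(token)
        } finally {
            // Runs on normal completion and on collector cancellation.
            _state.value = DemoState.Ready
        }
    }.flowOn(dispatcher) // the flow body runs on the engine thread

    fun close() = dispatcher.close()
}

fun main(): Unit = runBlocking {
    val engine = FakeEngine(listOf("Par", "is"))
    try {
        engine.loadModel("/hypothetical/model.gguf")
        val answer = engine.sendUserPrompt("What is the capital of France?")
            .toList()
            .joinToString("")
        println(answer) // Paris
        println(engine.state.value)
    } finally {
        engine.close()
    }
}
```

Confining the work to one thread trades parallelism for simplicity: callers can invoke the API from any coroutine without adding their own locks around the native state.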
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: examples/llama.android/lib/src/main/java/com/arm/aichat/internal/InferenceEngineImpl.kt
- Lines: 1-324
Signature
internal class InferenceEngineImpl private constructor(
    private val nativeLibDir: String
) : InferenceEngine {

    companion object {
        internal fun getInstance(context: Context): InferenceEngine
    }

    // JNI native methods
    @FastNative external fun init(nativeLibDir: String)
    @FastNative external fun load(modelPath: String): Int
    @FastNative external fun prepare()
    @FastNative external fun systemInfo(): String
    @FastNative external fun benchModel(pp: Int, tg: Int, pl: Int, nr: Int): String
    @FastNative external fun processSystemPrompt(prompt: String)
    @FastNative external fun processUserPrompt(prompt: String)
    @FastNative external fun generateTokens(): String
    @FastNative external fun cleanUp()
    @FastNative external fun destroy()

    // Kotlin API
    suspend fun loadModel(modelPath: String)
    fun sendUserPrompt(prompt: String): Flow<String>
    val state: StateFlow<State>
}
Import
import android.content.Context
import com.arm.aichat.InferenceEngine
import dalvik.annotation.optimization.FastNative
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| context | Context | Yes | Android Context for obtaining the native library directory path |
| modelPath | String | Yes | Absolute path to the GGUF model file on the device |
| prompt | String | Yes | User prompt text to send for inference |
Outputs
| Name | Type | Description |
|---|---|---|
| state | StateFlow<State> | Observable state flow tracking engine lifecycle (Idle, Loading, Ready, Generating, etc.) |
| tokenFlow | Flow<String> | Kotlin Flow emitting generated tokens one at a time as strings |
| systemInfo | String | Backend and system capability information string |
Usage Examples
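// Obtain the singleton instance
val engine = InferenceEngineImpl.getInstance(applicationContext)

// loadModel is a suspend function and collect suspends, so both run inside a coroutine.
// `scope` is an assumed CoroutineScope owned by the caller; it is not part of this class.
scope.launch {
    // Load a model
    engine.loadModel("/data/local/tmp/model.gguf")

    // Send a user prompt and collect generated tokens
    engine.sendUserPrompt("What is the capital of France?")
        .collect { token ->
            print(token)
        }
}

// Observe state transitions in a separate coroutine: collecting a StateFlow never completes
scope.launch {
    engine.state.collect { state ->
        when (state) {
            is State.Idle -> { /* ready for loading */ }
            is State.Ready -> { /* model loaded, ready for prompts */ }
            is State.Generating -> { /* currently generating */ }
            else -> { /* remaining lifecycle states */ }
        }
    }
}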