Implementation:Ggml org Llama cpp Android InferenceEngine
| Knowledge Sources | |
|---|---|
| Domains | Android, API |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Defines the public interface for the LLM inference engine, specifying all operations and state transitions for model loading, prompt processing, and token generation.
Description
Declares an `InferenceEngine` interface with methods: `loadModel` to load a GGUF model, `setSystemPrompt` for system instructions, `sendUserPrompt` returning a `Flow<String>` of generated tokens, `bench` for benchmarking, `cleanUp` to unload models, and `destroy` for full cleanup. Uses a sealed class `State` hierarchy with states like Uninitialized, Initializing, LoadingModel, ModelReady, Generating, and Error. Extension properties `isUninterruptible` and `isModelLoaded` provide convenient state checks.
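The two extension properties can be pictured with the minimal sketch below. It mirrors the `State` hierarchy locally so it compiles on its own. Which states count as "loaded" or "uninterruptible" is an assumption made for illustration; the actual definitions in `InferenceEngine.kt` may group the states differently.

```kotlin
// Local mirror of the State hierarchy so the sketch is self-contained.
sealed class State {
    object Uninitialized : State()
    object Initializing : State()
    object Initialized : State()
    object LoadingModel : State()
    object UnloadingModel : State()
    object ModelReady : State()
    object Benchmarking : State()
    object ProcessingSystemPrompt : State()
    object ProcessingUserPrompt : State()
    object Generating : State()
    data class Error(val exception: Exception) : State()
}

// ASSUMPTION: a model is treated as loaded from ModelReady onward,
// including every state that can only be reached with a model in memory.
val State.isModelLoaded: Boolean
    get() = when (this) {
        State.ModelReady, State.Benchmarking, State.ProcessingSystemPrompt,
        State.ProcessingUserPrompt, State.Generating -> true
        else -> false
    }

// ASSUMPTION: transitional states that should run to completion are
// treated as uninterruptible; idle, generating, and error states are not.
val State.isUninterruptible: Boolean
    get() = when (this) {
        State.Initializing, State.LoadingModel, State.UnloadingModel,
        State.Benchmarking, State.ProcessingSystemPrompt,
        State.ProcessingUserPrompt -> true
        else -> false
    }

fun main() {
    println(State.ModelReady.isModelLoaded)        // true
    println(State.LoadingModel.isUninterruptible)  // true
    println(State.Error(IllegalStateException("boom")).isModelLoaded) // false
}
```

Expressing the checks as extension properties keeps the sealed class itself free of policy, so callers can write `state.isModelLoaded` instead of repeating a `when` over every state.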
Usage
Use this interface as the core abstraction layer of the Android AI Chat library. It decouples the public API from the JNI implementation, which keeps the architecture clean and lets callers test against a substitute engine, and it defines the complete lifecycle contract for LLM inference operations.
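To illustrate the testability point, here is a hedged sketch of a test double. `ChatEngine`, `EngineState`, and `FakeChatEngine` are hypothetical, trimmed stand-ins introduced only so the example compiles without the library; they are not part of the real API. The sketch assumes `kotlinx-coroutines-core` is on the classpath.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

// Hypothetical, trimmed mirror of the state hierarchy (illustration only).
sealed class EngineState {
    object Initialized : EngineState()
    object ModelReady : EngineState()
    object Generating : EngineState()
}

// Hypothetical, trimmed mirror of the engine interface (illustration only).
interface ChatEngine {
    val state: StateFlow<EngineState>
    suspend fun loadModel(pathToModel: String)
    fun sendUserPrompt(message: String): Flow<String>
}

// Test double: no JNI and no model file; it replays canned tokens.
class FakeChatEngine(private val cannedTokens: List<String>) : ChatEngine {
    private val _state = MutableStateFlow<EngineState>(EngineState.Initialized)
    override val state: StateFlow<EngineState> = _state.asStateFlow()

    override suspend fun loadModel(pathToModel: String) {
        _state.value = EngineState.ModelReady
    }

    override fun sendUserPrompt(message: String): Flow<String> = flow {
        check(_state.value is EngineState.ModelReady) { "Model not loaded" }
        _state.value = EngineState.Generating
        try {
            cannedTokens.forEach { emit(it) }
        } finally {
            // Return to the idle state even if collection is cancelled.
            _state.value = EngineState.ModelReady
        }
    }
}

fun main(): Unit = runBlocking {
    val engine: ChatEngine = FakeChatEngine(listOf("Hi", " there"))
    engine.loadModel("/hypothetical/model.gguf")
    val reply = engine.sendUserPrompt("Hello").toList().joinToString("")
    println(reply) // Hi there
    println(engine.state.value is EngineState.ModelReady) // true
}
```

Code that depends only on the interface can be exercised with such a fake on a plain JVM, without loading native libraries or a GGUF file.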
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: examples/llama.android/lib/src/main/java/com/arm/aichat/InferenceEngine.kt
- Lines: 1-89
Signature
interface InferenceEngine {
    val state: StateFlow<State>

    suspend fun loadModel(pathToModel: String)
    suspend fun setSystemPrompt(systemPrompt: String)
    fun sendUserPrompt(message: String, predictLength: Int = DEFAULT_PREDICT_LENGTH): Flow<String>
    suspend fun bench(pp: Int, tg: Int, pl: Int, nr: Int = 1): String
    fun cleanUp()
    fun destroy()

    sealed class State {
        object Uninitialized : State()
        object Initializing : State()
        object Initialized : State()
        object LoadingModel : State()
        object UnloadingModel : State()
        object ModelReady : State()
        object Benchmarking : State()
        object ProcessingSystemPrompt : State()
        object ProcessingUserPrompt : State()
        object Generating : State()
        data class Error(val exception: Exception) : State()
    }
}

val State.isUninterruptible: Boolean
val State.isModelLoaded: Boolean

class UnsupportedArchitectureException : Exception()
Import
import com.arm.aichat.InferenceEngine
import com.arm.aichat.InferenceEngine.State
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.StateFlow
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pathToModel | String | Yes | Filesystem path to the GGUF model file |
| systemPrompt | String | Yes | System prompt to configure model behavior |
| message | String | Yes | User prompt message to send to the model |
| predictLength | Int | No | Maximum number of tokens to generate (default 1024) |
| pp | Int | Yes | Number of prompt tokens to process in the benchmark |
| tg | Int | Yes | Number of tokens to generate in the benchmark |
| pl | Int | Yes | Number of parallel sequences in the benchmark |
| nr | Int | No | Number of benchmark repetitions (default 1) |
Outputs
| Name | Type | Description |
|---|---|---|
| state | StateFlow<State> | Observable state flow representing the engine lifecycle state |
| sendUserPrompt return | Flow<String> | Stream of generated token strings |
| bench return | String | Formatted benchmark results string |
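Because `sendUserPrompt` returns a cold `Flow<String>`, a caller that wants the whole reply rather than a live stream can fold the tokens into one string. The sketch below shows that pattern; `collectResponse` is a hypothetical helper, and `flowOf(...)` stands in for a real engine call, so the tokens are not model output. It assumes `kotlinx-coroutines-core` is on the classpath.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.runBlocking

// Hypothetical helper: concatenates every emitted token into one string.
suspend fun collectResponse(tokens: Flow<String>): String {
    val response = StringBuilder()
    tokens.collect { token -> response.append(token) }
    return response.toString()
}

fun main(): Unit = runBlocking {
    // Stand-in for engine.sendUserPrompt("..."); not real model output.
    val fakeTokens = flowOf("Hel", "lo", ", ", "world", "!")
    println(collectResponse(fakeTokens)) // Hello, world!
}
```

Collecting incrementally, as in the streaming usage example, suits chat UIs; accumulating is simpler when only the final text matters, for instance in tests or batch jobs.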
Usage Examples
// Load a model and generate tokens.
// loadModel, setSystemPrompt, and collect are suspending calls,
// so this snippet must run inside a coroutine.
val engine: InferenceEngine = AiChat.getInferenceEngine(context)
engine.loadModel("/path/to/model.gguf")
engine.setSystemPrompt("You are a helpful assistant.")
engine.sendUserPrompt("Hello, world!")
    .collect { token ->
        print(token)  // tokens arrive one at a time as they are generated
    }

// Check engine state
if (engine.state.value.isModelLoaded) {
    // Model is ready for inference
}