Implementation:Ggml org Llama cpp Android InferenceEngine

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Android, API
Last Updated	2026-02-15 00:00 GMT

Overview

Defines the public interface for the LLM inference engine, specifying the operations and lifecycle states for model loading, prompt processing, and token generation.

Description

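Declares an `InferenceEngine` interface with the following methods: `loadModel` loads a GGUF model from a filesystem path, `setSystemPrompt` supplies system instructions, `sendUserPrompt` returns a `Flow<String>` of generated tokens, `bench` runs a benchmark and returns the results as a string, `cleanUp` unloads the current model, and `destroy` performs full cleanup. `loadModel`, `setSystemPrompt`, and `bench` are `suspend` functions, so they must be called from a coroutine. The engine exposes its current lifecycle state as a `StateFlow<State>`, where `State` is a sealed class with the members `Uninitialized`, `Initializing`, `Initialized`, `LoadingModel`, `UnloadingModel`, `ModelReady`, `Benchmarking`, `ProcessingSystemPrompt`, `ProcessingUserPrompt`, `Generating`, and `Error`, the last of which wraps the causing exception. The extension properties `isUninterruptible` and `isModelLoaded` provide convenient checks on a `State` value.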
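The page lists the two extension properties only by signature, so the following is a minimal sketch of how such state checks could be written. The `State` class here is a local stand-in that mirrors the hierarchy in the Signature section so that the snippet compiles on its own. Which states each property treats as true is an assumption made for illustration, not a statement of what the library does.

```kotlin
// Local stand-in for InferenceEngine.State so the sketch is self-contained.
sealed class State {
    object Uninitialized : State()
    object Initializing : State()
    object Initialized : State()
    object LoadingModel : State()
    object UnloadingModel : State()
    object ModelReady : State()
    object Benchmarking : State()
    object ProcessingSystemPrompt : State()
    object ProcessingUserPrompt : State()
    object Generating : State()
    data class Error(val exception: Exception) : State()
}

// ASSUMPTION: the set of states below is a guess at what "uninterruptible"
// means (work that should not be cancelled midway). Check the source file
// for the actual definition.
val State.isUninterruptible: Boolean
    get() = when (this) {
        State.Initializing,
        State.LoadingModel,
        State.UnloadingModel,
        State.Benchmarking,
        State.ProcessingSystemPrompt,
        State.ProcessingUserPrompt -> true
        else -> false
    }

// ASSUMPTION: a model is considered loaded in every state that can only be
// reached after a successful load and before an unload.
val State.isModelLoaded: Boolean
    get() = when (this) {
        State.ModelReady,
        State.Benchmarking,
        State.ProcessingSystemPrompt,
        State.ProcessingUserPrompt,
        State.Generating -> true
        else -> false
    }
```

Keeping these checks as extension properties means the sealed class stays a plain list of states, while callers can write `state.isModelLoaded` instead of repeating a chain of type checks.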

Usage

Use this interface as the core abstraction layer of the Android AI Chat library. It decouples the public API from the JNI implementation, which keeps the architecture clean and lets callers be tested without native code, and it defines the complete lifecycle contract for LLM inference operations.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: examples/llama.android/lib/src/main/java/com/arm/aichat/InferenceEngine.kt
Lines: 1-89

Signature

interface InferenceEngine {
    val state: StateFlow<State>
    suspend fun loadModel(pathToModel: String)
    suspend fun setSystemPrompt(systemPrompt: String)
    fun sendUserPrompt(message: String, predictLength: Int = DEFAULT_PREDICT_LENGTH): Flow<String>
    suspend fun bench(pp: Int, tg: Int, pl: Int, nr: Int = 1): String
    fun cleanUp()
    fun destroy()

    sealed class State {
        object Uninitialized : State()
        object Initializing : State()
        object Initialized : State()
        object LoadingModel : State()
        object UnloadingModel : State()
        object ModelReady : State()
        object Benchmarking : State()
        object ProcessingSystemPrompt : State()
        object ProcessingUserPrompt : State()
        object Generating : State()
        data class Error(val exception: Exception) : State()
    }
}

val State.isUninterruptible: Boolean
val State.isModelLoaded: Boolean
class UnsupportedArchitectureException : Exception()

Import

import com.arm.aichat.InferenceEngine.State
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.StateFlow

I/O Contract

Inputs

Name	Type	Required	Description
pathToModel	String	Yes	Filesystem path to the GGUF model file
systemPrompt	String	Yes	System prompt to configure model behavior
message	String	Yes	User prompt message to send to the model
predictLength	Int	No	Maximum number of tokens to generate (default 1024)
pp	Int	Yes	Number of prompt tokens to process in the benchmark
tg	Int	Yes	Number of tokens to generate in the benchmark
pl	Int	Yes	Number of parallel sequences in the benchmark
nr	Int	No	Number of benchmark repetitions (default 1)
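Because `pathToModel` must point at a GGUF file, a caller may want to run a cheap sanity check before handing the path to `loadModel`. The sketch below relies on the fact that GGUF files begin with the four ASCII bytes `GGUF`. The helper name `looksLikeGguf` is hypothetical and not part of the library, and passing this check does not guarantee that the model loads or that its architecture is supported.

```kotlin
import java.io.File

// Hypothetical helper, not part of the library: returns true when the file
// exists and starts with the 4-byte ASCII magic "GGUF".
fun looksLikeGguf(file: File): Boolean {
    if (!file.isFile || file.length() < 4) return false
    val magic = ByteArray(4)
    file.inputStream().use { input ->
        // read() may return fewer bytes than requested, so loop until filled.
        var offset = 0
        while (offset < magic.size) {
            val n = input.read(magic, offset, magic.size - offset)
            if (n < 0) return false
            offset += n
        }
    }
    return magic.contentEquals("GGUF".toByteArray(Charsets.US_ASCII))
}
```

A caller could then skip `loadModel` and show an error early when `looksLikeGguf(File(pathToModel))` returns false, instead of waiting for the native loader to fail.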

Outputs

Name	Type	Description
state	StateFlow<State>	Observable state flow representing the engine lifecycle state
sendUserPrompt return	Flow<String>	Stream of generated token strings
bench return	String	Formatted benchmark results string
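Since `state` is a sealed hierarchy, a consumer can map every value to something user-visible with an exhaustive `when`, and the compiler will flag the mapping if a state is ever added. The following is a sketch under stated assumptions: `State` is again a local stand-in mirroring the Signature section, and `statusLabel` and its label strings are hypothetical names chosen for illustration.

```kotlin
// Local stand-in for InferenceEngine.State so the sketch is self-contained.
sealed class State {
    object Uninitialized : State()
    object Initializing : State()
    object Initialized : State()
    object LoadingModel : State()
    object UnloadingModel : State()
    object ModelReady : State()
    object Benchmarking : State()
    object ProcessingSystemPrompt : State()
    object ProcessingUserPrompt : State()
    object Generating : State()
    data class Error(val exception: Exception) : State()
}

// Hypothetical helper: turns an engine state into a short status string.
// No `else` branch is used, so the `when` must cover every subclass.
fun statusLabel(state: State): String = when (state) {
    is State.Uninitialized, is State.Initializing -> "Starting engine"
    is State.Initialized -> "No model loaded"
    is State.LoadingModel -> "Loading model"
    is State.UnloadingModel -> "Unloading model"
    is State.ModelReady -> "Ready"
    is State.Benchmarking -> "Benchmarking"
    is State.ProcessingSystemPrompt, is State.ProcessingUserPrompt -> "Processing prompt"
    is State.Generating -> "Generating"
    is State.Error -> "Error: ${state.exception.message ?: "unknown"}"
}
```

In an app, a function like this would typically be applied to each value collected from `engine.state`, so the UI layer never needs to know about individual state classes.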

Usage Examples

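// Load a model and generate tokens.
// loadModel, setSystemPrompt, and Flow.collect are suspending calls, so they
// must run inside a coroutine. `scope` stands for any CoroutineScope the
// caller owns (for example lifecycleScope in an Activity); launch comes from
// kotlinx.coroutines.launch.
val engine: InferenceEngine = AiChat.getInferenceEngine(context)
scope.launch {
    engine.loadModel("/path/to/model.gguf")
    engine.setSystemPrompt("You are a helpful assistant.")

    // Tokens arrive one at a time as the model generates them.
    engine.sendUserPrompt("Hello, world!")
        .collect { token ->
            print(token)
        }
}

// Check engine state (requires importing com.arm.aichat.isModelLoaded)
if (engine.state.value.isModelLoaded) {
    // A model is currently loaded
}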

Related Pages

Principle:Ggml_org_Llama_cpp_Android_Integration
