Implementation:Ggml org Llama cpp Android InferenceEngineImpl

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Android, Inference
Last Updated	2026-02-15 00:00 GMT

Overview

Singleton JNI wrapper implementation of the `InferenceEngine` interface that manages the full lifecycle of a llama.cpp model instance on Android.

Description

`InferenceEngineImpl` has a private constructor; `getInstance(Context)` creates the single instance using thread-safe double-checked locking. The class declares `@FastNative` external JNI methods (`init`, `load`, `prepare`, `systemInfo`, `benchModel`, `processSystemPrompt`, `processUserPrompt`, `generateTokens`, `cleanUp`, `destroy`) that map to native C++ functions in `ai_chat.cpp`. It tracks state transitions in a `MutableStateFlow<State>` and runs native operations on a dedicated single-threaded coroutine dispatcher, so calls into the native layer are serialized. Token generation is exposed as a Kotlin `Flow<String>`.
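
The double-checked locking mentioned above can be sketched as follows. This is a minimal illustration, not the code from `InferenceEngineImpl.kt`: `EngineSketch` is a hypothetical stand-in class, and it takes the library directory as a plain `String` instead of deriving it from an Android `Context`.

```kotlin
// Hypothetical stand-in for the real engine; only the singleton pattern is illustrated.
class EngineSketch private constructor(val nativeLibDir: String) {
    companion object {
        // @Volatile ensures a fully constructed instance is visible to all threads.
        @Volatile
        private var instance: EngineSketch? = null

        fun getInstance(nativeLibDir: String): EngineSketch =
            // First check: no locking once the instance exists.
            instance ?: synchronized(this) {
                // Second check: another thread may have created it while we waited.
                instance ?: EngineSketch(nativeLibDir).also { instance = it }
            }
    }
}
```

In this sketch a second call to `getInstance` returns the object created by the first call, so arguments passed on later calls are ignored.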

Usage

Use this class as the primary entry point for LLM inference on Android. Obtain the singleton via `getInstance(context)`, then load a model, send prompts, and collect the generated tokens from the returned Kotlin `Flow` inside a coroutine.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: examples/llama.android/lib/src/main/java/com/arm/aichat/internal/InferenceEngineImpl.kt
Lines: 1-324

Signature

internal class InferenceEngineImpl private constructor(
    private val nativeLibDir: String
) : InferenceEngine {

    companion object {
        internal fun getInstance(context: Context): InferenceEngine
    }

    // JNI native methods
    @FastNative external fun init(nativeLibDir: String)
    @FastNative external fun load(modelPath: String): Int
    @FastNative external fun prepare()
    @FastNative external fun systemInfo(): String
    @FastNative external fun benchModel(pp: Int, tg: Int, pl: Int, nr: Int): String
    @FastNative external fun processSystemPrompt(prompt: String)
    @FastNative external fun processUserPrompt(prompt: String)
    @FastNative external fun generateTokens(): String
    @FastNative external fun cleanUp()
    @FastNative external fun destroy()

    // Kotlin API
    suspend fun loadModel(modelPath: String)
    fun sendUserPrompt(prompt: String): Flow<String>
    val state: StateFlow<State>
}
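
The Description states that native operations run on a dedicated single-threaded dispatcher. The sketch below shows the underlying idea with a plain `java.util.concurrent` executor rather than coroutines. `SerializedRunner` and its members are hypothetical names, and the counter merely stands in for native state that is not thread-safe.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors

// Hypothetical wrapper: every call is funneled through one worker thread,
// so state that is not thread-safe is never touched concurrently.
class SerializedRunner {
    private val worker: ExecutorService = Executors.newSingleThreadExecutor()
    private var callCount = 0 // only read or written on the worker thread

    // Run a block on the worker thread and wait for its result.
    fun <T> execute(block: () -> T): T =
        worker.submit(Callable { callCount++; block() }).get()

    // Read the counter on the worker thread as well.
    fun calls(): Int = worker.submit(Callable { callCount }).get()

    fun shutdown() = worker.shutdown()
}
```

With kotlinx.coroutines, a single-thread executor like this one can be turned into a coroutine dispatcher via `asCoroutineDispatcher()`, which lets callers suspend instead of blocking on `get()`.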

Import

import android.content.Context
import com.arm.aichat.InferenceEngine
import dalvik.annotation.optimization.FastNative
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow

I/O Contract

Inputs

Name	Type	Required	Description
context	Context	Yes	Android Context for obtaining the native library directory path
modelPath	String	Yes	Absolute path to the GGUF model file on the device
prompt	String	Yes	User prompt text to send for inference

Outputs

Name	Type	Description
state	StateFlow<State>	Observable state flow tracking engine lifecycle (Idle, Loading, Ready, Generating, etc.)
tokenFlow	Flow<String>	Kotlin Flow returned by `sendUserPrompt`, emitting generated tokens one at a time as strings
systemInfo	String	Backend and system capability information string
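
To illustrate how an observable lifecycle state can guard operations, here is a small sketch that uses only the Kotlin standard library. The state names are taken from the table above; `StateHolder`, its listener mechanism, and the allowed transitions are assumptions made for illustration and are not taken from the source file.

```kotlin
// State names follow the Outputs table; the transition rules below are assumptions.
sealed class State {
    object Idle : State()
    object Loading : State()
    object Ready : State()
    object Generating : State()
}

// Hypothetical, simplified stand-in for a MutableStateFlow<State>.
class StateHolder {
    private val listeners = mutableListOf<(State) -> Unit>()
    var current: State = State.Idle
        private set

    fun observe(listener: (State) -> Unit) {
        listeners += listener
        listener(current) // replay the current value, as StateFlow does for new collectors
    }

    fun transition(next: State) {
        val allowed = when (current) {
            is State.Idle -> next == State.Loading
            is State.Loading -> next == State.Ready || next == State.Idle
            is State.Ready -> next == State.Generating || next == State.Idle
            is State.Generating -> next == State.Ready
        }
        check(allowed) { "Illegal transition: $current -> $next" }
        current = next
        listeners.forEach { it(next) }
    }
}
```

Rejecting illegal transitions early keeps callers from, for example, requesting generation before a model has finished loading.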

Usage Examples

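// loadModel and collect are suspending calls, so they must run inside a coroutine.
// `scope` is a placeholder for any CoroutineScope available to the caller.

// Obtain singleton instance
val engine = InferenceEngineImpl.getInstance(applicationContext)

// Observe state transitions in a separate coroutine; collecting a StateFlow never completes
scope.launch {
    engine.state.collect { state ->
        when (state) {
            is State.Idle -> { /* ready for loading */ }
            is State.Ready -> { /* model loaded, ready for prompts */ }
            is State.Generating -> { /* currently generating */ }
            else -> { /* other lifecycle states */ }
        }
    }
}

scope.launch {
    // Load a model
    engine.loadModel("/data/local/tmp/model.gguf")

    // Send a user prompt and collect generated tokens
    engine.sendUserPrompt("What is the capital of France?")
        .collect { token ->
            print(token)
        }
}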
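
The token-by-token streaming used above can be pictured as a loop that repeatedly asks a generator for the next token until none is left. The sketch below uses a standard-library `Sequence` in place of a coroutine `Flow`. `tokenStream` and its `nextToken` parameter are hypothetical, and the assumption that a `null` result marks the end of generation is illustrative only.

```kotlin
// Hypothetical token source standing in for the native generation call.
// Assumption: a null result marks the end of generation.
fun tokenStream(nextToken: () -> String?): Sequence<String> = sequence {
    while (true) {
        val token = nextToken() ?: break
        yield(token) // hand one token to the consumer, then pause until the next is requested
    }
}
```

The coroutine analogue is the `flow { ... }` builder with `emit(token)` in place of `yield(token)`, which additionally allows the producer and the collector to suspend.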

Related Pages

Principle:Ggml_org_Llama_cpp_Android_Inference
