Implementation:Neuml Txtai GGML ANN

Knowledge Sources	Neuml_Txtai
Domains	Vector_Search, GPU_Computing
Last Updated	2026-02-09 17:00 GMT

Overview

GGML is an approximate nearest neighbor (ANN) index that stores embeddings as GGML tensors, with optional GPU acceleration and optional quantization for reduced memory usage and faster computation.

Description

The GGML class inherits from ANN and provides a vector similarity search backend powered by the GGML tensor library. It builds matrix multiplication compute graphs on a GPU or CPU backend to score batches of queries against the stored embeddings; the resulting dot products equal cosine similarity when the vectors are normalized. The class delegates tensor management to a companion GGMLTensors helper class (line 101 of the source file), which handles GGML context creation, backend initialization, tensor allocation, quantization, and GGUF file serialization.

GGML supports multiple tensor storage formats (Q4_0, Q8_0, F32, etc.) through the quantize configuration parameter. The quantized formats trade a small loss of precision for significant memory savings. The backend automatically falls back from GPU to CPU when no accelerated backend is available.
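
To make the memory and precision trade-off concrete, the following is a simplified NumPy illustration of Q8_0-style block quantization (blocks of 32 values, one 16-bit scale per block, 8-bit integers for the values). It is a sketch for intuition only, not the code GGML or txtai runs; the function names `quantize_q8_0` and `dequantize_q8_0` are hypothetical.

```python
import numpy as np

BLOCK = 32  # Q8_0 groups values into blocks of 32


def quantize_q8_0(x):
    """Quantize a float32 vector (length divisible by 32) into int8 blocks with one scale each."""
    blocks = x.reshape(-1, BLOCK)
    # One scale per block so that the largest magnitude maps to 127
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    quants = np.round(blocks / scales).astype(np.int8)
    return quants, scales.astype(np.float16)


def dequantize_q8_0(quants, scales):
    """Reconstruct approximate float32 values from int8 blocks and their scales."""
    return (quants.astype(np.float32) * scales.astype(np.float32)).reshape(-1)


rng = np.random.default_rng(0)
vector = rng.random(384, dtype=np.float32)

quants, scales = quantize_q8_0(vector)
restored = dequantize_q8_0(quants, scales)

original_bytes = vector.nbytes                   # 384 values * 4 bytes
quantized_bytes = quants.nbytes + scales.nbytes  # 384 * 1 byte + 12 scales * 2 bytes
max_error = float(np.abs(vector - restored).max())
```

In this sketch a 384-dimension vector shrinks from 1536 bytes to 408 bytes, while each restored value stays within roughly half a quantization step of the original.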

Usage

Use the GGML backend when you need a self-contained, file-based ANN index with optional GPU acceleration and quantization. It is well-suited for environments where PostgreSQL or SQLite are not available, and where you want to leverage GGML's efficient tensor operations and GGUF file format for persistence. Configure quantize to reduce memory footprint for large embedding collections.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/ann/dense/ggml.py
Lines: 1-570

Signature

class GGML(ANN):
    """
    Builds an ANN index backed by GGML.
    """

    def __init__(self, config):
        super().__init__(config)

        if not LIBGGML:
            raise ImportError('GGML is not available - install "ann" extra to enable')

Import

from txtai.ann.dense import GGML

I/O Contract

Inputs

Name	Type	Required	Description
config	dict	Yes	Index configuration dictionary containing the backend name, dimensions, and optional GGML settings such as `gpu` (bool), `querysize` (int), and `quantize` (bool/int/str); the usage example nests these under a `ggml` key

Outputs

Name	Type	Description
self.backend	GGMLTensors	Internal tensor manager handling GGML context, buffers, and compute graph
self.config	dict	Updated configuration with `offset`, `build`, and `update` metadata after indexing

Key Methods

load(self, path)

Creates a new GGMLTensors backend and loads an existing GGUF file from the given path. The tensors, queries buffer, and compute graph are restored from the serialized data.

index(self, embeddings)

Creates a new GGMLTensors backend and indexes the provided embeddings numpy array. Sets the offset in config to the number of indexed embeddings and records build metadata including the GGML version.

append(self, embeddings)

Appends new embeddings to the existing tensor data by merging old and new tensors into a fresh GGML context. Updates the offset in config accordingly.
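
The merge-and-update behavior described for append can be pictured with plain NumPy arrays standing in for GGML tensors. This is a minimal sketch under that assumption; `append_rows` is a hypothetical helper, and the real method rebuilds a GGML context rather than concatenating arrays.

```python
import numpy as np


def append_rows(existing, new, config):
    """Return a merged array and record the new row count under config["offset"]."""
    if existing.shape[1] != new.shape[1]:
        raise ValueError("embedding dimensions must match")
    # New rows are placed after the existing ones, so their ids start at the old offset
    merged = np.concatenate([existing, new], axis=0)
    config["offset"] = merged.shape[0]
    return merged


config = {"offset": 1000}
existing = np.zeros((1000, 384), dtype=np.float32)
new = np.ones((100, 384), dtype=np.float32)

merged = append_rows(existing, new, config)
```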

delete(self, ids)

Marks the given ids as deleted. Deleted rows are zeroed out during search rather than physically removed from the tensor.
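
One way to picture this soft-delete approach is a set of deleted ids that is applied to the score matrix at search time. The sketch below assumes that "zeroed out" means the scores of deleted rows are set to zero; the `DeleteMask` class is hypothetical and is not part of txtai.

```python
import numpy as np


class DeleteMask:
    """Tracks deleted row ids without removing rows from the underlying data."""

    def __init__(self, total):
        self.total = total    # number of rows stored in the index
        self.deletes = set()  # ids marked as deleted

    def delete(self, ids):
        self.deletes.update(int(i) for i in ids)

    def apply(self, scores):
        """Return a copy of scores (queries x rows) with deleted columns set to zero."""
        masked = scores.copy()
        if self.deletes:
            masked[:, sorted(self.deletes)] = 0.0
        return masked

    def count(self):
        """Number of rows that have not been marked as deleted."""
        return self.total - len(self.deletes)


mask = DeleteMask(total=5)
mask.delete([1, 3])

scores = np.ones((2, 5), dtype=np.float32)
masked = mask.apply(scores)
active = mask.count()
```

Keeping the rows in place avoids rebuilding the tensor on every delete, at the cost of still computing scores for rows that are then discarded.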

search(self, queries, limit)

Runs batched matrix multiplication of queries against the stored embeddings tensor. Each query's results are sorted by descending score and truncated to the top limit entries, so the return value is one list of (id, score) tuples per query.
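
The scoring and ranking steps can be sketched in NumPy as a brute-force equivalent. This assumes normalized vectors so that the dot product equals cosine similarity; the `normalize` and `search` functions here are illustrative stand-ins, not the GGML compute graph itself.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def search(data, queries, limit):
    """Return, for each query, the top `limit` (id, score) tuples by descending score."""
    scores = queries @ data.T  # shape: (number of queries, number of rows)
    limit = min(limit, data.shape[0])
    # Sort row ids by descending score and keep the first `limit` of them
    order = np.argsort(-scores, axis=1)[:, :limit]
    return [
        [(int(i), float(scores[q, i])) for i in order[q]]
        for q in range(scores.shape[0])
    ]


data = normalize(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32))
queries = normalize(np.array([[1.0, 0.0]], dtype=np.float32))

results = search(data, queries, limit=2)
```

For the single query above, row 0 is an exact match and row 2 lies at 45 degrees, so they rank first and second.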

count(self)

Returns the number of active (non-deleted) embeddings in the index.

save(self, path)

Saves the embeddings data and any delete list as a GGUF file at the specified path.

close(self)

Frees all GGML resources (buffers, allocator, backend, context) and sets the backend to None.

GGMLTensors Helper Class

The GGMLTensors class (line 101) manages the low-level GGML tensor operations including:

createcontext() - Allocates a GGML context with space for tensor and graph overhead
createbackend() - Initializes a GPU backend (falling back to CPU with thread count set to os.cpu_count())
createtensors(data) - Creates query and data tensors with optional quantization
creategraph() - Builds a matrix multiplication compute graph (ggml_mul_mat)
tensortype(data) - Maps the quantization setting to a GGML data type constant
copy(inputs, outputs, tensortype, offset) - Copies and optionally quantizes data into backend tensors
chunk(queries) - Splits queries into batches of querysize
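
As an illustration of the batching idea behind chunk, the sketch below splits a query array into consecutive slices of at most querysize rows. It is a standalone function written for this page; how the actual method handles a short final batch is not described here.

```python
import numpy as np


def chunk(queries, querysize):
    """Yield consecutive slices of at most `querysize` rows."""
    for start in range(0, len(queries), querysize):
        yield queries[start:start + querysize]


queries = np.zeros((150, 384), dtype=np.float32)
sizes = [len(batch) for batch in chunk(queries, 64)]
```

With 150 queries and a querysize of 64, this produces two full batches and one final batch of 22.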

Usage Examples

Basic Usage

import numpy as np
from txtai.ann.dense import GGML

# Configuration for GGML backend
config = {
    "backend": "ggml",
    "dimensions": 384,
    "ggml": {
        "gpu": True,
        "querysize": 64,
        "quantize": "Q8_0"
    }
}

# Create and build the index
ann = GGML(config)
embeddings = np.random.rand(1000, 384).astype(np.float32)
ann.index(embeddings)

# Search for similar vectors
queries = np.random.rand(2, 384).astype(np.float32)
results = ann.search(queries, limit=10)
# results: list of [(id, score), ...] per query

# Save and reload
ann.save("/tmp/index.gguf")
ann2 = GGML(config)
ann2.load("/tmp/index.gguf")

# Append new embeddings
new_embeddings = np.random.rand(100, 384).astype(np.float32)
ann2.append(new_embeddings)

# Get count
print(ann2.count())  # 1100

# Cleanup
ann2.close()

Related Pages

Principle:Neuml_Txtai_ANN_Backend_Architecture
