Implementation:Neuml Txtai GGML ANN
| Knowledge Sources | |
|---|---|
| Domains | Vector_Search, GPU_Computing |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
GGML is a GPU-accelerated approximate nearest neighbor (ANN) index that stores embeddings as GGML tensors with optional quantization for reduced memory usage and faster computation.
Description
The GGML class inherits from ANN and provides a vector similarity search backend powered by the GGML tensor library. It creates matrix multiplication compute graphs on GPU or CPU backends to perform batched cosine similarity search across stored embeddings. The class delegates actual tensor management to a companion GGMLTensors helper class (line 101), which handles GGML context creation, backend initialization, tensor allocation, quantization, and GGUF file serialization.
GGML supports multiple quantization formats (Q4_0, Q8_0, F32, etc.) through the quantize configuration parameter, enabling significant memory savings at the cost of minor precision loss. The backend automatically falls back from GPU to CPU when no accelerated backend is available.
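The memory/precision trade-off can be illustrated with a small NumPy sketch of Q8_0-style block quantization. This is not the code path txtai or GGML runs; it assumes the commonly described Q8_0 layout (blocks of 32 values, each stored as int8 with one float16 scale per block), and the function names `quantize_q8_0` and `dequantize_q8_0` are hypothetical.

```python
import numpy as np

BLOCK = 32  # assumed Q8_0 block size


def quantize_q8_0(x):
    """Quantize a float32 vector (length divisible by BLOCK) to int8 blocks plus per-block scales."""
    blocks = x.reshape(-1, BLOCK)
    # One scale per block so the largest magnitude in the block maps to 127
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    # Guard against division by zero for all-zero blocks
    safe = np.where(scales == 0, 1.0, scales)
    q = np.round(blocks / safe).astype(np.int8)
    return q, scales.astype(np.float16)


def dequantize_q8_0(q, scales):
    """Rebuild an approximate float32 vector from int8 blocks and scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)


rng = np.random.default_rng(0)
vector = rng.random(384, dtype=np.float32)

q, scales = quantize_q8_0(vector)
restored = dequantize_q8_0(q, scales)

original_bytes = vector.nbytes              # 384 values * 4 bytes
quantized_bytes = q.nbytes + scales.nbytes  # 384 int8 values + 12 float16 scales
max_error = float(np.abs(vector - restored).max())

print(original_bytes, quantized_bytes, max_error)
```

Under these assumptions the quantized vector takes roughly a quarter of the float32 storage, while the per-value reconstruction error stays within about half a quantization step.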
Usage
Use the GGML backend when you need a self-contained, file-based ANN index with optional GPU acceleration and quantization. It is well-suited for environments where PostgreSQL or SQLite are not available, and where you want to leverage GGML's efficient tensor operations and GGUF file format for persistence. Configure quantize to reduce memory footprint for large embedding collections.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/ann/dense/ggml.py
- Lines: 1-570
Signature
class GGML(ANN):
    """
    Builds an ANN index backed by GGML.
    """
    def __init__(self, config):
        super().__init__(config)
        if not LIBGGML:
            raise ImportError('GGML is not available - install "ann" extra to enable')
Import
from txtai.ann.dense import GGML
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | dict | Yes | Index configuration dictionary containing backend settings, dimensions, and optional keys such as gpu (bool), querysize (int), and quantize (bool/int/str) |
Outputs
| Name | Type | Description |
|---|---|---|
| self.backend | GGMLTensors | Internal tensor manager handling GGML context, buffers, and compute graph |
| self.config | dict | Updated configuration with offset, build, and update metadata after indexing |
Key Methods
load(self, path)
Creates a new GGMLTensors backend and loads an existing GGUF file from the given path. The tensors, queries buffer, and compute graph are restored from the serialized data.
index(self, embeddings)
Creates a new GGMLTensors backend and indexes the provided embeddings numpy array. Sets the offset in config to the number of indexed embeddings and records build metadata including the GGML version.
append(self, embeddings)
Appends new embeddings to the existing tensor data by merging old and new tensors into a fresh GGML context. Updates the offset in config accordingly.
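The bookkeeping behind an append can be sketched in NumPy as follows. This only models the merge and the offset update described above; it does not touch GGML contexts, and the `config` dictionary here is a stand-in with a single assumed `offset` key.

```python
import numpy as np

rng = np.random.default_rng(0)
existing = rng.random((1000, 8), dtype=np.float32)
new = rng.random((100, 8), dtype=np.float32)

# Stand-in for the index configuration after an initial index() call
config = {"offset": len(existing)}

# New rows receive ids starting at the current offset
first_new_id = config["offset"]

# Merge old and new rows into one fresh array, then advance the offset
merged = np.concatenate([existing, new])
config["offset"] += len(new)

print(merged.shape, config["offset"])
```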
delete(self, ids)
Marks the given ids as deleted. Deleted rows are zeroed out during search rather than physically removed from the tensor.
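One way to picture this soft-delete behavior is the NumPy sketch below. It assumes that "zeroed out" means the similarity scores of deleted rows are set to zero at query time; the actual implementation may zero the data differently. The `deleted` list is a hypothetical stand-in for the stored delete list.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((5, 4), dtype=np.float32)
data /= np.linalg.norm(data, axis=1, keepdims=True)  # unit-length rows

query = data[2:3]  # a query identical to row 2
deleted = [2]      # hypothetical delete list

# Cosine similarity of the query against every stored row
scores = query @ data.T

# Soft delete: zero the scores of deleted rows, leave the array itself untouched
scores[:, deleted] = 0

best = int(np.argmax(scores, axis=1)[0])
active = data.shape[0] - len(deleted)

print(best, active)
```

Because rows are never physically removed, the ids of the remaining rows stay stable, and the active count is the stored row count minus the number of deleted ids.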
search(self, queries, limit)
Runs batched matrix multiplication of queries against the stored embeddings tensor. Scores are sorted in descending order and the top limit results are returned as one list of (id, score) tuples per query.
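The search step can be approximated in plain NumPy as a minimal sketch, assuming the stored vectors and queries are already normalized so that a dot product equals cosine similarity. The helper names `normalize` and `search` are illustrative and not part of txtai.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def search(data, queries, limit):
    """Return the top `limit` (id, score) tuples per query, highest score first."""
    # One matrix multiplication scores every query against every stored row
    scores = queries @ data.T
    # Sort each row of scores in descending order and keep the first `limit` ids
    ids = np.argsort(-scores, axis=1)[:, :limit]
    return [
        [(int(i), float(scores[row, i])) for i in ranked]
        for row, ranked in enumerate(ids)
    ]


rng = np.random.default_rng(0)
data = normalize(rng.random((100, 8), dtype=np.float32))
queries = data[[3, 42]]  # reuse two stored rows as queries

results = search(data, queries, limit=5)
print(results[0][0][0], results[1][0][0])
```

Since each query is a copy of a stored row, that row's id is ranked first with a score of approximately 1.0.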
count(self)
Returns the number of active (non-deleted) embeddings in the index.
save(self, path)
Saves the embeddings data and any delete list as a GGUF file at the specified path.
close(self)
Frees all GGML resources (buffers, allocator, backend, context) and sets the backend to None.
GGMLTensors Helper Class
The GGMLTensors class (line 101) manages the low-level GGML tensor operations including:
- createcontext() - Allocates a GGML context with space for tensor and graph overhead
- createbackend() - Initializes a GPU backend (falling back to CPU with thread count set to os.cpu_count())
- createtensors(data) - Creates query and data tensors with optional quantization
- creategraph() - Builds a matrix multiplication compute graph (ggml_mul_mat)
- tensortype(data) - Maps the quantization setting to a GGML data type constant
- copy(inputs, outputs, tensortype, offset) - Copies and optionally quantizes data into backend tensors
- chunk(queries) - Splits queries into batches of querysize
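The batching performed by chunk can be sketched as a simple generator. This is an illustration of the idea rather than the method's actual body, and the standalone `chunk(queries, querysize)` signature is an assumption (the real method reads querysize from the configuration).

```python
import numpy as np


def chunk(queries, querysize):
    """Yield consecutive slices of at most `querysize` query rows."""
    for start in range(0, len(queries), querysize):
        yield queries[start:start + querysize]


queries = np.zeros((150, 4), dtype=np.float32)
sizes = [len(batch) for batch in chunk(queries, querysize=64)]
print(sizes)  # [64, 64, 22]
```

Fixed-size batches keep the query tensor at a fixed shape, which fits the description of a single compute graph that is built once and reused across searches.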
Usage Examples
Basic Usage
import numpy as np
from txtai.ann.dense import GGML
# Configuration for GGML backend
config = {
    "backend": "ggml",
    "dimensions": 384,
    "ggml": {
        "gpu": True,
        "querysize": 64,
        "quantize": "Q8_0"
    }
}
# Create and build the index
ann = GGML(config)
embeddings = np.random.rand(1000, 384).astype(np.float32)
ann.index(embeddings)
# Search for similar vectors
queries = np.random.rand(2, 384).astype(np.float32)
results = ann.search(queries, limit=10)
# results: list of [(id, score), ...] per query
# Save and reload
ann.save("/tmp/index.gguf")
ann2 = GGML(config)
ann2.load("/tmp/index.gguf")
# Append new embeddings
new_embeddings = np.random.rand(100, 384).astype(np.float32)
ann2.append(new_embeddings)
# Get count
print(ann2.count()) # 1100
# Cleanup
ann2.close()