Implementation:Neuml Txtai Llama Vectors

Knowledge Sources	Neuml_Txtai
Domains	Embeddings, Vectors, GGUF, Local Inference
Last Updated	2026-02-10 01:00 GMT

Overview

Concrete tool, provided by txtai, for building embedding vectors from GGUF models with llama.cpp.

Description

The LlamaCpp class extends the base Vectors class to generate embeddings using the llama.cpp library via the llama-cpp-python binding. This enables running quantized GGUF-format embedding models efficiently on CPU or GPU without requiring PyTorch.

Key features:

Model detection: The static ismodel method identifies llama.cpp models by checking if the path ends with ".gguf" (case-insensitive).
Hugging Face Hub download: If the model path is not a local file, the download method splits the path into a repo ID and a filename, then fetches the file from the HF Hub via hf_hub_download.
Configurable model parameters: The loadmodel method accepts additional parameters via the vectors config key and sets sensible defaults:
- n_ctx: Defaults to the maxlength config value if set, otherwise 0 (llama.cpp then uses the model's training context length).
- n_batch: Defaults to the encodebatch config value if set, otherwise 64.
- n_gpu_layers: Defaults to -1 (all layers on GPU) when GPU is enabled, or 0 when disabled. GPU detection also respects the LLAMA_NO_METAL environment variable.
- verbose: Defaults to False.
Built-in batching: The encode method delegates to model.embed(data), which uses llama.cpp's internal batching via the n_batch parameter.
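
The detection, path-splitting, and default-resolution steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not txtai's actual source: the helper names is_gguf, split_hub_path, and resolve_model_args are hypothetical, the split assumes the first two path segments form the repo ID, and the LLAMA_NO_METAL check is left out because its exact semantics are not described here.

```python
def is_gguf(path):
    # True when path is a string ending in ".gguf", ignoring case
    return isinstance(path, str) and path.lower().endswith(".gguf")


def split_hub_path(path):
    # Assumption: "user/repo/file.gguf" -> repo ID "user/repo", filename "file.gguf"
    parts = path.split("/")
    return "/".join(parts[:2]), "/".join(parts[2:])


def resolve_model_args(config):
    # Start from user-supplied llama.cpp arguments, then fill in missing defaults
    args = dict(config.get("vectors", {}))
    args.setdefault("n_ctx", config.get("maxlength", 0))
    args.setdefault("n_batch", config.get("encodebatch", 64))
    args.setdefault("n_gpu_layers", -1 if config.get("gpu", True) else 0)
    args.setdefault("verbose", False)
    return args


# Hypothetical configuration used only for illustration
config = {"path": "user/repo/model.Q4_K_M.GGUF", "maxlength": 512, "gpu": False}
print(is_gguf(config["path"]))
print(split_hub_path("user/repo/model.gguf"))
print(resolve_model_args(config))
```

In a real download step, the repo ID and filename pair would be passed to hf_hub_download(repo_id=..., filename=...), which returns the cached local file path.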

Usage

Use LlamaCpp vectors when you want to run quantized GGUF embedding models for memory-efficient, fast inference. This is ideal for deploying embedding models on CPU-only machines, using quantized models to reduce memory footprint, or running models in environments without PyTorch.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/vectors/dense/llama.py

Signature

class LlamaCpp(Vectors):
    @staticmethod
    def ismodel(path) -> bool
    def __init__(self, config, scoring, models)
    def loadmodel(self, path) -> Llama
    def encode(self, data, category=None) -> ndarray
    def download(self, path) -> str

Import

from txtai.vectors.dense.llama import LlamaCpp

I/O Contract

Inputs

Name	Type	Required	Description
config	dict	Yes	Configuration dictionary. Must include path (str, local GGUF file path or HF Hub path like "user/repo/model.gguf"). Optional keys: vectors (dict of llama.cpp model args), maxlength (int for context length), encodebatch (int, default 64), gpu (bool, default True).
scoring	Scoring	No	Optional scoring instance for token weighting.
models	object	No	Shared models cache instance.
data	list[str]	Yes (encode)	List of text strings to generate embeddings for.
category	str	No	Optional category hint (not used).
path (ismodel)	str	Yes (ismodel)	File path to check for ".gguf" extension.

Outputs

Name	Type	Description
embeddings	ndarray (float32)	NumPy array of embedding vectors with shape (n, dimensions).
ismodel	bool	True if the path ends with ".gguf" (case-insensitive).
model	Llama	Loaded llama.cpp model instance with embedding mode enabled.
local_path	str	Cached local file path after downloading from HF Hub.
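
The output contract for encode can be illustrated with a stand-in model. The sketch below assumes, per the description, that encode wraps the result of model.embed(data) in a float32 NumPy array. FakeLlama and its four-dimensional placeholder vectors are hypothetical and exist only so the example runs without a GGUF file or llama-cpp-python installed.

```python
import numpy as np


class FakeLlama:
    """Hypothetical stand-in for a llama.cpp model loaded in embedding mode."""

    def embed(self, data):
        # Return one fixed-size vector per input text (values are placeholders)
        return [[float(len(text)), 0.0, 1.0, 2.0] for text in data]


def encode(model, data):
    # Convert the list of per-text vectors into a single float32 matrix
    return np.array(model.embed(data), dtype=np.float32)


vectors = encode(FakeLlama(), ["first text", "second text", "third"])
print(vectors.shape, vectors.dtype)
```

Here three input strings with four-dimensional placeholder vectors yield an array of shape (3, 4), matching the (n, dimensions) layout in the table above.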

Usage Examples

from txtai.embeddings import Embeddings

# Use a GGUF embedding model from Hugging Face Hub
embeddings = Embeddings({
    "path": "second-state/All-MiniLM-L6-v2-Embedding-GGUF/all-MiniLM-L6-v2-Q4_K_M.gguf",
    "vectors": {
        "n_ctx": 512,
        "n_gpu_layers": -1
    }
})

# Index documents
embeddings.index([
    (0, "machine learning with quantized models", None),
    (1, "efficient inference on edge devices", None),
    (2, "natural language understanding tasks", None),
])

# Search
results = embeddings.search("quantized model inference", limit=5)

# Use a local GGUF file
embeddings = Embeddings({
    "path": "/path/to/model.gguf",
    "gpu": False
})
