Implementation:Neuml Txtai Llama Vectors
| Knowledge Sources | |
|---|---|
| Domains | Embeddings, Vectors, GGUF, Local Inference |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Concrete txtai tool for building embedding vectors from GGUF models with llama.cpp.
Description
The LlamaCpp class extends the base Vectors class to generate embeddings using the llama.cpp library via the llama-cpp-python binding. This enables running quantized GGUF-format embedding models efficiently on CPU or GPU without requiring PyTorch.
Key features:
- Model detection: The static ismodel method identifies llama.cpp models by checking if the path ends with ".gguf" (case-insensitive).
- Hugging Face Hub download: If the model path is not a local file, download splits it into repo ID and filename components and downloads the file from the HF Hub via hf_hub_download.
- Configurable model parameters: The loadmodel method accepts additional parameters via the vectors config key and sets sensible defaults:
  - n_ctx: Defaults to maxlength config value or 0 (which uses the model's training context length).
  - n_batch: Defaults to encodebatch config value or 64.
  - n_gpu_layers: Defaults to -1 (all layers on GPU) when GPU is enabled, or 0 when disabled. GPU detection also respects the LLAMA_NO_METAL environment variable.
  - verbose: Defaults to False.
- Built-in batching: The encode method delegates to model.embed(data), which uses llama.cpp's internal batching via the n_batch parameter.
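The detection, path-splitting, and default-resolution rules listed above can be sketched in plain Python. This is a minimal illustration, not the txtai source: the helper names is_gguf, split_hub_path, and resolve_model_args are hypothetical. Two behaviours are assumptions: the first two path components are treated as the repo ID, and LLAMA_NO_METAL set to "1" turns off the GPU default.

```python
import os


def is_gguf(path):
    """Case-insensitive check for a ".gguf" suffix (hypothetical helper)."""
    return isinstance(path, str) and path.lower().endswith(".gguf")


def split_hub_path(path):
    """Split "user/repo/model.gguf" into (repo_id, filename).

    Assumption: with more than two components, the first two form the repo ID.
    """
    parts = path.split("/")
    repo = 2 if len(parts) > 2 else 1
    return "/".join(parts[:repo]), "/".join(parts[repo:])


def resolve_model_args(config, environ=None):
    """Merge user-supplied llama.cpp arguments with the documented defaults."""
    environ = os.environ if environ is None else environ

    # Start from the optional "vectors" dict; explicit values always win
    args = dict(config.get("vectors") or {})

    # n_ctx=0 asks llama.cpp to use the model's training context length
    args.setdefault("n_ctx", config.get("maxlength") or 0)
    args.setdefault("n_batch", config.get("encodebatch") or 64)

    # Assumption: LLAMA_NO_METAL="1" disables the GPU default when "gpu" is unset
    gpu = config.get("gpu", environ.get("LLAMA_NO_METAL") != "1")
    args.setdefault("n_gpu_layers", -1 if gpu else 0)

    args.setdefault("verbose", False)
    return args


repo_id, filename = split_hub_path("user/repo/model.gguf")
cpu_args = resolve_model_args({"gpu": False, "maxlength": 512}, environ={})
```

Passing environ explicitly keeps the sketch deterministic. The real class reads its settings from the txtai configuration at load time.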
Usage
Use LlamaCpp vectors when you want to run quantized GGUF embedding models for memory-efficient, fast inference. This is ideal for deploying embedding models on CPU-only machines, using quantized models to reduce memory footprint, or running models in environments without PyTorch.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/vectors/dense/llama.py
Signature
class LlamaCpp(Vectors):
    @staticmethod
    def ismodel(path) -> bool
    def __init__(self, config, scoring, models)
    def loadmodel(self, path) -> Llama
    def encode(self, data, category=None) -> ndarray
    def download(self, path) -> str
Import
from txtai.vectors.dense.llama import LlamaCpp
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | dict | Yes | Configuration dictionary. Must include path (str, local GGUF file path or HF Hub path like "user/repo/model.gguf"). Optional keys: vectors (dict of llama.cpp model args), maxlength (int for context length), encodebatch (int, default 64), gpu (bool, default True). |
| scoring | Scoring | No | Optional scoring instance for token weighting. |
| models | object | No | Shared models cache instance. |
| data | list[str] | Yes (encode) | List of text strings to generate embeddings for. |
| category | str | No | Optional category hint (not used). |
| path (ismodel) | str | Yes (ismodel) | File path to check for ".gguf" extension. |
Outputs
| Name | Type | Description |
|---|---|---|
| embeddings | ndarray (float32) | NumPy array of embedding vectors with shape (n, dimensions). |
| ismodel | bool | True if the path ends with ".gguf". |
| model | Llama | Loaded llama.cpp model instance with embedding mode enabled. |
| local_path | str | Cached local file path after downloading from HF Hub. |
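The shape and dtype of the embeddings output can be illustrated without loading a real model. The sketch below uses StubModel, a hypothetical stand-in whose embed method returns one fixed-size list of floats per input. It assumes the list-of-vectors result is converted to a float32 NumPy array, as the table describes.

```python
import numpy as np


class StubModel:
    """Hypothetical stand-in for a loaded model; not part of txtai or llama.cpp."""

    def embed(self, data):
        # One 4-dimensional vector per input text (values are arbitrary)
        return [[float(len(text)), 0.0, 1.0, 2.0] for text in data]


def encode(model, data):
    # Convert the per-text vectors into a float32 array of shape (n, dimensions)
    return np.array(model.embed(data), dtype=np.float32)


vectors = encode(StubModel(), ["first text", "second text", "third"])
print(vectors.shape, vectors.dtype)
```

With a real GGUF model, the second dimension would equal the model's embedding size, not the 4 used here.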
Usage Examples
from txtai.embeddings import Embeddings

# Use a GGUF embedding model from Hugging Face Hub
embeddings = Embeddings({
    "path": "second-state/All-MiniLM-L6-v2-Embedding-GGUF/all-MiniLM-L6-v2-Q4_K_M.gguf",
    "vectors": {
        "n_ctx": 512,
        "n_gpu_layers": -1
    }
})

# Index documents
embeddings.index([
    (0, "machine learning with quantized models", None),
    (1, "efficient inference on edge devices", None),
    (2, "natural language understanding tasks", None),
])

# Search
results = embeddings.search("quantized model inference", limit=5)

# Use a local GGUF file
embeddings = Embeddings({
    "path": "/path/to/model.gguf",
    "gpu": False
})