

Implementation:Neuml Txtai Llama Vectors

From Leeroopedia


Knowledge Sources
Domains: Embeddings, Vectors, GGUF, Local Inference
Last Updated: 2026-02-10 01:00 GMT

Overview

A concrete tool, provided by txtai, for building embedding vectors with llama.cpp (GGUF models).

Description

The LlamaCpp class extends the base Vectors class to generate embeddings using the llama.cpp library via the llama-cpp-python binding. This enables running quantized GGUF-format embedding models efficiently on CPU or GPU without requiring PyTorch.

Key features:

  • Model detection: The static ismodel method identifies llama.cpp models by checking if the path ends with ".gguf" (case-insensitive).
  • Hugging Face Hub download: If the model path is not a local file, the download method splits it into a repo ID and a filename, then fetches the file from the Hugging Face Hub via hf_hub_download.
  • Configurable model parameters: The loadmodel method accepts additional parameters via the vectors config key and sets sensible defaults:
    • n_ctx: Defaults to maxlength config value or 0 (which uses the model's training context length).
    • n_batch: Defaults to encodebatch config value or 64.
    • n_gpu_layers: Defaults to -1 (all layers on GPU) when GPU is enabled, or 0 when disabled. GPU detection also respects the LLAMA_NO_METAL environment variable.
    • verbose: Defaults to False.
  • Built-in batching: The encode method delegates to model.embed(data), which uses llama.cpp's internal batching via the n_batch parameter.
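The detection, path-splitting, and default-resolution behavior listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the txtai source: the helper names (is_gguf_model, split_hub_path, resolve_model_args) are hypothetical, the split assumes the first two path components form the repo ID, and the sketch assumes LLAMA_NO_METAL disables GPU use only when set to "1" and that user-supplied vectors values take precedence over defaults.

```python
import os


def is_gguf_model(path):
    """Return True when path is a string ending in '.gguf', ignoring case."""
    return isinstance(path, str) and path.lower().endswith(".gguf")


def split_hub_path(path):
    """Split 'user/repo/model.gguf' into ('user/repo', 'model.gguf')."""
    parts = path.split("/")
    # Assumption: the first two components are the repo ID when more than two exist
    repo = 2 if len(parts) > 2 else 1
    return "/".join(parts[:repo]), "/".join(parts[repo:])


def resolve_model_args(config, environ=None):
    """Apply the documented defaults on top of any user-supplied 'vectors' args."""
    environ = os.environ if environ is None else environ
    args = dict(config.get("vectors") or {})

    # Assumption: GPU is enabled unless the config disables it or LLAMA_NO_METAL is "1"
    gpu = config.get("gpu", environ.get("LLAMA_NO_METAL") != "1")

    # setdefault leaves explicitly supplied values untouched
    args.setdefault("n_ctx", config.get("maxlength", 0))
    args.setdefault("n_batch", config.get("encodebatch", 64))
    args.setdefault("n_gpu_layers", -1 if gpu else 0)
    args.setdefault("verbose", False)
    return args
```

In this sketch, a config of {"maxlength": 512, "vectors": {"n_batch": 8}} resolves to n_ctx=512 and n_batch=8, with the remaining keys taking their defaults.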

Usage

Use LlamaCpp vectors when you want to run quantized GGUF embedding models for memory-efficient, fast inference. This is ideal for deploying embedding models on CPU-only machines, using quantized models to reduce memory footprint, or running models in environments without PyTorch.

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/vectors/dense/llama.py

Signature

class LlamaCpp(Vectors):
    @staticmethod
    def ismodel(path) -> bool
    def __init__(self, config, scoring, models)
    def loadmodel(self, path) -> Llama
    def encode(self, data, category=None) -> ndarray
    def download(self, path) -> str

Import

from txtai.vectors.dense.llama import LlamaCpp

I/O Contract

Inputs

Name | Type | Required | Description
config | dict | Yes | Configuration dictionary. Must include path (str, local GGUF file path or HF Hub path like "user/repo/model.gguf"). Optional keys: vectors (dict of llama.cpp model args), maxlength (int for context length), encodebatch (int, default 64), gpu (bool, default True).
scoring | Scoring | No | Optional scoring instance for token weighting.
models | object | No | Shared models cache instance.
data | list[str] | Yes (encode) | List of text strings to generate embeddings for.
category | str | No | Optional category hint (not used).
path (ismodel) | str | Yes (ismodel) | File path to check for ".gguf" extension.

Outputs

Name | Type | Description
embeddings | ndarray (float32) | NumPy array of embedding vectors with shape (n, dimensions).
ismodel | bool | True if the path ends with ".gguf".
model | Llama | Loaded llama.cpp model instance with embedding mode enabled.
local_path | str | Cached local file path after downloading from HF Hub.
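The embeddings output contract above (a float32 array of shape (n, dimensions)) can be illustrated without loading a real model. The sketch below is an assumption-laden stand-in: FakeLlama is a hypothetical class that mimics an embed call returning one list of floats per input string, and the encode function here is illustrative rather than a copy of the txtai method.

```python
import numpy as np


class FakeLlama:
    """Hypothetical stand-in for a loaded model; returns fixed-size vectors."""

    def __init__(self, dimensions=4):
        self.dimensions = dimensions

    def embed(self, data):
        # One list of floats per input string (values are placeholders)
        return [[float(len(text))] * self.dimensions for text in data]


def encode(model, data):
    """Convert list-of-lists embeddings to a float32 array of shape (n, dimensions)."""
    return np.array(model.embed(data), dtype=np.float32)


vectors = encode(FakeLlama(), ["first text", "second"])
print(vectors.shape, vectors.dtype)
```

With two input strings and a four-dimensional stand-in model, the resulting array has two rows and four columns.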

Usage Examples

from txtai.embeddings import Embeddings

# Use a GGUF embedding model from Hugging Face Hub
embeddings = Embeddings({
    "path": "second-state/All-MiniLM-L6-v2-Embedding-GGUF/all-MiniLM-L6-v2-Q4_K_M.gguf",
    "vectors": {
        "n_ctx": 512,
        "n_gpu_layers": -1
    }
})

# Index documents
embeddings.index([
    (0, "machine learning with quantized models", None),
    (1, "efficient inference on edge devices", None),
    (2, "natural language understanding tasks", None),
])

# Search
results = embeddings.search("quantized model inference", limit=5)

# Use a local GGUF file
embeddings = Embeddings({
    "path": "/path/to/model.gguf",
    "gpu": False
})
