Heuristic:Neuml Txtai Llama Cpp Context Fallback

Knowledge Sources	txtai llama.cpp
Domains	LLMs, Optimization
Last Updated	2026-02-10 00:00 GMT

Overview

Automatic context window fallback for llama.cpp models when the training context size exceeds available memory.

Description

txtai's llama.cpp integration implements a two-stage context window strategy. By default, `n_ctx` is set to 0, which tells llama.cpp to use the model's training context length (`n_ctx_train`). If creating the model raises a `ValueError` (typically because there is not enough memory for that context), txtai retries once without specifying `n_ctx`, falling back to llama.cpp's smaller default context window. GPU layer offloading defaults to all layers (`n_gpu_layers=-1`) unless it is disabled by passing `gpu=False` or by setting the `LLAMA_NO_METAL` environment variable to `1`.

Usage

This heuristic applies when using GGUF models via llama.cpp for text generation or embedding. It avoids load failures on machines with limited RAM/VRAM by automatically retrying with a smaller context window. Users can override this by explicitly setting `n_ctx` to a nonzero value, which disables the fallback: any `ValueError` is then raised to the caller.
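
A minimal usage sketch, assuming txtai's `LLM` pipeline forwards extra keyword arguments such as `n_ctx` to llama.cpp, as the snippet under Code Evidence suggests. The model path is a placeholder, and `llm_kwargs` and `load_llm` are hypothetical helpers introduced only for illustration.

```python
def llm_kwargs(n_ctx=None):
    """Omit n_ctx to keep the automatic fallback; pass a value to pin it."""
    return {} if n_ctx is None else {"n_ctx": n_ctx}


def load_llm(path, n_ctx=None):
    # Imported lazily so the sketch can be imported without txtai installed
    from txtai import LLM

    return LLM(path, **llm_kwargs(n_ctx))


# Automatic: the training context is tried first, then the smaller default
# llm = load_llm("<org>/<repo>/<file>.gguf")

# Pinned: a ValueError from llama.cpp is raised instead of falling back
# llm = load_llm("<org>/<repo>/<file>.gguf", n_ctx=4096)
```

Leaving `n_ctx` out is what keeps the fallback active; passing any nonzero value trades that safety net for a predictable context size.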

The Insight (Rule of Thumb)

Action: Let txtai manage `n_ctx` automatically (default). If you need a specific context size, pass `n_ctx=<value>` explicitly.
Value: Default `n_ctx=0` (use model's training context). Fallback removes `n_ctx` entirely (llama.cpp default ~512).
Trade-off: Automatic fallback reduces context window silently, which means shorter prompts/responses. Explicit `n_ctx` gives control but will raise errors if memory is insufficient.

Additional defaults:

`n_gpu_layers`: -1 (offload all layers to the GPU) by default; 0 when `gpu=False` is passed or `LLAMA_NO_METAL=1` is set
`verbose`: False by default (suppress llama.cpp output)
GGUF models are auto-downloaded from HuggingFace Hub if not found locally
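
The defaults above can be reproduced in isolation. The following is a sketch that mirrors the default resolution shown under Code Evidence; `resolve_defaults` is a hypothetical helper, not part of txtai, and it takes the environment as a plain dictionary so the behavior is easy to inspect.

```python
def resolve_defaults(kwargs, environ):
    """Return a copy of kwargs with the llama.cpp defaults applied."""
    resolved = dict(kwargs)

    # n_ctx=0 asks llama.cpp for the model's training context length
    resolved.setdefault("n_ctx", 0)

    # GPU is on unless gpu=False is passed or LLAMA_NO_METAL is "1"
    gpu = resolved.get("gpu", environ.get("LLAMA_NO_METAL") != "1")
    resolved.setdefault("n_gpu_layers", -1 if gpu else 0)

    # Suppress llama.cpp logging unless requested
    resolved.setdefault("verbose", False)
    return resolved


print(resolve_defaults({}, {}))
print(resolve_defaults({}, {"LLAMA_NO_METAL": "1"}))
print(resolve_defaults({"n_ctx": 4096, "gpu": False}, {}))
```

Any value the caller sets explicitly wins over the default, including an explicit `n_gpu_layers`.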

Reasoning

GGUF models vary widely in their training context lengths (2K to 128K+). Memory for the context grows with `n_ctx`, so allocating the full training context on a machine with limited RAM can fail at load time. The fallback strategy makes it far more likely that a model loads, at the cost of a reduced context; if the retry also fails, the error is still raised. This is particularly important for consumer hardware where VRAM and RAM are limited. The `-1` default for GPU layers offloads every layer to the GPU when one is in use.
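
To make the memory argument concrete, here is a rough estimate of key/value cache size as a function of `n_ctx`. It is a back-of-the-envelope sketch, not a measurement: it assumes a 7B-class model shape (32 layers, 32 KV heads, head dimension 128) and 16-bit cache values, and it ignores model weights and compute buffers. Models that use grouped-query attention have fewer KV heads and need proportionally less.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: keys and values, per layer, per position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Assumed 7B-class shape; adjust for the model in question
for n_ctx in (512, 4096, 32768, 131072):
    size = kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"n_ctx={n_ctx:>6}: {size / GIB:.2f} GiB")
```

Under these assumptions the cache grows from 0.25 GiB at 512 tokens to 64 GiB at 128K tokens, which is why a model's full training context can be out of reach on consumer hardware.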

Code Evidence

Context fallback logic from `pipeline/llm/llama.py:91-110`:

# Default n_ctx=0 if not already set. This sets n_ctx = n_ctx_train.
kwargs["n_ctx"] = kwargs.get("n_ctx", 0)

# Default GPU layers if not already set
kwargs["n_gpu_layers"] = kwargs.get("n_gpu_layers", -1 if kwargs.get("gpu", os.environ.get("LLAMA_NO_METAL") != "1") else 0)

# Default verbose flag
kwargs["verbose"] = kwargs.get("verbose", False)

# Create llama.cpp instance
try:
    return llama.Llama(model_path=path, **kwargs)
except ValueError as e:
    # Fallback to default n_ctx when not enough memory for n_ctx = n_ctx_train
    if not kwargs["n_ctx"]:
        kwargs.pop("n_ctx")
        return llama.Llama(model_path=path, **kwargs)

    # Raise exception if n_ctx manually specified
    raise e
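
The retry behavior can be exercised without a real model. The sketch below keeps the same control flow but swaps in `FakeLlama`, a hypothetical stand-in for `llama.Llama` that raises `ValueError` above an invented memory budget. The training context, the budget, and the 512-token default are assumptions chosen for illustration.

```python
class FakeLlama:
    """Hypothetical stand-in for llama.Llama with an invented memory budget."""

    N_CTX_TRAIN = 32768     # assumed training context length
    MAX_AFFORDABLE = 8192   # assumed largest context that fits in memory

    def __init__(self, model_path, n_ctx=512, **kwargs):
        # n_ctx=0 means "use the training context length"
        self.n_ctx = n_ctx or self.N_CTX_TRAIN
        if self.n_ctx > self.MAX_AFFORDABLE:
            raise ValueError("Failed to create llama_context")


def create(path, **kwargs):
    # Same two-stage strategy: try n_ctx_train, then the library default
    kwargs["n_ctx"] = kwargs.get("n_ctx", 0)
    try:
        return FakeLlama(model_path=path, **kwargs)
    except ValueError:
        if not kwargs["n_ctx"]:
            kwargs.pop("n_ctx")
            return FakeLlama(model_path=path, **kwargs)
        # An explicit n_ctx is never silently reduced
        raise


print(create("model.gguf").n_ctx)               # fallback path
print(create("model.gguf", n_ctx=4096).n_ctx)   # explicit value that fits
try:
    create("model.gguf", n_ctx=16384)           # explicit value that does not fit
except ValueError as error:
    print("explicit n_ctx failed:", error)
```

Note that the fallback is silent: the caller only learns the effective context size by inspecting the loaded model.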

HuggingFace Hub download fallback from `pipeline/llm/llama.py:59-77`:

def download(self, path):
    parts = path.split("/")
    repo = 2 if len(parts) > 2 else 1
    return hf_hub_download(repo_id="/".join(parts[:repo]), filename="/".join(parts[repo:]))
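
To illustrate how a path is split into a Hub repository id and a filename, the sketch below isolates that logic in `split_hub_path`, a hypothetical helper that does no downloading. The example paths are made up.

```python
def split_hub_path(path):
    """Split a model path into (repo_id, filename) using the rule above."""
    parts = path.split("/")

    # More than two parts: the first two form "org/repo"; otherwise only the first
    repo = 2 if len(parts) > 2 else 1
    return "/".join(parts[:repo]), "/".join(parts[repo:])


for example in (
    "org/model-GGUF/model.Q4_K_M.gguf",
    "model-GGUF/model.gguf",
    "org/model-GGUF/subdir/model.gguf",
):
    print(example, "->", split_hub_path(example))
```

Everything after the repository id is treated as the filename, so files in subdirectories of a repository resolve as well.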
