Heuristic:Neuml Txtai Llama Cpp Context Fallback
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Automatic context window fallback for llama.cpp models when the training context size exceeds available memory.
Description
txtai's llama.cpp integration implements a two-stage context window strategy. By default, `n_ctx` is set to 0, which tells llama.cpp to use the model's training context length (`n_ctx_train`). If model creation then raises a `ValueError` (typically because there is not enough memory for that context), txtai retries without passing `n_ctx`, so llama.cpp falls back to its smaller default context window. GPU layer offloading defaults to all layers (`n_gpu_layers=-1`) unless it is disabled with `gpu=False` or, when `gpu` is not passed, by setting the `LLAMA_NO_METAL` environment variable (intended for macOS) to `1`.
Usage
This heuristic applies when using GGUF models via llama.cpp for text generation or embedding. It avoids load failures on machines with limited RAM/VRAM by automatically retrying with a smaller context window. Users can override this by setting `n_ctx` to a specific non-zero value, which disables the fallback. Passing `n_ctx=0` explicitly behaves like the default, because the fallback check only tests whether `n_ctx` is falsy.
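The default, fallback, and override paths described above can be exercised without loading a real model. The sketch below injects a hypothetical `fake_loader` in place of `llama.Llama`; the `budget` and `train_ctx` parameters are illustrative assumptions, not llama.cpp settings.

```python
def load_with_fallback(factory, **kwargs):
    """Retry pattern described above, with the loader injected for testing."""
    # 0 is read by llama.cpp as "use the training context length".
    kwargs["n_ctx"] = kwargs.get("n_ctx", 0)
    try:
        return factory(**kwargs)
    except ValueError:
        # Retry only when n_ctx is falsy (unset or explicitly 0).
        if not kwargs["n_ctx"]:
            kwargs.pop("n_ctx")
            return factory(**kwargs)
        # A caller-supplied, non-zero n_ctx is respected: re-raise.
        raise


def fake_loader(n_ctx=512, budget=4096, train_ctx=32768):
    """Hypothetical stand-in for llama.Llama: fails above a token budget."""
    effective = train_ctx if n_ctx == 0 else n_ctx
    if effective > budget:
        raise ValueError(f"cannot allocate context of {effective} tokens")
    return {"n_ctx": effective}
```

With this stub, `load_with_fallback(fake_loader)` ends up with the loader's own default of 512 tokens, while `load_with_fallback(fake_loader, n_ctx=8192)` raises `ValueError` because the caller asked for a specific size.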
The Insight (Rule of Thumb)
- Action: Let txtai manage `n_ctx` automatically (default). If you need a specific context size, pass `n_ctx=<value>` explicitly.
- Value: Default `n_ctx=0` (use the model's training context). The fallback drops `n_ctx` entirely, so the `Llama` constructor default applies (512 tokens in llama-cpp-python).
- Trade-off: The automatic fallback shrinks the context window without any warning, which limits the combined length of prompt and response. An explicit non-zero `n_ctx` gives control, but the `ValueError` is re-raised if memory is insufficient.
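Because the fallback is silent, a caller that depends on a long context may want to check the window size after loading. The following is a minimal sketch, assuming the loaded model object exposes an `n_ctx()` method as llama-cpp-python's `Llama` class does. `check_context`, `required_tokens`, and `StubModel` are hypothetical names, and how to reach the underlying model object from a txtai pipeline is not covered here.

```python
import warnings


def check_context(model, required_tokens):
    """Warn when the loaded context window is smaller than the caller needs.

    `model` is any object with an `n_ctx()` method (assumed to match
    llama-cpp-python's `Llama.n_ctx()`).
    """
    actual = model.n_ctx()
    if actual < required_tokens:
        warnings.warn(
            f"context window is {actual} tokens, below the {required_tokens} required"
        )
    return actual


class StubModel:
    """Hypothetical stand-in for a model that fell back to 512 tokens."""

    def n_ctx(self):
        return 512
```

Calling `check_context(StubModel(), 4096)` emits a warning and returns 512, which makes the reduced window visible instead of surfacing later as a failed or clipped generation.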
Additional defaults:
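- `n_gpu_layers`: -1 (offload all layers) by default; 0 when `gpu=False` is passed or, if `gpu` is omitted, when `LLAMA_NO_METAL=1` is set. The code shown here does not probe for a GPU.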
- `verbose`: False by default (suppress llama.cpp output)
- GGUF models are auto-downloaded from the Hugging Face Hub if not found locally
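The precedence between `n_gpu_layers`, `gpu`, and `LLAMA_NO_METAL` is easy to misread, so here is a small sketch that isolates it. The helper name `default_gpu_layers` and its injectable `environ` argument are assumptions made for illustration; the expression itself follows the defaults listed above.

```python
import os


def default_gpu_layers(kwargs, environ=None):
    """Resolve n_gpu_layers following the defaults listed above."""
    environ = os.environ if environ is None else environ
    # An explicit n_gpu_layers always wins.
    if "n_gpu_layers" in kwargs:
        return kwargs["n_gpu_layers"]
    # Otherwise `gpu` decides; when it is absent, the GPU is used
    # unless LLAMA_NO_METAL is exactly "1".
    use_gpu = kwargs.get("gpu", environ.get("LLAMA_NO_METAL") != "1")
    return -1 if use_gpu else 0
```

Note that an explicit `gpu=True` overrides `LLAMA_NO_METAL=1`, since the environment variable is only the default for a missing `gpu` argument.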
Reasoning
GGUF models vary widely in their training context lengths (2K to 128K+ tokens). Allocating the full training context on a machine with limited RAM or VRAM can fail at load time. The fallback makes a successful load far more likely, at the cost of a reduced context window; the retry itself can still fail if even the default context does not fit. This matters most on consumer hardware, where VRAM and RAM are limited. The `-1` default for GPU layers asks llama.cpp to offload every layer, which maximizes GPU use when a GPU backend is available.
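To see why the training context can be too large, it helps to estimate the key/value cache, which grows linearly with `n_ctx`. The sketch below uses the standard estimate of one key and one value vector per layer per token. The model dimensions are hypothetical 7B-class values, and the figure ignores model weights, compute buffers, and any KV cache quantization, so treat it as a rough lower bound on what llama.cpp allocates.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: a key and a value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


# Hypothetical 7B-class dimensions (32 layers, 32 KV heads, head size 128, fp16).
dims = dict(n_layers=32, n_kv_heads=32, head_dim=128)
for n_ctx in (512, 4096, 32768):
    gib = kv_cache_bytes(n_ctx, **dims) / 2**30
    print(f"n_ctx={n_ctx:>6}: {gib:.2f} GiB")
```

With these assumed dimensions the estimate grows from 0.25 GiB at 512 tokens to 2 GiB at 4,096 tokens and 16 GiB at 32,768 tokens, which illustrates how a default of "use the training context" can exceed the memory of a consumer machine.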
Code Evidence
Context fallback logic from `pipeline/llm/llama.py:91-110`:

    # Default n_ctx=0 if not already set. This sets n_ctx = n_ctx_train.
    kwargs["n_ctx"] = kwargs.get("n_ctx", 0)
    # Default GPU layers if not already set
    kwargs["n_gpu_layers"] = kwargs.get("n_gpu_layers", -1 if kwargs.get("gpu", os.environ.get("LLAMA_NO_METAL") != "1") else 0)
    # Default verbose flag
    kwargs["verbose"] = kwargs.get("verbose", False)
    # Create llama.cpp instance
    try:
        return llama.Llama(model_path=path, **kwargs)
    except ValueError as e:
        # Fallback to default n_ctx when not enough memory for n_ctx = n_ctx_train
        if not kwargs["n_ctx"]:
            kwargs.pop("n_ctx")
            return llama.Llama(model_path=path, **kwargs)
        # Raise exception if n_ctx manually specified
        raise e
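Hugging Face Hub download fallback from `pipeline/llm/llama.py:59-77`:

    def download(self, path):
        # Split the path into its "/"-separated segments
        parts = path.split("/")
        # Repo id is the first two segments when there are more than two, otherwise the first
        repo = 2 if len(parts) > 2 else 1
        # The remaining segments form the filename inside that repo
        return hf_hub_download(repo_id="/".join(parts[:repo]), filename="/".join(parts[repo:]))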
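The path-splitting rule in `download` can be checked in isolation without touching the network. This sketch extracts it into a hypothetical helper, `split_hub_path`; the paths used are made-up placeholders rather than real repositories.

```python
def split_hub_path(path):
    """Split a hub path into (repo_id, filename) using the rule in download()."""
    parts = path.split("/")
    # More than two segments: the first two form "owner/name".
    # Otherwise only the first segment is treated as the repo id.
    repo = 2 if len(parts) > 2 else 1
    return "/".join(parts[:repo]), "/".join(parts[repo:])
```

For example, `split_hub_path("owner/repo/sub/model.gguf")` yields `("owner/repo", "sub/model.gguf")`, so files nested in subfolders of a repository keep their relative path as the filename.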