Heuristic:Neuml Txtai LLM Context Window Fallback
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Troubleshooting, Memory_Management |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Automatic context window fallback strategy for llama.cpp models: txtai first attempts the maximum context (`n_ctx_train`), then falls back to the default context size when there is not enough memory (most often GPU memory) to load the model.
Description
When loading a llama.cpp GGUF model, txtai defaults to `n_ctx=0`, which instructs llama.cpp to use the model's full training context length (`n_ctx_train`). This maximizes the context window available for generation, but large context windows require significant memory, especially on GPU. If model loading fails with a `ValueError`, which is how a failed memory allocation surfaces, txtai catches the exception and retries without the `n_ctx` parameter, falling back to llama.cpp's built-in default (typically 512 or 2048 tokens). The result is a "try maximum, fall back gracefully" pattern that avoids OOM crashes while maximizing capability when resources allow.
Usage
This heuristic applies when using llama.cpp models for LLM generation or embedding. It fires automatically during model loading. If you need a specific context window size, set a non-zero `n_ctx` explicitly in the model kwargs: this bypasses the fallback logic, and a load failure is then raised to the caller instead of being retried. GPU layer offloading (`n_gpu_layers=-1` by default) matters too, because with all layers on the GPU the model weights and the context compete for the same GPU memory.
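The defaults described above can be reproduced as a small pure function, which makes it easy to check what a given set of kwargs resolves to before any model is loaded. This is a minimal sketch that mirrors the excerpt from `llama.py`; `resolve_defaults` and its `environ` parameter are hypothetical names introduced here for illustration and are not part of txtai.

```python
import os


def resolve_defaults(kwargs, environ=None):
    """Return a copy of kwargs with the documented defaults applied.

    Hypothetical helper: it mirrors the default handling in the excerpt
    but does not load a model.
    """
    environ = os.environ if environ is None else environ
    resolved = dict(kwargs)

    # n_ctx=0 asks llama.cpp for the full training context (n_ctx_train)
    resolved["n_ctx"] = resolved.get("n_ctx", 0)

    # An explicit `gpu` kwarg wins; otherwise GPU is used unless LLAMA_NO_METAL=1
    use_gpu = resolved.get("gpu", environ.get("LLAMA_NO_METAL") != "1")
    resolved["n_gpu_layers"] = resolved.get("n_gpu_layers", -1 if use_gpu else 0)

    # Keep llama.cpp's model loading output quiet by default
    resolved["verbose"] = resolved.get("verbose", False)
    return resolved


auto = resolve_defaults({}, environ={})                          # fallback eligible
pinned = resolve_defaults({"n_ctx": 4096}, environ={})           # fallback bypassed
cpu_only = resolve_defaults({}, environ={"LLAMA_NO_METAL": "1"})  # no GPU layers
```

Only the `auto` case is eligible for the automatic retry, because its `n_ctx` is still `0` when loading is attempted.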
The Insight (Rule of Thumb)
- Action: Let txtai auto-manage context window size, or set `n_ctx` explicitly if you know your memory constraints.
- Default behavior: `n_ctx=0` (use full training context) with automatic fallback on memory error.
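- GPU layers: Default `n_gpu_layers=-1` (all layers on GPU). The default becomes `0` when the `gpu` kwarg is set to a false value, or when `gpu` is not set and `LLAMA_NO_METAL=1` is present in the environment.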
- Trade-off: Larger context enables longer prompts/completions but requires more memory. The fallback reduces context but prevents crashes.
- Verbose flag: Defaults to `False` to suppress llama.cpp's verbose model loading output.
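To give a feel for the memory side of this trade-off, the following sketch estimates the size of the key/value cache, which grows linearly with `n_ctx`. It assumes a standard transformer with an unquantized 16-bit KV cache; the model dimensions are hypothetical and chosen only for illustration. Model weights and compute buffers are ignored, so real usage is higher.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV cache size: one K and one V tensor per layer,
    each holding n_ctx x n_kv_heads x head_dim values."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Hypothetical model dimensions (assumption, not taken from any model file)
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (512, 8192, 131072):
    size = kv_cache_bytes(n_ctx, **dims) / GIB
    print(f"n_ctx={n_ctx:>6}: ~{size:.4f} GiB of KV cache")
```

With these assumed dimensions the cache costs 128 KiB per token, so a 131072-token context needs about 16 GiB for the cache alone, while a 512-token context needs about 64 MiB. That gap is why the same model can load with full context on a large GPU yet fail on a small one.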
Reasoning
Different GPU hardware has vastly different memory capacities. A model that fits with full context on an A100 (80GB) may not fit on a consumer GPU (8GB). Rather than requiring users to calculate memory budgets and set context sizes manually, txtai implements an optimistic strategy: try the maximum context first, catch the error, and retry with the default context. This approach works because the llama.cpp Python bindings raise a `ValueError` when memory allocation fails during model initialization. The retry cost is minimal (a few seconds) compared to the benefit of automatic context maximization when memory is available.
# From src/python/txtai/pipeline/llm/llama.py:91-110
# Default n_ctx=0 if not already set. This sets n_ctx = n_ctx_train.
kwargs["n_ctx"] = kwargs.get("n_ctx", 0)
# Default GPU layers if not already set
kwargs["n_gpu_layers"] = kwargs.get("n_gpu_layers", -1 if kwargs.get("gpu", os.environ.get("LLAMA_NO_METAL") != "1") else 0)
# Default verbose flag
kwargs["verbose"] = kwargs.get("verbose", False)
# Create llama.cpp instance
try:
    return llama.Llama(model_path=path, **kwargs)
except ValueError as e:
    # Fallback to default n_ctx when not enough memory for n_ctx = n_ctx_train
    if not kwargs["n_ctx"]:
        kwargs.pop("n_ctx")
        return llama.Llama(model_path=path, **kwargs)

    # Raise exception if n_ctx manually specified
    raise e
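
The retry logic above can be exercised without a model or GPU by injecting a stub loader. The sketch below is an illustration under stated assumptions, not txtai's code: `load_with_fallback` and `fake_loader` are hypothetical names, and the stub simply pretends that any explicitly passed `n_ctx` cannot be allocated.

```python
def load_with_fallback(loader, path, **kwargs):
    """Try the full training context first, then retry with the loader's default.

    `loader` stands in for the llama.cpp constructor (hypothetical parameter).
    """
    kwargs["n_ctx"] = kwargs.get("n_ctx", 0)
    try:
        return loader(model_path=path, **kwargs)
    except ValueError:
        # Only retry when n_ctx was left at 0 (full training context)
        if not kwargs["n_ctx"]:
            kwargs.pop("n_ctx")
            return loader(model_path=path, **kwargs)
        # A manually chosen n_ctx is respected: surface the failure
        raise


calls = []


def fake_loader(model_path, **kwargs):
    """Stub loader: fails whenever n_ctx is passed, succeeds otherwise."""
    calls.append(dict(kwargs))
    if "n_ctx" in kwargs:
        raise ValueError("simulated allocation failure")
    return {"model_path": model_path, "kwargs": kwargs}


model = load_with_fallback(fake_loader, "model.gguf")
print(calls)  # first attempt with n_ctx=0, second attempt without n_ctx
```

Calling `load_with_fallback(fake_loader, "model.gguf", n_ctx=4096)` against the same stub raises the `ValueError` instead of retrying, which matches the rule that an explicit context size bypasses the fallback.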