Environment: NeuML txtai GPU Accelerator Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Acceleration, Deep_Learning |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
GPU-accelerated environment supporting NVIDIA CUDA, Apple Metal (MPS), and Intel XPU devices for hardware-accelerated embedding generation, model inference, and ANN index operations.
Description
This environment extends the core Python environment with GPU acceleration. txtai implements a multi-tier device detection strategy: it first checks for NVIDIA CUDA, then Apple Metal Performance Shaders (MPS), then Intel XPU. When no accelerator is detected, all operations fall back to CPU transparently. GPU acceleration significantly improves performance for embedding generation, model inference (LLM pipelines), and certain ANN backends (torch, GGML). The `bitsandbytes` library enables int8 and 4-bit quantization on CUDA devices to reduce VRAM requirements.
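The detection cascade described above can be sketched as a standalone function. The probe callables below are hypothetical stand-ins for `torch.cuda.is_available()`, `torch.backends.mps.is_available()`, and `torch.xpu.is_available()`; this is a dependency-free illustration of the priority order, not txtai's actual implementation:

```python
def pick_device(probes):
    """Return the first available accelerator in priority order, else 'cpu'.

    `probes` maps device names to zero-argument availability checks,
    standing in for the torch availability calls used by the real code.
    """
    for name in ("cuda", "mps", "xpu"):
        check = probes.get(name)
        if check and check():
            return name
    return "cpu"

# Example: CUDA unavailable, MPS available -> cascade stops at "mps"
device = pick_device({"cuda": lambda: False, "mps": lambda: True, "xpu": lambda: False})
```

Because the checks run in a fixed order, a machine with both CUDA and MPS probes reporting available would always resolve to CUDA first.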
Usage
Use this environment when running model inference or training at scale. GPU acceleration is recommended for embedding generation on large document sets, LLM text generation, model fine-tuning via HFTrainer, and GPU-backed ANN indexes (torch, GGML backends). CPU fallback is automatic but substantially slower for these operations.
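For illustration, a txtai embeddings configuration that opts into GPU vectorization might look like the fragment below. The `gpu` vector option (boolean or device id) is assumed here; verify option names against your installed txtai version:

```yaml
# Hypothetical embeddings configuration; confirm options against
# the txtai release in use.
path: sentence-transformers/all-MiniLM-L6-v2
gpu: true        # or a device id such as 0, or false to force CPU
content: true
```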
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended), macOS, Windows | Linux for CUDA; macOS for MPS |
| Hardware (CUDA) | NVIDIA GPU with CUDA support | Any modern NVIDIA GPU; VRAM depends on model size |
| Hardware (MPS) | Apple Silicon (M1/M2/M3/M4) | macOS 12.3+ required for MPS support |
| Hardware (XPU) | Intel Arc / Data Center GPU | Requires PyTorch >= 2.4 with XPU support |
| CUDA Toolkit | Compatible with PyTorch version | Typically CUDA 11.8 or 12.x |
Dependencies
System Packages
- NVIDIA CUDA Toolkit (for CUDA GPUs)
- NVIDIA cuDNN (for CUDA GPUs)
Python Packages
- `torch` >= 2.4 (with CUDA/MPS/XPU support compiled)
- `bitsandbytes` >= 0.42.0 (optional, for int8/4-bit quantization on CUDA)
- `ggml-py` >= 0.9.4 (optional, for GGML GPU-accelerated ANN backend)
Environment Variables
No credentials are required. The following environment variables control GPU behavior:
- `PYTORCH_MPS_DISABLE`: Set to `1` to disable Apple MPS device even when available
- `LLAMA_NO_METAL`: Set to `1` to disable llama.cpp Metal acceleration on macOS
- `OMP_NUM_THREADS`: Set to `1` on macOS/Windows to prevent Faiss segmentation faults
- `KMP_DUPLICATE_LIB_OK`: Set to `TRUE` on macOS/Windows to work around OMP Error #15
- `CUDA_VISIBLE_DEVICES`: Standard NVIDIA env var to restrict which GPUs are visible
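These variables must be set before the relevant libraries are imported. A minimal pattern, using `setdefault`-style assignment for the Faiss workarounds so user-provided values are respected (mirroring what txtai's own Faiss module does):

```python
import os

# Faiss workarounds: only fill in defaults, never overwrite user settings.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")

# Opt out of Apple MPS even when the hardware supports it.
os.environ["PYTORCH_MPS_DISABLE"] = "1"

# Restrict visibility to the first NVIDIA GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

Setting these in a shell profile or process supervisor works equally well; the key constraint is that they are in place before `torch`, `faiss`, or llama.cpp bindings initialize.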
Quick Install
```shell
# GPU-enabled PyTorch (CUDA 12.x)
pip install torch torchvision

# With quantization support
pip install txtai[ann]  # includes bitsandbytes

# CPU-only (explicitly, for smaller installs)
pip install torch --index-url https://download.pytorch.org/whl/cpu
```
Code Evidence
Multi-tier accelerator detection from `src/python/txtai/models/models.py:150-180`:
```python
@staticmethod
def hasaccelerator():
    return torch.cuda.is_available() or Models.hasmpsdevice() or bool(Models.finddevice())

@staticmethod
def hasmpsdevice():
    return os.environ.get("PYTORCH_MPS_DISABLE") != "1" and torch.backends.mps.is_available()

@staticmethod
def finddevice():
    return next((device for device in ["xpu"] if hasattr(torch, device) and getattr(torch, device).is_available()), None)
```
Device reference resolution from `src/python/txtai/models/models.py:128-136`:
```python
return (
    deviceid
    if isinstance(deviceid, str)
    else (
        "cpu"
        if deviceid < 0
        else f"cuda:{deviceid}" if torch.cuda.is_available() else "mps" if Models.hasmpsdevice() else Models.finddevice()
    )
)
```
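The same resolution logic, restated as a standalone function with availability flags passed in explicitly (a hypothetical helper, not txtai's API), makes the precedence easier to trace:

```python
def resolve_device(deviceid, cuda=False, mps=False, xpu=False):
    """Mirror the resolution above: strings pass through unchanged,
    negative ids request CPU, otherwise pick the first available
    accelerator in CUDA > MPS > XPU order (None when nothing matches)."""
    if isinstance(deviceid, str):
        return deviceid              # e.g. "cuda:1" or "mps" passes through
    if deviceid < 0:
        return "cpu"                 # negative id explicitly requests CPU
    if cuda:
        return f"cuda:{deviceid}"    # numeric id binds to that CUDA device
    if mps:
        return "mps"
    return "xpu" if xpu else None
```

Note that a numeric device id only selects a specific GPU on CUDA; on MPS and XPU the id collapses to the generic device string, matching the expression above.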
macOS/Windows OpenMP workaround from `src/python/txtai/ann/dense/faiss.py:10-15`:
```python
if platform.system() == "Darwin" or os.name == "nt":
    # Workaround for a Faiss issue causing segmentation faults
    os.environ["OMP_NUM_THREADS"] = os.environ.get("OMP_NUM_THREADS", "1")

    # Workaround for a Faiss issue with OMP: Error #15
    os.environ["KMP_DUPLICATE_LIB_OK"] = os.environ.get("KMP_DUPLICATE_LIB_OK", "TRUE")
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Insufficient GPU VRAM for model | Reduce batch size, enable quantization, or use smaller model |
| Faiss segmentation fault on macOS | OpenMP threading conflict | Set `OMP_NUM_THREADS=1` before import |
| `OMP: Error #15: Initializing libiomp5` | Duplicate OpenMP libraries | Set `KMP_DUPLICATE_LIB_OK=TRUE` |
| `MPS backend out of memory` | Insufficient Apple Silicon memory | Set `PYTORCH_MPS_DISABLE=1` to fall back to CPU |
| `bitsandbytes is not available` | Missing quantization library | `pip install txtai[ann]` or `pip install bitsandbytes` |
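The "reduce batch size" mitigation for `CUDA out of memory` can be automated with a retry loop. A minimal sketch with a hypothetical `encode` callable; `MemoryError` stands in for `torch.cuda.OutOfMemoryError` to keep the example dependency-free:

```python
def encode_with_backoff(encode, batch, min_batch_size=1):
    """Retry a batched operation with progressively smaller sub-batches
    when it raises MemoryError, halving the size on each failure."""
    size = len(batch)
    while size >= min_batch_size:
        try:
            # Process the batch in chunks of the current size.
            return [r for i in range(0, len(batch), size)
                    for r in encode(batch[i:i + size])]
        except MemoryError:
            size //= 2  # halve the chunk size and retry from the start
    raise MemoryError("batch size reduced below minimum without success")

# Usage: an encoder that (for demonstration) fails on batches larger than 2
def fake_encode(chunk):
    if len(chunk) > 2:
        raise MemoryError()
    return [x * 2 for x in chunk]

results = encode_with_backoff(fake_encode, [1, 2, 3, 4, 5])
```

On a real CUDA setup, catching the framework's OOM exception and also clearing cached allocations between retries is the usual refinement; the loop structure stays the same.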
Compatibility Notes
- NVIDIA CUDA: Primary accelerator; full support for all GPU operations including quantization
- Apple MPS: Supported for model inference and embedding generation; can be disabled via `PYTORCH_MPS_DISABLE=1`
- Intel XPU: Requires PyTorch >= 2.4 with XPU support; detected as fallback after CUDA and MPS
- Quantization: `bitsandbytes` int8/4-bit quantization only works on CUDA devices, not MPS or XPU
- llama.cpp Metal: Enabled by default on macOS; disable with `LLAMA_NO_METAL=1` if unstable
- Multi-GPU: Sentence-transformers multiprocessing pooling is only enabled when `device_count > 1`