Environment:FlagOpen FlagEmbedding GPU Accelerator Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-09 21:00 GMT |
Overview
GPU-accelerated environment supporting NVIDIA CUDA, Huawei NPU, Moore Threads MUSA, and Apple MPS backends for FlagEmbedding inference.
Description
FlagEmbedding implements multi-backend accelerator detection that selects the first available backend in a fixed priority order: CUDA (NVIDIA) > NPU (Huawei Ascend) > MUSA (Moore Threads) > MPS (Apple Silicon) > CPU. The framework supports multi-GPU inference through process pools and manages FP16/BF16 precision based on device capabilities. The CPU fallback is always available, but FP16 is disabled when running on CPU.
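The priority order can be illustrated with a small, torch-free sketch. The function name `pick_backend` and the `available` mapping are hypothetical and exist only for illustration; the framework itself queries `torch` directly, as quoted under Code Evidence.

```python
from typing import Dict, List

# Backend priority as described above, highest first.
# "cpu" is not listed because it is the unconditional fallback.
PRIORITY: List[str] = ["cuda", "npu", "musa", "mps"]


def pick_backend(available: Dict[str, int]) -> List[str]:
    """Return device strings for the highest-priority backend that has devices.

    `available` maps a backend name to its device count. This is a
    hypothetical input used only to make the sketch self-contained.
    """
    for backend in PRIORITY:
        count = available.get(backend, 0)
        if count > 0:
            return [f"{backend}:{i}" for i in range(count)]
    return ["cpu"]


# CUDA wins over MPS when both are present.
print(pick_backend({"cuda": 2, "mps": 1}))  # ['cuda:0', 'cuda:1']
# Nothing available: fall back to CPU.
print(pick_backend({}))  # ['cpu']
```

Note that the sketch returns every device of the winning backend, which is what allows multi-device inference to be set up without extra configuration.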
Usage
Use this environment for any GPU-accelerated inference with embedders or rerankers. It is recommended for production workloads where throughput and latency matter. Multi-GPU setups are automatically detected and utilized for parallel encoding via process pools.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware (NVIDIA) | CUDA-capable GPU | Any modern NVIDIA GPU with CUDA support |
| Hardware (Huawei) | Ascend NPU | Requires `torch_npu` package |
| Hardware (Moore Threads) | MUSA GPU | Requires `torch_musa` package |
| Hardware (Apple) | Apple Silicon (M1+) | MPS backend via PyTorch (the MPS backend requires PyTorch 1.12 or later) |
| VRAM | Varies by model | Encoder models: 2-8 GB; decoder models: 8-24 GB or more |
Dependencies
System Packages
- NVIDIA CUDA toolkit (for CUDA backend)
- NVIDIA cuDNN (for CUDA backend)
Python Packages
- `torch` >= 1.6.0 (core requirement)
- `torch_npu` (optional, for Huawei NPU support)
- `torch_musa` (optional, for Moore Threads MUSA support)
Credentials
No credentials required for GPU acceleration.
Quick Install
# For NVIDIA CUDA (most common)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For Huawei NPU (optional)
pip install torch_npu
# For Moore Threads MUSA (optional)
# Follow the Moore Threads installation guide for torch_musa
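After installation, it can be useful to confirm which of the packages listed under Dependencies are importable. The following is a minimal sketch using only the standard library; the helper name `installed_backend_packages` is hypothetical. It checks only that a package can be found, not that the matching hardware or driver is present.

```python
import importlib.util
from typing import Dict, Tuple

# Package names taken from the dependency list above.
BACKEND_PACKAGES: Tuple[str, ...] = ("torch", "torch_npu", "torch_musa")


def installed_backend_packages() -> Dict[str, bool]:
    """Report which backend packages can be located, without importing them."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in BACKEND_PACKAGES
    }


if __name__ == "__main__":
    for name, present in installed_backend_packages().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```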
Code Evidence
Multi-backend device detection from `FlagEmbedding/abc/inference/AbsEmbedder.py:109-122`:
if devices is None:
    if torch.cuda.is_available():
        return [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    elif is_torch_npu_available():
        return [f"npu:{i}" for i in range(torch.npu.device_count())]
    elif hasattr(torch, "musa") and torch.musa.is_available():
        return [f"musa:{i}" for i in range(torch.musa.device_count())]
    elif torch.backends.mps.is_available():
        try:
            return [f"mps:{i}" for i in range(torch.mps.device_count())]
        except:
            return ["mps"]
    else:
        return ["cpu"]
Optional MUSA import from `FlagEmbedding/abc/inference/AbsEmbedder.py:16-19`:
try:
    import torch_musa
except Exception:
    pass
NPU availability import from `FlagEmbedding/abc/inference/AbsEmbedder.py:14`:
from transformers import is_torch_npu_available
FP16 disabled on CPU from `FlagEmbedding/inference/reranker/encoder_only/base.py:113`:
if device == "cpu":
    self.use_fp16 = False
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `torch.cuda.OutOfMemoryError` | Insufficient GPU VRAM for batch size | Reduce `batch_size`; the framework auto-reduces by 25% on OOM |
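| `RuntimeError: CUDA error` | Generic CUDA failure; commonly a mismatch between the NVIDIA driver, the CUDA toolkit, and the PyTorch build | Check that the CUDA version of the PyTorch build (`torch.version.cuda`) is compatible with the installed driver and toolkit |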
| FP16 produces `NaN` on CPU | Half-precision arithmetic is poorly supported on most CPUs | Keep `use_fp16=False` on CPU; the framework disables FP16 automatically when the device is `cpu` |
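The 25% batch-size reduction on out-of-memory errors can be sketched as a retry loop. This is an illustrative sketch under stated assumptions, not the framework's actual code: `encode_with_backoff`, `encode_batch`, and `SimulatedOOM` are hypothetical names, and `SimulatedOOM` stands in for `torch.cuda.OutOfMemoryError` so the example runs without a GPU.

```python
from typing import Callable, List, Sequence


class SimulatedOOM(RuntimeError):
    """Stand-in for an out-of-memory error so the sketch runs without a GPU."""


def encode_with_backoff(
    items: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[int]],
    batch_size: int,
) -> List[int]:
    """Encode `items` in batches, shrinking the batch size by 25% after each OOM."""
    results: List[int] = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(encode_batch(batch))
        except SimulatedOOM:
            if batch_size == 1:
                raise  # a single item does not fit; nothing left to shrink
            batch_size = max(1, batch_size * 3 // 4)  # keep 75% of the size
            continue  # retry from the same position with the smaller batch
        start += len(batch)
    return results


def fake_encoder(batch: Sequence[str]) -> List[int]:
    """Hypothetical encoder that 'runs out of memory' above 3 items."""
    if len(batch) > 3:
        raise SimulatedOOM("batch too large")
    return [len(text) for text in batch]


# Batch size shrinks 8 -> 6 -> 4 -> 3 before the first batch succeeds.
print(encode_with_backoff(["a", "bb", "ccc", "dddd", "eeeee"], fake_encoder, 8))
# [1, 2, 3, 4, 5]
```

Setting a smaller `batch_size` up front avoids the failed attempts entirely, which is why reducing it manually remains the first recommendation in the table.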
Compatibility Notes
- NVIDIA CUDA: Primary and most tested backend. All features supported including multi-GPU and FP16/BF16.
- Huawei NPU: Supported via `torch_npu`. Uses `is_torch_npu_available()` from transformers.
- Moore Threads MUSA: Supported via `torch_musa`. Detected through `hasattr(torch, "musa")`.
- Apple MPS: Supported for single-device inference. Device enumeration calls `torch.mps.device_count()` and falls back to a single `mps` device if that call fails.
- CPU: Always available as fallback. FP16 is automatically disabled on CPU devices.
- Multi-GPU: When multiple GPUs are detected, inference automatically uses process pools for parallel encoding across all devices.
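To illustrate how work might be split across the detected devices and reassembled in input order, here is a minimal, single-process sketch. The chunking strategy used by the framework's process pools is not shown on this page, so the round-robin assignment below is an assumption, and `shard_round_robin` and `merge_in_order` are hypothetical helper names. Device names are assumed to be unique.

```python
from typing import Dict, List, Sequence


def shard_round_robin(items: Sequence[str], devices: Sequence[str]) -> Dict[str, List[int]]:
    """Assign item indices to devices in round-robin order."""
    if not devices:
        raise ValueError("at least one device is required")
    shards: Dict[str, List[int]] = {device: [] for device in devices}
    for idx in range(len(items)):
        shards[devices[idx % len(devices)]].append(idx)
    return shards


def merge_in_order(
    shards: Dict[str, List[int]],
    outputs: Dict[str, List[float]],
    total: int,
) -> List[float]:
    """Place each device's outputs back at the original input positions."""
    merged: List[float] = [0.0] * total
    for device, indices in shards.items():
        for idx, value in zip(indices, outputs[device]):
            merged[idx] = value
    return merged


sentences = ["a", "b", "c", "d", "e"]
shards = shard_round_robin(sentences, ["cuda:0", "cuda:1"])
print(shards)  # {'cuda:0': [0, 2, 4], 'cuda:1': [1, 3]}

# Pretend each device returned one score per assigned sentence.
outputs = {device: [float(i) * 10 for i in idxs] for device, idxs in shards.items()}
print(merge_in_order(shards, outputs, len(sentences)))  # [0.0, 10.0, 20.0, 30.0, 40.0]
```

Tracking the original indices alongside each shard is what lets the merged result match the input order regardless of which device finishes first.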