Environment:FlagOpen FlagEmbedding GPU Accelerator Environment

Knowledge Sources	FlagOpen/FlagEmbedding
Domains	Infrastructure, Deep_Learning
Last Updated	2026-02-09 21:00 GMT

Overview

GPU-accelerated environment supporting NVIDIA CUDA, Huawei NPU, Moore Threads MUSA, and Apple MPS backends for FlagEmbedding inference.

Description

When no devices are specified explicitly, FlagEmbedding detects the accelerator backend automatically and uses the first one available in this priority order: CUDA (NVIDIA) > NPU (Huawei Ascend) > MUSA (Moore Threads) > MPS (Apple Silicon) > CPU. The framework supports multi-GPU inference through process pools and manages FP16/BF16 precision based on device capabilities. CPU fallback is always available, but FP16 mode is disabled on CPU.

Usage

Use this environment for any GPU-accelerated inference with embedders or rerankers. It is recommended for production workloads where throughput and latency matter. Multi-GPU setups are automatically detected and utilized for parallel encoding via process pools.

System Requirements

Category	Requirement	Notes
Hardware (NVIDIA)	CUDA-capable GPU	Any modern NVIDIA GPU with CUDA support
Hardware (Huawei)	Ascend NPU	Requires `torch_npu` package
Hardware (Moore Threads)	MUSA GPU	Requires `torch_musa` package
Hardware (Apple)	Apple Silicon (M1+)	MPS backend via PyTorch
VRAM	Varies by model	Encoder models: 2-8 GB; Decoder models: 8-24 GB+

Dependencies

System Packages

NVIDIA CUDA toolkit (for CUDA backend)
NVIDIA cuDNN (for CUDA backend)

Python Packages

`torch` >= 1.6.0 (core requirement)
`torch_npu` (optional, for Huawei NPU support)
`torch_musa` (optional, for Moore Threads MUSA support)

Credentials

No credentials required for GPU acceleration.

Quick Install

# For NVIDIA CUDA (most common). cu118 selects CUDA 11.8 wheels; use the index URL that matches your CUDA version.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For Huawei NPU (optional)
pip install torch_npu

# For Moore Threads MUSA (optional)
# Follow Moore Threads installation guide for torch_musa
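
After installation, it can help to confirm which of the packages listed under Dependencies are importable in the current interpreter. The following is a minimal sketch using only the standard library. The helper name `installed_accelerator_packages` is hypothetical and not part of FlagEmbedding, and a package being importable does not by itself mean the matching hardware or driver is present.

```python
import importlib.util

# Package names taken from the Dependencies list on this page.
ACCELERATOR_PACKAGES = ("torch", "torch_npu", "torch_musa")


def installed_accelerator_packages(names=ACCELERATOR_PACKAGES):
    """Map each top-level package name to True if it can be located for import."""
    # find_spec locates a top-level module without importing it
    # and returns None when the module cannot be found.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in installed_accelerator_packages().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

Because the check does not import the packages, it stays fast and cannot fail on a broken driver installation; hardware availability still has to be probed at runtime.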

Code Evidence

Multi-backend device detection from `FlagEmbedding/abc/inference/AbsEmbedder.py:109-122`:

if devices is None:
    if torch.cuda.is_available():
        return [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    elif is_torch_npu_available():
        return [f"npu:{i}" for i in range(torch.npu.device_count())]
    elif hasattr(torch, "musa") and torch.musa.is_available():
        return [f"musa:{i}" for i in range(torch.musa.device_count())]
    elif torch.backends.mps.is_available():
        try:
            return [f"mps:{i}" for i in range(torch.mps.device_count())]
        except:
            return ["mps"]
    else:
        return ["cpu"]
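
The priority logic in the excerpt above can be isolated from the hardware probes so that it is easy to reason about and test. The sketch below is illustrative only: `pick_devices` and its `available` argument are hypothetical names, the mapping stands in for the real `is_available()` and `device_count()` calls, and the special case where MPS falls back to a bare `"mps"` string is not modeled.

```python
# Backend names in the priority order used by the excerpt above.
PRIORITY = ("cuda", "npu", "musa", "mps")


def pick_devices(available):
    """Return device strings for the first available backend, else ["cpu"].

    `available` is a hypothetical stand-in for the runtime probes: it maps a
    backend name to its device count (0 or a missing key means unavailable).
    """
    for backend in PRIORITY:
        count = available.get(backend, 0)
        if count > 0:
            # One "backend:index" string per detected device.
            return [f"{backend}:{i}" for i in range(count)]
    # No accelerator was reported, so fall back to CPU.
    return ["cpu"]
```

For example, `pick_devices({"cuda": 2, "mps": 1})` yields the two CUDA devices and ignores MPS, because only the highest-priority backend is used.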

Optional MUSA import from `FlagEmbedding/abc/inference/AbsEmbedder.py:16-19`:

try:
    import torch_musa
except Exception:
    pass

NPU availability import from `FlagEmbedding/abc/inference/AbsEmbedder.py:14`:

from transformers import is_torch_npu_available

FP16 disabled on CPU from `FlagEmbedding/inference/reranker/encoder_only/base.py:113`:

if device == "cpu":
    self.use_fp16 = False
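
The same rule can be written as a small pure function, which makes the behavior explicit: a request for FP16 is honored on any accelerator device and overridden on CPU. This is a sketch of the rule in the excerpt, not FlagEmbedding code; `effective_use_fp16` is a hypothetical name.

```python
def effective_use_fp16(requested, device):
    """Return the FP16 setting that applies on `device`."""
    # Mirrors the excerpt above: FP16 is forced off when the device is "cpu".
    if device == "cpu":
        return False
    return bool(requested)
```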

Common Errors

Error Message	Cause	Solution
`torch.cuda.OutOfMemoryError`	Insufficient GPU VRAM for batch size	Reduce `batch_size`; the framework auto-reduces by 25% on OOM
`RuntimeError: CUDA error`	CUDA driver/toolkit mismatch	Verify CUDA toolkit version matches PyTorch build
FP16 produces `NaN` on CPU	CPU does not support FP16 natively	Set `use_fp16=False` when running on CPU (auto-detected)
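
The 25% reduction mentioned for out-of-memory errors can be pictured as a retry loop that keeps three quarters of the batch after each failure. The following is a minimal sketch under stated assumptions, not the framework's implementation: `run_with_backoff`, `run_batch`, and `min_batch_size` are hypothetical names, and `FakeOutOfMemory` stands in for `torch.cuda.OutOfMemoryError` so that the example runs without a GPU.

```python
class FakeOutOfMemory(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError (assumption for this sketch)."""


def run_with_backoff(run_batch, batch_size, min_batch_size=1):
    """Call run_batch(batch_size), shrinking the batch by 25% after each OOM.

    Returns a (batch_size_used, result) tuple on success.
    """
    while True:
        try:
            return batch_size, run_batch(batch_size)
        except FakeOutOfMemory:
            if batch_size <= min_batch_size:
                raise  # nothing left to shrink, so surface the error
            # Keep three quarters of the batch, but never go below the floor.
            batch_size = max(min_batch_size, batch_size * 3 // 4)
```

Starting from 256 with a workload that only fits 100 items, the loop tries 256, 192, 144, 108 and succeeds at 81. Setting `batch_size` close to the working value up front avoids these wasted attempts.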

Compatibility Notes

NVIDIA CUDA: Primary and most tested backend. All features supported including multi-GPU and FP16/BF16.
Huawei NPU: Supported via `torch_npu`. Uses `is_torch_npu_available()` from transformers.
Moore Threads MUSA: Supported via `torch_musa`. Detected through `hasattr(torch, "musa")`.
Apple MPS: Supported for single-device inference. If `torch.mps.device_count()` is unavailable or raises, detection falls back to a single `mps` device.
CPU: Always available as fallback. FP16 is automatically disabled on CPU devices.
Multi-GPU: When multiple GPUs are detected, inference automatically uses process pools for parallel encoding across all devices.
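
To illustrate what parallel encoding across devices involves, the sketch below splits an input list into contiguous, near-equal shards, one per device. How FlagEmbedding actually chunks work for its process pool is not shown on this page, so the even split here is an assumption, and `shard_for_devices` is a hypothetical helper.

```python
def shard_for_devices(items, devices):
    """Split items into contiguous, near-equal shards keyed by device string."""
    if not devices:
        raise ValueError("at least one device is required")
    # Every device gets `base` items; the first `extra` devices get one more.
    base, extra = divmod(len(items), len(devices))
    shards, start = {}, 0
    for i, device in enumerate(devices):
        size = base + (1 if i < extra else 0)
        shards[device] = items[start:start + size]
        start += size
    return shards
```

Keeping shards contiguous means the per-device outputs can be concatenated in device order to restore the original input order.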
