Environment:FlagOpen FlagEmbedding GPU Accelerator Environment

From Leeroopedia
Revision as of 18:38, 16 February 2026 by Admin (talk | contribs) (Auto-imported from environments/FlagOpen_FlagEmbedding_GPU_Accelerator_Environment.md)

Knowledge Sources
Domains: Infrastructure, Deep_Learning
Last Updated: 2026-02-09 21:00 GMT

Overview

GPU-accelerated environment supporting NVIDIA CUDA, Huawei NPU, Moore Threads MUSA, and Apple MPS backends for FlagEmbedding inference.

Description

FlagEmbedding implements multi-backend accelerator detection: when no devices are specified, it probes the backends in a fixed priority order and uses the first one that is available. The order is CUDA (NVIDIA) > NPU (Huawei Ascend) > MUSA (Moore Threads) > MPS (Apple Silicon) > CPU. The framework supports multi-GPU inference through process pools and manages FP16/BF16 precision based on device capabilities. CPU fallback is always available, but FP16 is disabled on CPU.
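
The priority order amounts to a first-match scan over an ordered list of backends. A minimal sketch of that idea follows; the `pick_backend` helper and the availability mapping are hypothetical illustrations and are not part of FlagEmbedding.

```python
# Hypothetical helper illustrating first-match selection over a priority list.
PRIORITY = ["cuda", "npu", "musa", "mps", "cpu"]


def pick_backend(available):
    """Return the highest-priority backend flagged as available.

    `available` maps backend names to booleans; "cpu" is always accepted
    as the final fallback, whether or not it appears in the mapping.
    """
    for name in PRIORITY:
        if name == "cpu" or available.get(name, False):
            return name
    return "cpu"  # not reached, since "cpu" is the last entry in PRIORITY


print(pick_backend({"cuda": False, "mps": True}))  # mps
```

Keeping the order in a single list makes the precedence explicit: a machine that exposes both CUDA and MPS would resolve to CUDA.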

Usage

Use this environment for GPU-accelerated inference with embedders or rerankers. It is recommended for production workloads where throughput and latency matter. When multiple GPUs are present, they are detected automatically and used for parallel encoding via process pools.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Hardware (NVIDIA) | CUDA-capable GPU | Any modern NVIDIA GPU with CUDA support |
| Hardware (Huawei) | Ascend NPU | Requires `torch_npu` package |
| Hardware (Moore Threads) | MUSA GPU | Requires `torch_musa` package |
| Hardware (Apple) | Apple Silicon (M1+) | MPS backend via PyTorch |
| VRAM | Varies by model | Encoder models: 2-8GB; Decoder models: 8-24GB+ |
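
One reason VRAM needs vary is that weight memory scales with parameter count and numeric precision. The sketch below is a rough, weight-only estimate; the `weights_gib` helper and the parameter counts are illustrative assumptions, and real usage is higher once activations and framework overhead are included.

```python
# Rough weight-only estimate; activations, caches, and framework overhead are excluded.
def weights_gib(num_params, bytes_per_param):
    """Memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Illustrative parameter counts (assumptions, not measurements of specific models).
examples = [
    ("0.5B-parameter encoder", 500_000_000),
    ("7B-parameter decoder", 7_000_000_000),
]
for label, params in examples:
    fp32 = weights_gib(params, 4)  # 4 bytes per FP32 parameter
    fp16 = weights_gib(params, 2)  # 2 bytes per FP16/BF16 parameter
    print(f"{label}: FP32 ~{fp32:.1f} GiB, FP16 ~{fp16:.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is why FP16 or BF16 is usually preferred on accelerators.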

Dependencies

System Packages

  • NVIDIA CUDA toolkit (for CUDA backend)
  • NVIDIA cuDNN (for CUDA backend)

Python Packages

  • `torch` >= 1.6.0 (core requirement)
  • `torch_npu` (optional, for Huawei NPU support)
  • `torch_musa` (optional, for Moore Threads MUSA support)

Credentials

No credentials required for GPU acceleration.

Quick Install

# For NVIDIA CUDA (most common); the cu118 index serves CUDA 11.8 builds, so pick the index that matches your CUDA version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For Huawei NPU (optional)
pip install torch_npu

# For Moore Threads MUSA (optional)
# Follow Moore Threads installation guide for torch_musa

Code Evidence

Multi-backend device detection from `FlagEmbedding/abc/inference/AbsEmbedder.py:109-122`:

if devices is None:
    if torch.cuda.is_available():
        return [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    elif is_torch_npu_available():
        return [f"npu:{i}" for i in range(torch.npu.device_count())]
    elif hasattr(torch, "musa") and torch.musa.is_available():
        return [f"musa:{i}" for i in range(torch.musa.device_count())]
    elif torch.backends.mps.is_available():
        try:
            return [f"mps:{i}" for i in range(torch.mps.device_count())]
        except:
            return ["mps"]
    else:
        return ["cpu"]
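
To see what this detection would find on a given machine, a small diagnostic can report each backend separately. The sketch below is not part of FlagEmbedding: `probe_backends` is a hypothetical helper, and every import is guarded so it also runs where `torch`, `transformers`, or `torch_musa` is not installed.

```python
# Diagnostic sketch: report which accelerator backends this Python environment can see.
def probe_backends():
    report = {"cuda": False, "npu": False, "musa": False, "mps": False, "cpu": True}
    try:
        import torch
    except Exception:
        return report  # torch is missing: only the CPU entry is meaningful

    report["cuda"] = bool(torch.cuda.is_available())

    try:
        from transformers import is_torch_npu_available
        report["npu"] = bool(is_torch_npu_available())
    except Exception:
        pass  # transformers is missing or the NPU check failed

    try:
        import torch_musa  # noqa: F401  (registers torch.musa when installed)
    except Exception:
        pass
    try:
        report["musa"] = bool(hasattr(torch, "musa") and torch.musa.is_available())
    except Exception:
        pass

    try:
        mps = getattr(torch.backends, "mps", None)  # absent on older torch builds
        report["mps"] = bool(mps is not None and mps.is_available())
    except Exception:
        pass
    return report


if __name__ == "__main__":
    for name, ok in probe_backends().items():
        print(f"{name}: {'available' if ok else 'not available'}")
```

Unlike the priority-ordered detection, this reports every backend, which helps when a lower-priority device is expected but a higher-priority one is selected.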

Optional MUSA import from `FlagEmbedding/abc/inference/AbsEmbedder.py:16-19`:

try:
    import torch_musa
except Exception:
    pass

NPU availability import from `FlagEmbedding/abc/inference/AbsEmbedder.py:14`:

from transformers import is_torch_npu_available

FP16 disabled on CPU from `FlagEmbedding/inference/reranker/encoder_only/base.py:113`:

if device == "cpu":
    self.use_fp16 = False
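
The same rule can be written as a standalone function, which makes the effective precision easy to check before loading a model. This is a minimal sketch; `resolve_use_fp16` is a hypothetical helper that mirrors the check and is not a FlagEmbedding API.

```python
# Hypothetical helper mirroring the CPU check: FP16 is only honored off-CPU.
def resolve_use_fp16(device, requested):
    """Return the effective FP16 setting for a device string such as "cpu" or "cuda:0"."""
    if device == "cpu":
        return False  # FP16 is forced off on CPU regardless of the request
    return bool(requested)


print(resolve_use_fp16("cpu", True))     # False
print(resolve_use_fp16("cuda:0", True))  # True
```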

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `torch.cuda.OutOfMemoryError` | Insufficient GPU VRAM for batch size | Reduce `batch_size`; the framework auto-reduces by 25% on OOM |
| `RuntimeError: CUDA error` | CUDA driver/toolkit mismatch | Verify CUDA toolkit version matches PyTorch build |
| FP16 produces `NaN` on CPU | CPU does not support FP16 natively | Set `use_fp16=False` when running on CPU (auto-detected) |
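
The 25% reduction on OOM can be pictured as a shrink-and-retry loop. The sketch below illustrates the pattern only and is not the framework's code: `encode_with_backoff` and `fake_run` are hypothetical, and `MemoryError` stands in for `torch.cuda.OutOfMemoryError` so the example runs without a GPU.

```python
# Sketch of shrink-and-retry on out-of-memory, using MemoryError as a stand-in.
def encode_with_backoff(run_batch, batch_size):
    """Call run_batch(batch_size), cutting the size by 25% after each OOM."""
    while batch_size >= 1:
        try:
            return run_batch(batch_size), batch_size
        except MemoryError:
            batch_size = batch_size * 3 // 4  # keep 75% of the previous size
    raise MemoryError("batch size reached zero without a successful run")


def fake_run(batch_size, limit=100):
    # Hypothetical workload that only fits when batch_size <= limit.
    if batch_size > limit:
        raise MemoryError
    return f"encoded {batch_size} items"


print(encode_with_backoff(fake_run, 256))  # ('encoded 81 items', 81)
```

Starting from 256, the sizes tried are 256, 192, 144, 108, and 81. Setting a smaller `batch_size` up front avoids the failed attempts.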

Compatibility Notes

  • NVIDIA CUDA: Primary and most tested backend. All features supported including multi-GPU and FP16/BF16.
  • Huawei NPU: Supported via `torch_npu`. Uses `is_torch_npu_available()` from transformers.
  • Moore Threads MUSA: Supported via `torch_musa`. Detected through `hasattr(torch, "musa")` followed by `torch.musa.is_available()`.
  • Apple MPS: Supported for single-device inference. Detection tries `torch.mps.device_count()` and falls back to a single `mps` device if that call fails.
  • CPU: Always available as fallback. FP16 is automatically disabled on CPU devices.
  • Multi-GPU: When multiple GPUs are detected, inference automatically uses process pools for parallel encoding across all devices.
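
Parallel encoding across devices comes down to splitting the inputs, encoding each shard on its own device, and reassembling the results in the original order. The sketch below shows that pattern under simplifying assumptions: threads stand in for per-device worker processes, `fake_encode` is a hypothetical placeholder for a model call, and the contiguous sharding scheme is illustrative rather than the framework's actual chunking.

```python
# Sketch of splitting inputs across devices and restoring the original order.
from concurrent.futures import ThreadPoolExecutor


def shard(items, num_shards):
    """Split items into num_shards contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_shards)
    chunks, start = [], 0
    for i in range(num_shards):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def fake_encode(device, sentences):
    # Placeholder "embedding": the sentence length, tagged with the device used.
    return [(device, len(s)) for s in sentences]


def parallel_encode(sentences, devices):
    chunks = shard(sentences, len(devices))
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        # map() yields results in submission order, so shard order is preserved.
        results = list(pool.map(fake_encode, devices, chunks))
    return [row for part in results for row in part]


out = parallel_encode(["a", "bb", "ccc", "dddd", "eeeee"], ["cuda:0", "cuda:1"])
print(out)
# [('cuda:0', 1), ('cuda:0', 2), ('cuda:0', 3), ('cuda:1', 4), ('cuda:1', 5)]
```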
