Principle:Ggml org Ggml CANN NPU Computation

Field	Value
sources	GGML CANN Documentation
domains	Hardware_Acceleration, NPU
last_updated	2026-02-10

Overview

CANN NPU Computation is the principle of offloading tensor operations to Huawei Ascend Neural Processing Units (NPUs) via the Compute Architecture for Neural Networks (CANN) software stack and the Ascend Computing Language (ACL) runtime.

Description

Huawei's Ascend series of AI processors are purpose-built neural processing units that include dedicated matrix computation units (Cube Units) and vector computation units. The CANN (Compute Architecture for Neural Networks) software stack provides the programming interface to these processors, analogous to how CUDA provides access to NVIDIA GPUs.

The CANN stack is organized in layers:

ACL (Ascend Computing Language, also written AscendCL) -- The low-level C API for device management, memory allocation, stream (queue) management, and kernel dispatch
ACLNN Operators -- Pre-built, highly optimized operator implementations for common neural network operations (matrix multiply, convolution, element-wise ops, etc.)
ACL Runtime (the aclrt* functions within ACL) -- Manages device contexts, memory transfers between host and device, and synchronization

GGML's CANN backend maps GGML tensor operations onto ACLNN operator calls. The key architectural concepts are:

Device and Context Management -- Each GGML backend instance binds to a specific Ascend device and maintains an ACL context and stream
Memory Management -- Device buffers are allocated via aclrtMalloc and managed through GGML's buffer type abstraction; host-device transfers use aclrtMemcpy
Operator Dispatch -- Each supported GGML operation (e.g., GGML_OP_MUL_MAT) is mapped to one or more ACLNN operator calls (e.g., aclnnMatmul) that execute asynchronously on the NPU
Synchronization -- Operations are enqueued on ACL streams and synchronized at graph computation boundaries
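
The operator dispatch idea can be pictured as a lookup table from GGML op to ACLNN operator. The following is a minimal Python sketch of that idea, not the backend's actual code: only the GGML_OP_MUL_MAT -> aclnnMatmul pairing comes from this page, the other rows are assumed pairings for illustration, and the function names lookup_operator and supports_op are hypothetical.

```python
# Toy dispatch table: GGML op name -> ACLNN operator name.
# Only the MUL_MAT row is taken from this page; the other rows are
# illustrative assumptions, not a copy of the backend's real mapping.
OP_TABLE = {
    "GGML_OP_MUL_MAT": "aclnnMatmul",
    "GGML_OP_ADD": "aclnnAdd",           # assumed pairing
    "GGML_OP_RMS_NORM": "aclnnRmsNorm",  # assumed pairing
}


def lookup_operator(ggml_op):
    """Return the ACLNN operator name for a GGML op, or None if unmapped."""
    return OP_TABLE.get(ggml_op)


def supports_op(ggml_op):
    """True when the op can be offloaded under this toy table."""
    return lookup_operator(ggml_op) is not None
```

An op with no entry returns None, which is the case where a real backend would report the op as unsupported and leave it to another backend.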

Usage

Apply the CANN NPU computation principle when:

The target hardware is a Huawei Ascend NPU (e.g., Ascend 310, 910 series)
The CANN toolkit and ACL runtime are installed on the system
The workload is compute-bound with operations that have efficient ACLNN implementations (matrix multiply, normalization, activation functions)
Low-power, high-throughput inference is required in datacenter or edge deployments using Ascend hardware

This approach is not suitable when:

The target hardware does not include an Ascend NPU
Operations in the model are not supported by the ACLNN operator library (in which case they fall back to CPU)
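
To see why unsupported operations matter, it can help to look at how a graph splits into per-backend segments. The sketch below is a simplified Python model under stated assumptions: the supported set NPU_SUPPORTED and the helper names assign_backend and split_graph are hypothetical, and the real scheduler uses richer criteria than op type alone.

```python
from itertools import groupby

# Hypothetical set of ops the NPU backend accepts; everything else
# is assigned to the CPU in this toy model.
NPU_SUPPORTED = {"GGML_OP_MUL_MAT", "GGML_OP_ADD"}


def assign_backend(op):
    """Pick a backend name for a single graph node."""
    return "cann" if op in NPU_SUPPORTED else "cpu"


def split_graph(ops):
    """Group consecutive graph nodes that run on the same backend."""
    return [(backend, list(run)) for backend, run in groupby(ops, key=assign_backend)]
```

For example, split_graph(["GGML_OP_MUL_MAT", "GGML_OP_CUSTOM", "GGML_OP_MUL_MAT"]) yields three segments. Each boundary between a "cann" segment and a "cpu" segment implies a host-device transfer and a synchronization point, so a model with many unsupported ops may lose much of the benefit of offloading.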

Theoretical Basis

The execution model for CANN NPU computation follows a host-driven, asynchronous dispatch pattern:

 Initialization:
 1. Enumerate available Ascend devices via ACL runtime
 2. Select the device (aclrtSetDevice), which implicitly creates a default context, and create a computation stream (aclrtCreateStream)
 3. Allocate device memory buffers for model weights and activations

 Graph Execution:
 For each node in the GGML computation graph:
   1. Map operation -- Translate the GGML op type to an ACLNN operator
      Example: GGML_OP_MUL_MAT -> aclnnMatmul

   2. Prepare tensors -- Create an aclTensor (aclCreateTensor) for each input/output tensor,
      specifying shape, strides, data type, and memory format

   3. Get workspace size -- Query the operator for required temporary workspace
      aclnnXxxGetWorkspaceSize(inputs, outputs, &workspace_size, &executor)

   4. Allocate workspace -- Allocate temporary device memory if needed

   5. Dispatch operator -- Execute the operator asynchronously on the stream
      aclnnXxx(workspace, workspace_size, executor, stream)

   6. Enqueue next operation -- Operations on the same stream execute in order

 Synchronization:
 After all graph nodes are dispatched:
   aclrtSynchronizeStream(stream)
   -- Blocks host until all enqueued operations complete

 Memory Transfer:
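 Host-to-device: aclrtMemcpy(dst_dev, dst_capacity, src_host, size, ACL_MEMCPY_HOST_TO_DEVICE)
 Device-to-host: aclrtMemcpy(dst_host, dst_capacity, src_dev, size, ACL_MEMCPY_DEVICE_TO_HOST)
 (dst_capacity is the size of the destination buffer; size must not exceed it)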
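
The dispatch and synchronization steps above can be modeled without any Ascend hardware. The following Python sketch is a toy stand-in, not the ACL API: ToyStream, ToyScaleOp, and run_graph are hypothetical names, and the "operator" just multiplies a list by a factor. It illustrates three points from the steps: the two-phase call (query workspace size, then launch), in-order execution on one stream, and a single blocking synchronize at the end of the graph.

```python
from collections import deque


class ToyStream:
    """In-order work queue standing in for an ACL stream (not the real API)."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, fn, *args):
        # Returns immediately; nothing runs until synchronize().
        self._pending.append((fn, args))

    def synchronize(self):
        # Drain the queue in FIFO order, like a blocking stream sync.
        while self._pending:
            fn, args = self._pending.popleft()
            fn(*args)


class ToyScaleOp:
    """Hypothetical operator that follows the two-phase calling pattern."""

    @staticmethod
    def get_workspace_size(src, factor):
        # Phase 1: record inputs, report scratch size, return an "executor".
        executor = {"src": src, "factor": factor}
        return len(src), executor

    @staticmethod
    def launch(workspace, executor, out, stream):
        # Phase 2: enqueue the actual work on the stream.
        def kernel():
            for i, x in enumerate(executor["src"]):
                workspace[i] = x * executor["factor"]
            out[:] = workspace

        stream.enqueue(kernel)


def run_graph(values, factors):
    """Dispatch one scale op per factor, then synchronize once at the end."""
    stream = ToyStream()
    buffers = [list(values)]
    for f in factors:
        src, dst = buffers[-1], [0] * len(values)
        ws_size, executor = ToyScaleOp.get_workspace_size(src, f)
        workspace = [0] * ws_size
        ToyScaleOp.launch(workspace, executor, dst, stream)
        buffers.append(dst)
    before_sync = list(buffers[-1])  # output is not ready before the sync
    stream.synchronize()
    return before_sync, buffers[-1]
```

In this model the second op reads the first op's output buffer, which is only correct because the stream runs queued work in submission order. Reading the result before synchronize() returns stale data, which mirrors why the host must wait on the stream before copying results back.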

The Ascend NPU's Cube Unit is specifically designed for matrix multiplication, achieving high throughput for both float16 and int8 operations. The ACLNN operators internally select tiling strategies and data layouts suited to the specific Ascend chip variant, so the caller does not choose them.
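
Tiling here means splitting a large matrix multiply into small blocks that fit the hardware's compute unit and on-chip memory. The sketch below shows generic blocked matrix multiplication in plain Python as an illustration of the concept only; the function name matmul_tiled and the tile size are assumptions, and the actual tiling that ACLNN operators choose is internal to CANN and not described on this page.

```python
def matmul_tiled(a, b, tile=2):
    """Blocked matrix multiply over plain nested lists: C = A @ B."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Accumulate one tile-sized partial product into C.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = 0
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] += acc
    return c
```

The result does not depend on the tile size; only the order of the partial sums changes. On real hardware the tile shape is picked to match the matrix unit and memory hierarchy, which is the decision the operator library makes on the caller's behalf.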

Related Pages

Implementation:Ggml_org_Ggml_Cann_backend
Ggml_org_Ggml_Cann_backend -- The backend implementation that applies this principle
Ggml_org_Ggml_CPU_Compute_Engine -- CPU fallback for operations not supported on CANN
