Principle:Ggml org Ggml CANN NPU Computation

From Leeroopedia


Field         Value
sources       GGML CANN Documentation
domains       Hardware_Acceleration, NPU
last_updated  2026-02-10

Overview

CANN NPU Computation is the principle of offloading tensor operations to Huawei Ascend Neural Processing Units (NPUs) via the Compute Architecture for Neural Networks (CANN) software stack and the Ascend Computing Language (ACL) runtime.

Description

Huawei's Ascend series of AI processors are purpose-built neural processing units that include dedicated matrix computation units (Cube Units) and vector computation units. The CANN (Compute Architecture for Neural Networks) software stack provides the programming interface to these processors, analogous to how CUDA provides access to NVIDIA GPUs.

The CANN stack is organized in layers:

  • ACL (Ascend Computing Language, also written AscendCL) -- The low-level C API for device management, memory allocation, stream (queue) management, and kernel dispatch
  • ACLNN Operators -- Pre-built, highly optimized operator implementations for common neural network operations (matrix multiply, convolution, element-wise ops, etc.)
  • ACL Runtime (the aclrt* functions of ACL) -- Manages device contexts, memory transfers between host and device, and synchronization

GGML's CANN backend maps GGML tensor operations onto ACLNN operator calls. The key architectural concepts are:

  • Device and Context Management -- Each GGML backend instance binds to a specific Ascend device and maintains an ACL context and stream
  • Memory Management -- Device buffers are allocated via aclrtMalloc and managed through GGML's buffer type abstraction; host-device transfers use aclrtMemcpy
  • Operator Dispatch -- Each supported GGML operation (e.g., GGML_OP_MUL_MAT) is mapped to one or more ACLNN operator calls (e.g., aclnnMatmul) that execute asynchronously on the NPU
  • Synchronization -- Operations are enqueued on ACL streams and synchronized at graph computation boundaries
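The stream behaviour described in the last bullet can be sketched with a small host-side model. This is an illustration only: SketchStream is a hypothetical class, not the ACL aclrtStream type, and it defers all work until synchronize() is called instead of running it concurrently on a device. It shows the two properties the text relies on: tasks on one stream run in the order they were enqueued, and the host can only count on results after synchronizing.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical host-side model of a stream; NOT the ACL aclrtStream type.
class SketchStream {
public:
    // Enqueue returns immediately; the task has not run yet.
    void enqueue(std::function<void()> task) { pending_.push(std::move(task)); }

    // Runs every enqueued task in FIFO order, then returns.
    // This stands in for a blocking sync at a graph computation boundary.
    void synchronize() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

    // Number of tasks enqueued but not yet executed.
    std::size_t pending() const { return pending_.size(); }

private:
    std::queue<std::function<void()>> pending_;
};
```

In this model a graph would enqueue one task per node and call synchronize() once at the end, which is the shape of the dispatch loop described in the text.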

Usage

Apply the CANN NPU computation principle when:

  • The target hardware is a Huawei Ascend NPU (e.g., Ascend 310, 910 series)
  • The CANN toolkit and ACL runtime are installed on the system
  • The workload is compute-bound with operations that have efficient ACLNN implementations (matrix multiply, normalization, activation functions)
  • Low-power, high-throughput inference is required in datacenter or edge deployments using Ascend hardware

This approach is not suitable when:

  • The target hardware does not include an Ascend NPU
  • Key operations in the model are not supported by the ACLNN operator library; unsupported operations fall back to the CPU, and the extra host-device transfers can offset the NPU speedup
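The support check behind that fallback can be sketched as a lookup from op type to operator name. Everything here is a stand-in: SketchOp is a hypothetical enum (the real op enum lives in ggml.h), and apart from aclnnMatmul, which the text names, the operator names are assumptions used for illustration.

```cpp
#include <string>

// Hypothetical stand-in for a handful of GGML op types.
enum class SketchOp { MulMat, Add, Softmax, Custom };

// Returns the ACLNN operator name assumed to handle `op`,
// or an empty string when this sketch has no mapping.
inline std::string aclnn_name_for(SketchOp op) {
    switch (op) {
        case SketchOp::MulMat:  return "aclnnMatmul";   // mapping named in the text
        case SketchOp::Add:     return "aclnnAdd";      // assumed for illustration
        case SketchOp::Softmax: return "aclnnSoftmax";  // assumed for illustration
        case SketchOp::Custom:  return "";              // no NPU operator in this sketch
    }
    return "";
}

// Ops without a mapping are routed to the CPU.
inline std::string backend_for(SketchOp op) {
    return aclnn_name_for(op).empty() ? "CPU" : "CANN";
}
```

A model whose hot path is dominated by ops that resolve to "CPU" here is a poor fit for the backend, since each fallback implies moving tensors between device and host.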

Theoretical Basis

The execution model for CANN NPU computation follows a host-driven, asynchronous dispatch pattern:

 Initialization:
 1. Initialize ACL (aclInit) and enumerate available Ascend devices (aclrtGetDeviceCount)
 2. Bind the device (aclrtSetDevice, which also sets up a default context) and create a computation stream (aclrtCreateStream)
 3. Allocate device memory buffers for model weights and activations
 Graph Execution:
 For each node in the GGML computation graph:
   1. Map operation -- Translate the GGML op type to an ACLNN operator
      Example: GGML_OP_MUL_MAT -> aclnnMatmul
   2. Prepare tensors -- Create an aclTensor (aclCreateTensor) for each input/output tensor,
      specifying shape, strides, data type, memory format, and device address
   3. Get workspace size -- Query the operator for required temporary workspace
      aclnnXxxGetWorkspaceSize(inputs, outputs, &workspace_size, &executor)
   4. Allocate workspace -- Allocate temporary device memory if needed
   5. Dispatch operator -- Execute the operator asynchronously on the stream
      aclnnXxx(workspace, workspace_size, executor, stream)
   6. Enqueue next operation -- Operations on the same stream execute in order
 Synchronization:
 After all graph nodes are dispatched:
   aclrtSynchronizeStream(stream)
   -- Blocks host until all enqueued operations complete
 Memory Transfer:
 Host-to-device: aclrtMemcpy(dst_dev, dst_capacity, src_host, size, ACL_MEMCPY_HOST_TO_DEVICE)
 Device-to-host: aclrtMemcpy(dst_host, dst_capacity, src_dev, size, ACL_MEMCPY_DEVICE_TO_HOST)
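Steps 3 to 5 of the graph execution loop follow a two-phase calling convention: first ask the operator how much scratch memory it needs, then run it with a buffer of that size. The sketch below mimics that convention on the host with a trivial "scale" operator. All names (SketchExecutor, sketchScaleGetWorkspaceSize, sketchScale, run_scale) are hypothetical; real ACLNN calls take aclTensor handles, an aclOpExecutor, and an aclrtStream, and run on the device.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the executor object filled in by phase 1.
struct SketchExecutor { std::size_t n = 0; };

// Phase 1: report the scratch bytes required and record the plan in `exec`.
inline int sketchScaleGetWorkspaceSize(std::size_t n, std::size_t* workspace_size,
                                       SketchExecutor* exec) {
    if (workspace_size == nullptr || exec == nullptr) return 1;
    exec->n = n;
    *workspace_size = n * sizeof(float);
    return 0;
}

// Phase 2: run the operator using the caller-provided workspace.
inline int sketchScale(void* workspace, std::size_t workspace_size,
                       const SketchExecutor& exec,
                       const float* in, float factor, float* out) {
    if (workspace_size < exec.n * sizeof(float)) return 1;  // workspace too small
    if (exec.n > 0 && workspace == nullptr) return 1;
    float* scratch = static_cast<float*>(workspace);
    for (std::size_t i = 0; i < exec.n; ++i) scratch[i] = in[i] * factor;
    for (std::size_t i = 0; i < exec.n; ++i) out[i] = scratch[i];
    return 0;
}

// Caller side: query, allocate exactly what was requested, then dispatch.
inline std::vector<float> run_scale(const std::vector<float>& in, float factor) {
    std::size_t ws_bytes = 0;
    SketchExecutor exec;
    std::vector<float> out(in.size());
    if (sketchScaleGetWorkspaceSize(in.size(), &ws_bytes, &exec) != 0) return {};
    std::vector<float> workspace(ws_bytes / sizeof(float));
    if (sketchScale(workspace.data(), ws_bytes, exec, in.data(), factor, out.data()) != 0)
        return {};
    return out;
}
```

Splitting the call this way lets the caller own the temporary memory, so a backend can reuse one pooled buffer across operators instead of allocating inside every kernel launch.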

The Ascend NPU's Cube Unit is designed specifically for matrix multiplication and delivers high throughput for both float16 and int8 operations. The ACLNN operators internally select tiling strategies and data layouts suited to the specific Ascend chip variant.
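To make the idea of tiling concrete, here is a generic blocked matrix multiply on the host. It is not the tiling used by ACLNN or the Cube Unit, whose strategies are internal to the operators; tiled_matmul and its tile parameter are hypothetical and only show how a large multiply is broken into small blocks that fit a fixed-size compute unit or fast local memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Computes C (m x n) = A (m x k) * B (k x n), all row-major,
// visiting the operands in tile x tile blocks.
inline std::vector<float> tiled_matmul(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t m, std::size_t k, std::size_t n,
                                       std::size_t tile) {
    std::vector<float> c(m * n, 0.0f);
    if (tile == 0) tile = 1;  // guard against a zero step
    for (std::size_t i0 = 0; i0 < m; i0 += tile) {
        for (std::size_t p0 = 0; p0 < k; p0 += tile) {
            for (std::size_t j0 = 0; j0 < n; j0 += tile) {
                const std::size_t i1 = std::min(i0 + tile, m);
                const std::size_t p1 = std::min(p0 + tile, k);
                const std::size_t j1 = std::min(j0 + tile, n);
                // Multiply one block of A by one block of B and accumulate into C.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t p = p0; p < p1; ++p) {
                        const float av = a[i * k + p];
                        for (std::size_t j = j0; j < j1; ++j) {
                            c[i * n + j] += av * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

The result does not depend on the tile size for these small integer-valued inputs; what changes on real hardware is how well each block matches the compute unit and memory hierarchy, which is why the choice is left to the operator library.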
