Principle:Ggml org Ggml Hexagon DSP Computation

Field	Value
sources	GGML, Qualcomm Hexagon DSP SDK
domains	DSP, Mobile, Qualcomm
last_updated	2026-02-10

Overview

Hexagon DSP Computation is the principle of offloading tensor operations to Qualcomm's Hexagon Digital Signal Processor (DSP), where they execute with HVX (Hexagon Vector eXtensions) vector instructions. On mobile devices, the host and DSP communicate over FastRPC.

Description

Qualcomm's Hexagon DSP is a programmable digital signal processor embedded in Snapdragon mobile SoCs. Unlike the GPU (Adreno) or CPU (Kryo), the Hexagon DSP is designed for sustained, energy-efficient vector processing. Its HVX unit provides 128-byte (1024-bit) vector registers, which suit high-throughput quantized integer arithmetic.

The GGML Hexagon backend offloads tensor operations to the Hexagon Tensor Processor (HTP), a neural-network-focused subsystem built on top of the Hexagon DSP. The architecture involves several layers:

Host-DSP Communication (FastRPC)

The Hexagon DSP runs as a separate processor with its own address space. Communication between the application processor (AP, running Linux/Android) and the DSP uses Qualcomm's FastRPC mechanism:

dspqueue -- A message queue abstraction for sending operation requests from the host to the DSP
rpcmem -- Shared memory allocation between host and DSP (avoids copying data across the bus)
Skeleton/stub -- Auto-generated RPC interface code that marshals function calls across the AP-DSP boundary
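
The division of labour between the queue and shared memory can be illustrated with a host-only mock. This is a sketch of the idea, not the FastRPC API: an ordinary vector stands in for an rpcmem allocation, a std::deque stands in for the dspqueue, and the names SharedRegion, OpRequest, and process_one are hypothetical. The point it shows is that a request carries only offsets and sizes, while the payload stays in the shared region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Stand-in for an rpcmem allocation: one block of memory that both sides
// address by offset. On a device it would be mapped for both host and DSP;
// here it is an ordinary vector (assumption for this sketch).
struct SharedRegion {
    std::vector<int32_t> words;
    explicit SharedRegion(std::size_t n) : words(n, 0) {}
};

// Hypothetical request message: small, fixed-size, and free of payload data.
struct OpRequest {
    std::size_t src_offset;  // where the input starts in the shared region
    std::size_t dst_offset;  // where the result should be written
    std::size_t count;       // number of input elements
};

// Stand-in for the DSP side: pops one request and sums the input range
// in place. Returns false when the queue is empty.
inline bool process_one(std::deque<OpRequest>& queue, SharedRegion& mem) {
    if (queue.empty()) return false;
    const OpRequest req = queue.front();
    queue.pop_front();
    assert(req.src_offset + req.count <= mem.words.size());
    assert(req.dst_offset < mem.words.size());
    int32_t sum = 0;
    for (std::size_t i = 0; i < req.count; ++i) {
        sum += mem.words[req.src_offset + i];
    }
    mem.words[req.dst_offset] = sum;  // result lands in the shared region
    return true;
}
```

In this mock the host would fill mem.words, push an OpRequest, and later read the result from mem.words[dst_offset]; no tensor data ever travels through the queue itself.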

HTP (Hexagon Tensor Processor) Driver

The GGML backend implements an HTP driver layer (htp-drv.cpp) that manages:

Device initialization -- Opening the DSP device, setting power/performance modes, and configuring HVX thread count
Operation dispatch -- Sending tensor operation descriptors to the DSP for execution
Memory management -- Allocating rpcmem-backed buffers accessible from both the host and DSP

HVX Vector Processing

On the DSP side, tensor operations execute using HVX vector instructions. HVX provides:

1024-bit vector registers -- Process 128 bytes (e.g., 128 int8 values) per instruction
Vector multiply-accumulate -- Efficient quantized dot products
Vector permute and shuffle -- Data rearrangement for quantization/dequantization
Hardware loops -- Zero-overhead loop execution (a feature of the Hexagon core) that keeps vector pipelines busy
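
To make the 128-byte vector width concrete, the following portable C++ sketch emulates a quantized dot product processed one 128-byte chunk at a time. It uses no HVX intrinsics; the function name dot_q8_chunked and the one-accumulator-per-byte-lane layout are assumptions chosen for readability, and real HVX multiply-accumulate instructions arrange their accumulators differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kHvxBytes = 128;  // one 1024-bit HVX vector

// Portable emulation of a quantized dot product processed in 128-byte chunks.
// Each chunk corresponds to one vector register's worth of int8 values.
// Products are accumulated in int32 to avoid int8 overflow, then reduced
// to a single scalar at the end.
inline int32_t dot_q8_chunked(const std::vector<int8_t>& w,
                              const std::vector<int8_t>& x) {
    assert(w.size() == x.size());
    assert(w.size() % kHvxBytes == 0);  // sketch: no tail handling

    int32_t acc[kHvxBytes] = {0};  // one accumulator per lane
    for (std::size_t base = 0; base < w.size(); base += kHvxBytes) {
        // On the DSP this inner loop would be a handful of vector instructions.
        for (std::size_t lane = 0; lane < kHvxBytes; ++lane) {
            acc[lane] += int32_t(w[base + lane]) * int32_t(x[base + lane]);
        }
    }

    int32_t result = 0;  // horizontal reduction across lanes
    for (std::size_t lane = 0; lane < kHvxBytes; ++lane) {
        result += acc[lane];
    }
    return result;
}
```

The outer loop runs K / 128 times for a length-K dot product, which is where the throughput advantage over scalar code comes from.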

Operation Pipeline

The backend supports a pipelined execution model with configurable stages:

QUEUE -- Enqueue operations to the DSP command queue
QUANTIZE -- Perform on-DSP quantization of input data
COMPUTE -- Execute the tensor operation using HVX

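Each stage is enabled by a bit in the operation mask (opt_opmask), so individual stages can be switched off, for example when profiling. Because queueing is asynchronous, the host can enqueue further operations while the DSP is still working on earlier ones; this overlap of host and DSP work improves throughput.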
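
A mask such as opt_opmask is naturally represented as a bit field with one bit per stage. The sketch below shows that representation; the enumerator names, their numeric values, and the helper functions are assumptions for illustration and are not taken from the backend source.

```cpp
#include <cstdint>

// Hypothetical stage bits mirroring the QUEUE / QUANTIZE / COMPUTE stages.
// The numeric values are assumptions for this sketch.
enum StageMask : uint32_t {
    STAGE_QUEUE    = 1u << 0,
    STAGE_QUANTIZE = 1u << 1,
    STAGE_COMPUTE  = 1u << 2,
};

// All stages enabled: the configuration written as QUEUE | QUANTIZE | COMPUTE.
constexpr uint32_t kAllStages = STAGE_QUEUE | STAGE_QUANTIZE | STAGE_COMPUTE;

// True if the given stage's bit is set in the mask.
constexpr bool stage_enabled(uint32_t opmask, StageMask stage) {
    return (opmask & stage) != 0;
}

// Returns a copy of the mask with one stage switched off.
constexpr uint32_t disable_stage(uint32_t opmask, StageMask stage) {
    return opmask & ~static_cast<uint32_t>(stage);
}
```

Clearing a single bit, for instance disable_stage(kAllStages, STAGE_QUANTIZE), leaves the other stages untouched, which is what makes a mask convenient for isolating one stage at a time.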

Usage

Apply Hexagon DSP computation when:

Targeting Qualcomm Snapdragon SoCs with Hexagon DSP/HTP (smartphones, IoT devices, edge compute)
The Hexagon SDK and FastRPC runtime are available
Energy-efficient inference is a priority (the DSP is typically more power-efficient than the GPU for sustained workloads)
Quantized models (INT8, INT4) are in use; these map well to HVX integer arithmetic
The CPU and GPU are busy with other tasks (UI rendering, sensor processing) and the DSP is available

Hexagon DSP is not ideal when:

The target device lacks a Hexagon DSP
Float32 operations dominate (HVX is primarily optimized for integer/fixed-point)
The model requires operations not implemented in the DSP kernel library
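
The reason quantized models suit integer vector hardware can be shown with a small example. The sketch below uses symmetric per-block int8 quantization (scale = max|x| / 127), which is one common scheme and not necessarily the format the backend uses; QuantizedBlock, quantize_int8, and dot_quantized are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QuantizedBlock {
    float scale;            // real value is approximately scale * q[i]
    std::vector<int8_t> q;  // quantized values in [-127, 127]
};

// Symmetric per-block int8 quantization: scale = max|x| / 127.
inline QuantizedBlock quantize_int8(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));

    QuantizedBlock out;
    out.scale = amax / 127.0f;
    out.q.reserve(x.size());
    for (float v : x) {
        // An all-zero block has scale 0; map every element to 0 in that case.
        const float scaled = (out.scale > 0.0f) ? v / out.scale : 0.0f;
        out.q.push_back(static_cast<int8_t>(std::lround(scaled)));
    }
    return out;
}

// Dot product of two quantized blocks: the inner loop is pure integer
// multiply-accumulate, followed by a single floating-point rescale.
inline float dot_quantized(const QuantizedBlock& a, const QuantizedBlock& b) {
    assert(a.q.size() == b.q.size());
    int32_t acc = 0;
    for (std::size_t i = 0; i < a.q.size(); ++i) {
        acc += int32_t(a.q[i]) * int32_t(b.q[i]);
    }
    return a.scale * b.scale * static_cast<float>(acc);
}
```

Only the final rescale touches floating point, so almost all of the work is integer multiply-accumulate, the kind of arithmetic the text above describes HVX as optimized for.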

Theoretical Basis

The Hexagon DSP computation model for GGML:

 Initialization:
 1. Open DSP device:
    Initialize FastRPC connection to the Hexagon DSP
    Configure HTP power mode and performance settings

 2. Allocate shared memory:
    For each model buffer:
      ptr = rpcmem_alloc(heap_id, flags, size)
      -- Allocates memory visible to both AP and DSP
      -- No explicit copy needed for data transfer

 3. Configure DSP resources:
    Set number of HVX threads (opt_nhvx)
    Detect Hexagon architecture version (opt_arch, autodetect)
    Configure operation pipeline mask (QUEUE | QUANTIZE | COMPUTE)

 Graph Execution:
 For each node in the computation graph:

   1. Create operation descriptor:
      op_desc = {
        op_type: GGML op enum,
        src tensors: pointers to rpcmem buffers,
        dst tensor: pointer to rpcmem output buffer,
        parameters: dimensions, strides, quantization type
      }

   2. Enqueue to DSP:
      dspqueue_write(queue, op_desc)   -- simplified; the SDK call takes additional arguments
      -- Asynchronous: returns once the request is queued, and the DSP processes it later

   3. DSP execution (on Hexagon):
      a. Receive operation from queue
      b. Optionally quantize input data to HVX-friendly format
      c. Compute using HVX:
         For quantized dot product (conceptual):
           // Load 128 bytes (128 x int8) per HVX vector register
           v_weights = vmem(weight_ptr)    // 128 int8 weights
           v_acts    = vmem(act_ptr)       // 128 int8 activations
           v_acc     = vmpyacc(v_acc, v_weights, v_acts)  // vector MAC
           // ... continue for all K elements
           result    = vreduce(v_acc)      // horizontal reduction
      d. Write result to output rpcmem buffer

 Synchronization:
 If synchronous mode (opt_opsync):
   Wait for DSP to complete each operation before enqueuing next
 Else:
   Pipeline multiple operations and synchronize at graph boundaries
   dspqueue_sync(queue)   -- conceptual: block until every queued operation has reported completion

 Memory Flow:
 Host writes to rpcmem buffer -> DSP reads same physical memory (no copy)
 DSP writes result to rpcmem buffer -> Host reads same physical memory
 No explicit memcpy needed (the buffer is shared; any cache maintenance is handled through the FastRPC/rpcmem layer)
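
The difference between the two synchronization modes can be modelled with a few lines of host-side bookkeeping. This is a simulation under stated assumptions, not backend code: MockSession is a hypothetical class, and completion is simulated by clearing the pending list where a real session would block on DSP responses.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical host-side bookkeeping for the two synchronization modes.
// Completion is simulated: drain() retires every queued op at once.
class MockSession {
public:
    explicit MockSession(bool op_sync) : op_sync_(op_sync) {}

    void enqueue(int op_id) {
        pending_.push_back(op_id);
        if (pending_.size() > max_in_flight_) max_in_flight_ = pending_.size();
        if (op_sync_) drain();  // synchronous mode: wait after every op
    }

    // Graph boundary: wait until nothing is pending.
    void flush() { drain(); }

    std::size_t pending() const { return pending_.size(); }
    std::size_t max_in_flight() const { return max_in_flight_; }
    std::size_t waits() const { return waits_; }

private:
    void drain() {
        if (pending_.empty()) return;
        ++waits_;          // one blocking wait on the host
        pending_.clear();  // stand-in for reading all completion responses
    }

    bool op_sync_;
    std::deque<int> pending_;
    std::size_t max_in_flight_ = 0;
    std::size_t waits_ = 0;
};
```

For a four-node graph, the synchronous session blocks four times with at most one operation in flight, while the pipelined session blocks once at flush() with four in flight. Fewer host waits and a deeper queue are what allow host and DSP work to overlap.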

Related Pages

Implementation:Ggml_org_Ggml_Hexagon_backend
Ggml_org_Ggml_Hexagon_backend -- The backend implementation that applies this principle
Ggml_org_Ggml_OpenCL_GPU_Computation -- Alternative mobile acceleration using Adreno GPU via OpenCL
Ggml_org_Ggml_CPU_Compute_Engine -- CPU fallback for unsupported operations
