Principle:Ggml org Ggml Hexagon DSP Computation
| Field | Value |
|---|---|
| sources | GGML, Qualcomm Hexagon DSP SDK |
| domains | DSP, Mobile, Qualcomm |
| last_updated | 2026-02-10 |
Overview
Hexagon DSP Computation is the principle of offloading tensor operations to Qualcomm's Hexagon Digital Signal Processor (DSP), where they run as HVX (Hexagon Vector eXtensions) vector instructions. The host and the DSP communicate over FastRPC, which keeps host-DSP interaction efficient on mobile devices.
Description
Qualcomm's Hexagon DSP is a programmable digital signal processor embedded in Snapdragon mobile SoCs. Unlike the GPU (Adreno) or CPU (Kryo), the Hexagon DSP is designed for sustained, energy-efficient vector processing workloads. Its HVX (Hexagon Vector eXtensions) provide 128-byte (1024-bit) wide vector registers, making it capable of high-throughput quantized integer arithmetic.
The GGML Hexagon backend offloads tensor operations to the Hexagon Tensor Processor (HTP), a neural-network-focused subsystem built on top of the Hexagon DSP. The architecture involves several layers:
Host-DSP Communication (FastRPC)
The Hexagon DSP runs as a separate processor with its own address space. Communication between the application processor (AP, running Linux/Android) and the DSP uses Qualcomm's FastRPC mechanism:
- dspqueue -- A message queue abstraction for sending operation requests from the host to the DSP
- rpcmem -- Shared memory allocation between host and DSP (avoids copying data across the bus)
- Skeleton/stub -- Auto-generated RPC interface code that marshals function calls across the AP-DSP boundary
HTP (Hexagon Tensor Processor) Driver
The GGML backend implements an HTP driver layer (htp-drv.cpp) that manages:
- Device initialization -- Opening the DSP device, setting power/performance modes, and configuring HVX thread count
- Operation dispatch -- Sending tensor operation descriptors to the DSP for execution
- Memory management -- Allocating rpcmem-backed buffers accessible from both the host and DSP
HVX Vector Processing
On the DSP side, tensor operations execute using HVX vector instructions. HVX provides:
- 1024-bit vector registers -- Process 128 bytes (e.g., 128 int8 values) per instruction
- Vector multiply-accumulate -- Efficient quantized dot products
- Vector permute and shuffle -- Data rearrangement for quantization/dequantization
- Hardware loops -- Zero-overhead loop execution for vector pipelines
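To make the 128-lane arithmetic concrete, here is a minimal scalar sketch in plain C of how a vectorized int8 dot product is usually structured: one int32 accumulator per lane, updated one 128-element chunk at a time, followed by a horizontal reduction. It does not use HVX intrinsics; the function name `dot_i8_chunked` and the `LANES` constant are illustrative assumptions, not names from the backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 128  /* one 128-byte vector register holds 128 int8 values */

/* Scalar model of a vectorized int8 dot product (hypothetical helper). */
int32_t dot_i8_chunked(const int8_t *w, const int8_t *a, size_t k)
{
    assert(k == 0 || (w != NULL && a != NULL));

    int32_t acc[LANES] = {0};   /* per-lane accumulators, widened to int32 */
    size_t i = 0;

    /* Full chunks: each pass models one vector multiply-accumulate. */
    for (; i + LANES <= k; i += LANES)
        for (size_t lane = 0; lane < LANES; ++lane)
            acc[lane] += (int32_t)w[i + lane] * (int32_t)a[i + lane];

    /* Tail: fewer than LANES elements remain. */
    for (size_t lane = 0; i < k; ++i, ++lane)
        acc[lane] += (int32_t)w[i] * (int32_t)a[i];

    /* Horizontal reduction: collapse the lanes into one scalar. */
    int32_t sum = 0;
    for (size_t lane = 0; lane < LANES; ++lane)
        sum += acc[lane];
    return sum;
}
```

Widening to int32 before accumulating matters: two int8 values can multiply to more than an int8 or int16 sum can safely hold over a long reduction.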
Operation Pipeline
The backend supports a pipelined execution model with configurable stages:
- QUEUE -- Enqueue operations to the DSP command queue
- QUANTIZE -- Perform on-DSP quantization of input data
- COMPUTE -- Execute the tensor operation using HVX
The operation mask (opt_opmask) selects which of these stages are enabled. Enabled stages can execute concurrently, so queuing, quantization, and compute for different operations overlap, which improves throughput.
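A stage mask such as opt_opmask is commonly represented as a bitmask with one bit per stage. The sketch below assumes hypothetical bit values and names (`OPMASK_*`, `stage_enabled`); the actual constants are defined in the backend source and may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments for the three pipeline stages. */
enum {
    OPMASK_QUEUE    = 1 << 0,
    OPMASK_QUANTIZE = 1 << 1,
    OPMASK_COMPUTE  = 1 << 2,
    OPMASK_ALL      = OPMASK_QUEUE | OPMASK_QUANTIZE | OPMASK_COMPUTE
};

/* Returns true if the given stage bit is set in the mask. */
bool stage_enabled(uint32_t opmask, uint32_t stage)
{
    assert((opmask & ~(uint32_t)OPMASK_ALL) == 0);  /* reject unknown bits */
    return (opmask & stage) != 0;
}
```

With this layout, `OPMASK_QUEUE | OPMASK_COMPUTE` would describe a run that skips on-DSP quantization, which can be handy when isolating the cost of a single stage.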
Usage
Apply Hexagon DSP computation when:
- Targeting Qualcomm Snapdragon SoCs with Hexagon DSP/HTP (smartphones, IoT devices, edge compute)
- The Hexagon SDK and FastRPC runtime are available
- Energy-efficient inference is a priority (the DSP is typically more power-efficient than the GPU for sustained workloads)
- Quantized models (INT8, INT4) are being used, which map well to HVX integer arithmetic
- The CPU and GPU are busy with other tasks (UI rendering, sensor processing) and the DSP is available
Hexagon DSP is not ideal when:
- The target device lacks a Hexagon DSP
- Float32 operations dominate (HVX is primarily optimized for integer/fixed-point)
- The model requires operations not implemented in the DSP kernel library
Theoretical Basis
The Hexagon DSP computation model for GGML:
Initialization:
1. Open DSP device:
     Initialize FastRPC connection to the Hexagon DSP
     Configure HTP power mode and performance settings
2. Allocate shared memory:
     For each model buffer:
       ptr = rpcmem_alloc(heap_id, flags, size)
       -- Allocates memory visible to both AP and DSP
       -- No explicit copy needed for data transfer
3. Configure DSP resources:
     Set number of HVX threads (opt_nhvx)
     Detect Hexagon architecture version (opt_arch, autodetect)
     Configure operation pipeline mask (QUEUE | QUANTIZE | COMPUTE)
Graph Execution:
For each node in the computation graph:
1. Create operation descriptor:
     op_desc = {
       op_type: GGML op enum,
       src tensors: pointers to rpcmem buffers,
       dst tensor: pointer to rpcmem output buffer,
       parameters: dimensions, strides, quantization type
     }
2. Enqueue to DSP:
     dspqueue_write(queue, op_desc)
     -- Non-blocking: returns immediately, DSP processes asynchronously
3. DSP execution (on Hexagon):
     a. Receive operation from queue
     b. Optionally quantize input data to HVX-friendly format
     c. Compute using HVX:
        For quantized dot product (conceptual):
          // Load 128 bytes (128 x int8) per HVX vector register
          v_weights = vmem(weight_ptr)               // 128 int8 weights
          v_acts    = vmem(act_ptr)                  // 128 int8 activations
          v_acc     = vmpyacc(v_acc, v_weights, v_acts)  // vector MAC
          // ... continue for all K elements
          result    = vreduce(v_acc)                 // horizontal reduction
     d. Write result to output rpcmem buffer
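The descriptor-and-queue flow in the steps above can be modeled on a single thread. The following sketch is an assumption-laden stand-in, not the backend's code: `op_desc`, `op_queue`, `queue_write`, and `queue_process_one` are hypothetical names, the queue is a plain ring buffer rather than the real dspqueue API, and the "DSP side" is an ordinary function call.

```c
#include <assert.h>
#include <stddef.h>

enum op_type { OP_ADD, OP_MUL };   /* stand-ins for GGML op enums */

/* Hypothetical descriptor: the op plus pointers into shared buffers. */
struct op_desc {
    enum op_type op;
    const float *src0, *src1;      /* would point into rpcmem buffers */
    float       *dst;
    size_t       n;                /* element count */
};

#define QUEUE_CAP 8                /* power of two, so index wrap stays consistent */

struct op_queue {
    struct op_desc slots[QUEUE_CAP];
    size_t head;                   /* next slot to read  ("DSP" side) */
    size_t tail;                   /* next slot to write (host side)  */
};

/* Host side: copy a descriptor into the queue. Returns 0, or -1 if full. */
int queue_write(struct op_queue *q, const struct op_desc *d)
{
    assert(q != NULL && d != NULL);
    if (q->tail - q->head == QUEUE_CAP)
        return -1;
    q->slots[q->tail % QUEUE_CAP] = *d;
    q->tail++;
    return 0;
}

/* "DSP" side: execute one queued op. Returns 0, or -1 if the queue is empty. */
int queue_process_one(struct op_queue *q)
{
    assert(q != NULL);
    if (q->head == q->tail)
        return -1;
    const struct op_desc *d = &q->slots[q->head % QUEUE_CAP];
    for (size_t i = 0; i < d->n; ++i)
        d->dst[i] = (d->op == OP_ADD) ? d->src0[i] + d->src1[i]
                                      : d->src0[i] * d->src1[i];
    q->head++;
    return 0;
}
```

Note that enqueuing only copies the small descriptor; the tensor data itself stays where it is, and the output buffer is untouched until the consumer side runs.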
Synchronization:
If synchronous mode (opt_opsync):
    Wait for DSP to complete each operation before enqueuing next
Else:
    Pipeline multiple operations and synchronize at graph boundaries
    dspqueue_sync(queue)
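The difference between the two synchronization modes can be illustrated with a toy counter model. This is a sketch under stated assumptions: `dsp_sim` and the `sim_*` functions are hypothetical, completion is instantaneous, and nothing here calls the real queue API. It only counts how often the host blocks and how deep the pipeline gets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: counts in-flight operations and host-side waits. */
struct dsp_sim {
    int pending;      /* ops enqueued but not yet waited on */
    int max_pending;  /* deepest the pipeline ever got */
    int waits;        /* number of blocking waits the host performed */
};

/* Block until everything in flight is done (instant in this model). */
void sim_sync(struct dsp_sim *s)
{
    assert(s != NULL);
    if (s->pending > 0) {
        s->pending = 0;
        s->waits++;
    }
}

/* Enqueue one op; in synchronous mode, wait for it before returning. */
void sim_enqueue(struct dsp_sim *s, bool opsync)
{
    assert(s != NULL);
    s->pending++;
    if (s->pending > s->max_pending)
        s->max_pending = s->pending;
    if (opsync)
        sim_sync(s);
}

/* Run a graph of n_ops nodes, then synchronize at the graph boundary. */
void sim_run_graph(struct dsp_sim *s, int n_ops, bool opsync)
{
    for (int i = 0; i < n_ops; ++i)
        sim_enqueue(s, opsync);
    sim_sync(s);  /* no-op if synchronous mode already drained everything */
}
```

For a five-node graph, the synchronous run blocks five times with at most one operation in flight, while the pipelined run blocks once with five in flight. Synchronous mode trades that overlap for simpler debugging and per-operation timing.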
Memory Flow:
Host writes to rpcmem buffer -> DSP reads same physical memory (cache-coherent)
DSP writes result to rpcmem buffer -> Host reads same physical memory
No explicit memcpy needed (FastRPC + ION/rpcmem handles coherency)
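The zero-copy idea can be shown with an ordinary heap buffer standing in for shared memory. In this sketch, `shared_alloc` is a hypothetical placeholder for rpcmem_alloc (it returns plain process memory, not memory mapped into the DSP), and `dsp_scale` is an ordinary function playing the role of a DSP kernel. What it demonstrates is only the access pattern: both sides use the same pointer, and no staging copy is made.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for rpcmem_alloc(): plain heap memory, NOT real shared memory. */
float *shared_alloc(size_t n)
{
    return (float *)calloc(n, sizeof(float));
}

/* "DSP" kernel: scales the buffer in place through the host's pointer. */
void dsp_scale(float *buf, size_t n, float factor)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; ++i)
        buf[i] *= factor;
}
```

A host would fill the buffer, hand the same pointer to the kernel, and read the results back from that buffer afterwards. On real hardware, cache maintenance and buffer mapping are handled by the FastRPC and rpcmem layers rather than by application code.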
Related Pages
- Implementation:Ggml_org_Ggml_Hexagon_backend
- Ggml_org_Ggml_Hexagon_backend -- The backend implementation that applies this principle
- Ggml_org_Ggml_OpenCL_GPU_Computation -- Alternative mobile acceleration using Adreno GPU via OpenCL
- Ggml_org_Ggml_CPU_Compute_Engine -- CPU fallback for unsupported operations