Principle:Ggml org Ggml Hexagon DSP Computation

From Leeroopedia


Field         Value
sources       GGML, Qualcomm Hexagon DSP SDK
domains       DSP, Mobile, Qualcomm
last_updated  2026-02-10

Overview

Hexagon DSP Computation is the principle of offloading tensor operations to Qualcomm's Hexagon Digital Signal Processor (DSP), where they run as HVX (Hexagon Vector eXtensions) vector instructions. FastRPC carries requests and shared buffers between the host and the DSP on mobile devices.

Description

Qualcomm's Hexagon DSP is a programmable digital signal processor embedded in Snapdragon mobile SoCs. Unlike the GPU (Adreno) or CPU (Kryo), the Hexagon DSP is designed for sustained, energy-efficient vector processing workloads. Its HVX (Hexagon Vector eXtensions) unit provides 128-byte (1024-bit) vector registers, which makes it well suited to high-throughput quantized integer arithmetic.
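To make the register width concrete, the short sketch below computes how many elements of a given size fit in one 128-byte vector. It is plain arithmetic on the width stated above; the helper name lanes_per_vector is hypothetical and is not part of any Qualcomm or GGML API.

```python
HVX_VECTOR_BYTES = 128  # register width stated in the text (1024 bits)


def lanes_per_vector(element_bytes):
    """Return how many elements of the given size fit in one vector register."""
    if element_bytes <= 0 or HVX_VECTOR_BYTES % element_bytes != 0:
        raise ValueError("element size must evenly divide the vector width")
    return HVX_VECTOR_BYTES // element_bytes


# One register holds 128 int8, 64 int16, or 32 int32 elements.
for name, size in (("int8", 1), ("int16", 2), ("int32", 4)):
    print(name, lanes_per_vector(size))
```

The narrower the element type, the more elements a single vector instruction touches, which is one reason quantized data suits this hardware.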

The GGML Hexagon backend offloads tensor operations to the Hexagon Tensor Processor (HTP), a neural-network-focused subsystem built on top of the Hexagon DSP. The architecture involves several layers:

Host-DSP Communication (FastRPC)

The Hexagon DSP runs as a separate processor with its own address space. Communication between the application processor (AP, running Linux/Android) and the DSP uses Qualcomm's FastRPC mechanism:

  • dspqueue -- A message queue abstraction for sending operation requests from the host to the DSP
  • rpcmem -- Shared memory allocation between host and DSP (avoids copying data across the bus)
  • Skeleton/stub -- Auto-generated RPC interface code that marshals function calls across the AP-DSP boundary
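The division of labor between the queue and the shared memory can be illustrated with a host-only model. The sketch below does not call FastRPC, dspqueue, or rpcmem. It assumes a bytearray standing in for one shared buffer, Python queues standing in for the message channel, and a thread standing in for the DSP. All names are hypothetical.

```python
import queue
import threading


def run_shared_buffer_demo():
    # Stand-in for one shared buffer: inputs in bytes 0..3, outputs in bytes 4..7.
    shared = bytearray([1, 2, 3, 4, 0, 0, 0, 0])
    requests = queue.Queue()   # stand-in for host -> DSP messages
    responses = queue.Queue()  # stand-in for DSP -> host completions

    def dsp_worker():
        view = memoryview(shared)  # same bytes as the host sees, no copy
        while True:
            msg = requests.get()
            if msg is None:        # shutdown sentinel
                return
            src_off, dst_off, count = msg
            for i in range(count):
                view[dst_off + i] = view[src_off + i] * 2
            responses.put(dst_off)

    worker = threading.Thread(target=dsp_worker)
    worker.start()
    # The message carries only offsets and a length, never the payload itself.
    requests.put((0, 4, 4))
    done_off = responses.get()     # wait for the completion message
    requests.put(None)
    worker.join()
    return bytes(shared[done_off:done_off + 4])


print(list(run_shared_buffer_demo()))
```

The point of the model is that only a small descriptor crosses the queue, while the tensor data stays in place in memory that both sides can address.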

HTP (Hexagon Tensor Processor) Driver

The GGML backend implements an HTP driver layer (htp-drv.cpp) that manages:

  • Device initialization -- Opening the DSP device, setting power/performance modes, and configuring HVX thread count
  • Operation dispatch -- Sending tensor operation descriptors to the DSP for execution
  • Memory management -- Allocating rpcmem-backed buffers accessible from both the host and DSP

HVX Vector Processing

On the DSP side, tensor operations execute using HVX vector instructions. HVX provides:

  • 1024-bit vector registers -- Process 128 bytes (e.g., 128 int8 values) per instruction
  • Vector multiply-accumulate -- Efficient quantized dot products
  • Vector permute and shuffle -- Data rearrangement for quantization/dequantization
  • Hardware loops -- Zero-overhead loop execution for vector pipelines
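As a rough illustration of the multiply-accumulate pattern, the sketch below emulates a chunked int8 dot product in plain Python: it walks the inputs 128 elements at a time, keeps one accumulator per lane, and reduces the lanes at the end. It is a simplified model only. How real HVX instructions widen products and group lanes is not modeled, and the function name is hypothetical.

```python
LANES = 128  # int8 elements per 128-byte vector, as stated in the text


def dot_int8_chunked(weights, acts):
    """Dot product of two int8 sequences, processed one 128-lane chunk at a time."""
    if len(weights) != len(acts):
        raise ValueError("inputs must have the same length")
    for v in list(weights) + list(acts):
        if not -128 <= v <= 127:
            raise ValueError("values must fit in int8")

    acc = [0] * LANES  # per-lane accumulators (wider than int8 on real hardware)
    for base in range(0, len(weights), LANES):
        w_chunk = weights[base:base + LANES]  # models one vector load
        a_chunk = acts[base:base + LANES]
        for lane, (w, a) in enumerate(zip(w_chunk, a_chunk)):
            acc[lane] += w * a                # models one vector multiply-accumulate
    return sum(acc)                           # horizontal reduction across lanes


print(dot_int8_chunked([1] * 256, [2] * 256))
```

Keeping per-lane partial sums and reducing once at the end means the expensive cross-lane step happens once per dot product rather than once per chunk.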

Operation Pipeline

The backend supports a pipelined execution model with configurable stages:

  • QUEUE -- Enqueue operations to the DSP command queue
  • QUANTIZE -- Perform on-DSP quantization of input data
  • COMPUTE -- Execute the tensor operation using HVX

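The operation mask (opt_opmask) is a bitmask over these stages, so each stage can be enabled or disabled individually. Because operations are queued asynchronously, the host can enqueue later operations while the DSP is still quantizing or computing earlier ones, which overlaps the stages and improves throughput.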
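A stage mask of this kind can be modeled as a bit field. The sketch below uses Python's enum.IntFlag. The stage names come from the list above, but the bit values (0x1, 0x2, 0x4) and the helper enabled_stages are assumptions made for illustration, not values taken from the backend source.

```python
from enum import IntFlag


class OpMask(IntFlag):
    # Bit values are assumed for illustration.
    QUEUE = 0x1
    QUANTIZE = 0x2
    COMPUTE = 0x4


ALL_STAGES = OpMask.QUEUE | OpMask.QUANTIZE | OpMask.COMPUTE


def enabled_stages(mask):
    """List the names of the stages whose bits are set in the mask."""
    order = (OpMask.QUEUE, OpMask.QUANTIZE, OpMask.COMPUTE)
    return [stage.name for stage in order if mask & stage]


print(enabled_stages(ALL_STAGES))
print(enabled_stages(OpMask.QUEUE | OpMask.COMPUTE))
```

Clearing individual bits is a convenient way to isolate the cost of one stage, for example measuring queueing overhead with the compute stage switched off.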

Usage

Apply Hexagon DSP computation when:

  • Targeting Qualcomm Snapdragon SoCs with Hexagon DSP/HTP (smartphones, IoT devices, edge compute)
  • The Hexagon SDK and FastRPC runtime are available
  • Energy-efficient inference is a priority (the DSP is typically more power-efficient than the GPU for sustained workloads)
  • Quantized models (INT8, INT4) are being used, which map well to HVX integer arithmetic
  • The CPU and GPU are busy with other tasks (UI rendering, sensor processing) and the DSP is available
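To show why quantized models map onto integer arithmetic, here is a minimal sketch of symmetric int8 quantization: each value is divided by a shared scale and rounded to an integer in [-127, 127]. This is a generic textbook scheme, assumed here for illustration. It is not claimed to be the exact format the Hexagon backend uses, and both function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to int8 codes that share one scale (symmetric, absmax-based)."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / 127.0 if amax else 1.0  # avoid dividing by zero for all-zero input
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from int8 codes and their scale."""
    return [c * scale for c in codes]


codes, scale = quantize_int8([0.0, 1.0, -2.0, 0.5])
print(codes, dequantize_int8(codes, scale))
```

Once data is in this form, the inner loops operate on small integers and the scale is applied once per block, which is the kind of work wide integer vector units handle well.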

Hexagon DSP is not ideal when:

  • The target device lacks a Hexagon DSP
  • Float32 operations dominate (HVX is primarily optimized for integer/fixed-point)
  • The model requires operations not implemented in the DSP kernel library

Theoretical Basis

The Hexagon DSP computation model for GGML:

 Initialization:
 1. Open DSP device:
    Initialize FastRPC connection to the Hexagon DSP
    Configure HTP power mode and performance settings
 2. Allocate shared memory:
    For each model buffer:
      ptr = rpcmem_alloc(heap_id, flags, size)
      -- Allocates memory visible to both AP and DSP
      -- No explicit copy needed for data transfer
 3. Configure DSP resources:
    Set number of HVX threads (opt_nhvx)
    Detect Hexagon architecture version (opt_arch, autodetect)
    Configure operation pipeline mask (QUEUE | QUANTIZE | COMPUTE)
 Graph Execution:
 For each node in the computation graph:
   1. Create operation descriptor:
      op_desc = {
        op_type: GGML op enum,
        src tensors: pointers to rpcmem buffers,
        dst tensor: pointer to rpcmem output buffer,
        parameters: dimensions, strides, quantization type
      }
   2. Enqueue to DSP:
      dspqueue_write(queue, op_desc)
      -- Asynchronous: returns once the message is queued; the DSP processes it later
   3. DSP execution (on Hexagon):
      a. Receive operation from queue
      b. Optionally quantize input data to HVX-friendly format
      c. Compute using HVX:
         For quantized dot product (conceptual):
           // Load 128 bytes (128 x int8) per HVX vector register
           v_weights = vmem(weight_ptr)    // 128 int8 weights
           v_acts    = vmem(act_ptr)       // 128 int8 activations
           v_acc     = vmpyacc(v_acc, v_weights, v_acts)  // vector MAC
           // ... continue for all K elements
           result    = vreduce(v_acc)      // horizontal reduction
      d. Write result to output rpcmem buffer
 Synchronization:
 If synchronous mode (opt_opsync):
   Wait for DSP to complete each operation before enqueuing next
 Else:
   Pipeline multiple operations and synchronize at graph boundaries
   dspqueue_sync(queue)   -- conceptual: block until every enqueued operation has completed
 Memory Flow:
 Host writes to rpcmem buffer -> DSP reads same physical memory (cache-coherent)
 DSP writes result to rpcmem buffer -> Host reads same physical memory
 No explicit memcpy needed (FastRPC + ION/rpcmem handles coherency)
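The difference between the synchronous and pipelined modes can be illustrated with a toy cost model. The sketch below assumes each operation has a fixed host-side enqueue cost and a fixed DSP-side compute cost. The function name and the microsecond figures are hypothetical and are not measurements.

```python
def graph_latency(enqueue_us, compute_us, synchronous):
    """Total time for a graph under a simple two-stage (host, DSP) model."""
    host_t = 0  # time at which the host is next free
    dsp_t = 0   # time at which the DSP is next free
    for enqueue, compute in zip(enqueue_us, compute_us):
        host_t += enqueue           # host builds and enqueues the descriptor
        start = max(host_t, dsp_t)  # DSP starts once the message is there and it is idle
        dsp_t = start + compute
        if synchronous:
            host_t = dsp_t          # host waits for this op before enqueuing the next
    return max(host_t, dsp_t)       # graph boundary: wait for the DSP to drain


enqueue = [10, 10, 10]  # hypothetical per-op enqueue cost, microseconds
compute = [50, 50, 50]  # hypothetical per-op DSP compute cost, microseconds
print(graph_latency(enqueue, compute, synchronous=True))
print(graph_latency(enqueue, compute, synchronous=False))
```

In the pipelined case only the first enqueue is exposed; later enqueues are hidden behind DSP compute, so in this model the saving grows with the number of operations in the graph.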
