Implementation:Ggml org Ggml Hexagon hvx arith

From Leeroopedia


Implementation Metadata
File Name src/ggml-hexagon/htp/hvx-arith.h
Repository ggml-org/ggml
Lines 457
Language C
Domain Tags DSP_Computing, SIMD_Intrinsics, Arithmetic
Status Active
Last Updated 2025-05-15 12:00 GMT
Knowledge Sources ggml-org/ggml repository

Overview

hvx-arith.h provides HVX-intrinsic implementations of element-wise binary arithmetic operations (add, subtract, multiply) on FP32 vectors, with alignment-aware variants. It is the core arithmetic building block used by binary-ops.c and other operation files. The alignment-aware design lets callers take the faster aligned load/store path whenever the buffers allow it and fall back to unaligned access otherwise.

Description

The file defines a generic hvx_arith_loop_body macro that processes the input one HVX vector at a time in a loop marked #pragma unroll(4), then handles any leftover elements that do not fill a full vector. Architecture-conditional macros select between two instruction paths:

  • HVX arch < 79 -- Computes in the qf32 (qfloat) format and converts the result back to single-precision float (Q6_Vqf32_vadd_VsfVsf followed by Q6_Vsf_equals_Vqf32)
  • HVX arch >= 79 -- Uses native single-precision FP operations (Q6_Vsf_vadd_VsfVsf)
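The loop structure described above (full vectors first, then leftover elements) can be sketched in portable C. This is not the hvx_arith_loop_body macro itself: the sketch_* names are hypothetical, the vector type is a plain struct standing in for an HVX register, and the 128-byte vector size is an assumption. The real macro uses HVX loads, stores, and #pragma unroll(4), none of which are reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 128-byte vectors, i.e. 32 FP32 lanes per vector. */
#define SKETCH_VEC_BYTES 128u
#define SKETCH_VEC_LANES (SKETCH_VEC_BYTES / 4u)

/* Hypothetical stand-in for an HVX vector register. */
typedef struct { float lane[SKETCH_VEC_LANES]; } sketch_vec;

/* Hypothetical stand-in for HVX_OP_ADD: lane-wise addition. */
static inline sketch_vec sketch_op_add(sketch_vec a, sketch_vec b) {
    sketch_vec r;
    for (uint32_t i = 0; i < SKETCH_VEC_LANES; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}

/* Adds n FP32 elements: whole vectors in the main loop, then a partial tail. */
static inline void sketch_add_f32(uint8_t * restrict dst, const uint8_t * restrict src0,
                                  const uint8_t * restrict src1, uint32_t n) {
    assert(n == 0 || (dst && src0 && src1));

    const uint32_t nvec = n / SKETCH_VEC_LANES; /* number of full vectors    */
    const uint32_t nloe = n % SKETCH_VEC_LANES; /* leftover element count    */

    for (uint32_t i = 0; i < nvec; i++) {
        const size_t off = (size_t) i * SKETCH_VEC_BYTES;
        sketch_vec a, b, r;
        memcpy(&a, src0 + off, SKETCH_VEC_BYTES);
        memcpy(&b, src1 + off, SKETCH_VEC_BYTES);
        r = sketch_op_add(a, b);
        memcpy(dst + off, &r, SKETCH_VEC_BYTES);
    }

    if (nloe) {
        /* Read and write only the valid leftover bytes. */
        const size_t off = (size_t) nvec * SKETCH_VEC_BYTES;
        sketch_vec a = {{0}}, b = {{0}}, r;
        memcpy(&a, src0 + off, nloe * sizeof(float));
        memcpy(&b, src1 + off, nloe * sizeof(float));
        r = sketch_op_add(a, b);
        memcpy(dst + off, &r, nloe * sizeof(float));
    }
}
```

Calling sketch_add_f32 with n = 35 would run one full-vector iteration and then a three-element tail, leaving memory past the 35th element untouched.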

For each binary operation (ADD, SUB, MUL), four alignment variants are provided:

  • _aa -- Sources and destination all aligned (fastest)
  • _au -- Destination aligned, src1 unaligned
  • _ua -- Destination unaligned, sources aligned
  • _uu -- All buffers treated as unaligned (slowest, but always correct)

A generic dispatcher (e.g., hvx_add_f32) selects the appropriate variant based on runtime alignment checks.
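A runtime alignment check of this kind might look like the sketch below. The selection rule is an assumption inferred from the variant descriptions on this page, not a copy of hvx_add_f32; the sketch_* names and the 128-byte alignment requirement are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: buffers count as "aligned" when on a 128-byte boundary. */
#define SKETCH_VEC_BYTES 128u

typedef enum { SKETCH_AA, SKETCH_AU, SKETCH_UA, SKETCH_UU } sketch_variant;

/* Returns nonzero when p sits on a vector-size boundary. */
static inline int sketch_is_aligned(const void * p) {
    return ((uintptr_t) p % SKETCH_VEC_BYTES) == 0;
}

/* Picks the most specialized variant whose alignment preconditions hold.
 * Assumed rule: _au tolerates only an unaligned src1, _ua tolerates only
 * an unaligned dst, and anything else falls back to _uu. */
static inline sketch_variant sketch_select_variant(const void * dst,
                                                   const void * src0,
                                                   const void * src1) {
    const int d  = sketch_is_aligned(dst);
    const int s0 = sketch_is_aligned(src0);
    const int s1 = sketch_is_aligned(src1);

    assert(dst && src0 && src1);

    if (d && s0 && s1) return SKETCH_AA;
    if (d && s0)       return SKETCH_AU;
    if (s0 && s1)      return SKETCH_UA;
    return SKETCH_UU;
}
```

Falling back to the fully unaligned variant for every unlisted combination keeps the dispatcher correct for any input, at the cost of the slowest path.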

Usage

Include this header from operation files that need element-wise FP32 arithmetic, such as binary-ops.c.

Code Reference

Source Location

Repository File Lines
ggml-org/ggml src/ggml-hexagon/htp/hvx-arith.h 457

Key Signatures

// Architecture-conditional operation macros
#if __HVX_ARCH__ < 79
#define HVX_OP_ADD(a, b) Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(a, b))
#define HVX_OP_SUB(a, b) Q6_Vsf_equals_Vqf32(Q6_Vqf32_vsub_VsfVsf(a, b))
#define HVX_OP_MUL(a, b) Q6_Vsf_equals_Vqf32(Q6_Vqf32_vmpy_VsfVsf(a, b))
#else
#define HVX_OP_ADD(a, b) Q6_Vsf_vadd_VsfVsf(a, b)
#define HVX_OP_SUB(a, b) Q6_Vsf_vsub_VsfVsf(a, b)
#define HVX_OP_MUL(a, b) Q6_Vsf_vmpy_VsfVsf(a, b)
#endif

// Alignment variants for ADD
static inline void hvx_add_f32_aa(uint8_t * restrict dst, const uint8_t * restrict src0,
    const uint8_t * restrict src1, uint32_t n);
static inline void hvx_add_f32_au(uint8_t * restrict dst, const uint8_t * restrict src0,
    const uint8_t * restrict src1, uint32_t n);
static inline void hvx_add_f32_ua(...);
static inline void hvx_add_f32_uu(...);

// Generic dispatcher with runtime alignment detection
static inline void hvx_add_f32(uint8_t * dst, const uint8_t * src0, const uint8_t * src1, uint32_t n);

I/O Contract

Inputs

  • dst -- Destination buffer for result
  • src0, src1 -- Source operand buffers (FP32 elements)
  • n -- Number of FP32 elements to process

Outputs

  • dst -- Element-wise result (add, sub, or mul) written to destination buffer
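The contract above can be illustrated with a scalar reference routine, which may also be handy for checking vectorized output on a host machine. This is a hypothetical helper and not part of hvx-arith.h: ref_op and ref_binary_f32 are names introduced here. It treats the uint8_t buffers as packed FP32 elements and takes n as an element count, not a byte count.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operation selector for the reference routine. */
typedef enum { REF_ADD, REF_SUB, REF_MUL } ref_op;

/* Scalar reference: dst[i] = src0[i] (op) src1[i] for i in [0, n).
 * memcpy is used for element access so the buffers need no alignment. */
static inline void ref_binary_f32(ref_op op, uint8_t * dst, const uint8_t * src0,
                                  const uint8_t * src1, uint32_t n) {
    assert(op == REF_ADD || op == REF_SUB || op == REF_MUL);

    for (uint32_t i = 0; i < n; i++) {
        const size_t off = (size_t) i * sizeof(float);
        float a, b, r;
        memcpy(&a, src0 + off, sizeof(float));
        memcpy(&b, src1 + off, sizeof(float));
        switch (op) {
            case REF_ADD: r = a + b; break;
            case REF_SUB: r = a - b; break;
            case REF_MUL: r = a * b; break;
            default:      r = 0.0f;  break; /* unreachable after the assert */
        }
        memcpy(dst + off, &r, sizeof(float));
    }
}
```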

Usage Examples

binary-ops.c picks a function from a table based on alignment and then calls it:

// Select optimized path based on alignment
hvx_elemwise_f32_func func = is_aligned ? func_table_HVX_opt[op] : func_table_HVX[op];

// Execute element-wise operation
func(dst_ptr, src0_ptr, src1_ptr, num_elements);

