Implementation:Ggml org Ggml Hexagon flash attn

**Implementation Metadata**
File Name	`src/ggml-hexagon/htp/flash-attn-ops.c`
Repository	ggml-org/ggml
Lines	684
Language	C
Domain Tags	ML_Infrastructure, DSP_Computing, Attention_Mechanism
Status	Active
Last Updated	2025-05-15 12:00 GMT
Knowledge Sources	ggml-org/ggml repository

Overview

`flash-attn-ops.c` is the DSP-side implementation of flash attention (extended) for the Hexagon HVX vector processor. It computes scaled dot-product attention with online softmax. Attention is typically among the most compute-intensive operations in transformer inference, and this file provides the core attention kernel with mixed-precision (FP32/FP16) support.

Description

The file implements FP32-to-FP16 conversion via HVX intrinsics (hvx_load_f32_to_f16), which uses Q6_Vqf32_vsub_VsfVsf for intermediate qfloat conversion followed by Q6_Vh_vdeal_Vh for FP16 packing.

The mixed-precision dot product (hvx_dot_f32_f16_aa) takes an FP32 operand (the query) and an FP16 operand (a key or value row), multiplies in half precision, and accumulates in single precision. It processes full HVX vectors (64 FP16 elements per 128-byte vector) in unrolled loops and handles leftover elements with vector predicate masking (Q6_Q_vsetq_R).
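The block structure described above (full 64-element vectors, then a leftover tail) can be illustrated with a portable scalar sketch. This is not the HVX code: the names ref_fp16_to_fp32, ref_dot_f32_f16 and REF_FP16_PER_VEC are hypothetical, the multiply is done in FP32 rather than half precision, and treating `s` as a scale applied to the final sum is an assumption based on the signature only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 128-byte HVX vector / 2 bytes per FP16 element. */
#define REF_FP16_PER_VEC 64u

/* Hypothetical helper: decode IEEE 754 binary16 bits into a float. */
static float ref_fp16_to_fp32(uint16_t h) {
    const uint32_t sign = ((uint32_t) h & 0x8000u) << 16;
    uint32_t exp = ((uint32_t) h >> 10) & 0x1Fu;
    uint32_t man = (uint32_t) h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                 /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {              /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {              /* signed zero */
        bits = sign;
    } else {                            /* subnormal: normalize the mantissa */
        exp = 113u;                     /* biased FP32 exponent of 2^-14 */
        while ((man & 0x400u) == 0) {
            man <<= 1;
            exp--;
        }
        bits = sign | (exp << 23) | ((man & 0x3FFu) << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Scalar stand-in: *r = s * sum(y[i] * x[i]), with y in FP32 and x in FP16.
 * Assumption: s scales the accumulated sum. */
static void ref_dot_f32_f16(float *r, const float *y, const uint16_t *x,
                            unsigned int n, float s) {
    assert(r != NULL && y != NULL && x != NULL);

    float acc = 0.0f;
    unsigned int i = 0;
    const unsigned int n_full = n - (n % REF_FP16_PER_VEC);

    /* Full blocks: one iteration of the outer loop per HVX-sized vector. */
    for (; i < n_full; i += REF_FP16_PER_VEC) {
        for (unsigned int j = 0; j < REF_FP16_PER_VEC; j++) {
            acc += y[i + j] * ref_fp16_to_fp32(x[i + j]);
        }
    }

    /* Leftover elements: the HVX code masks a partial vector instead. */
    for (; i < n; i++) {
        acc += y[i] * ref_fp16_to_fp32(x[i]);
    }

    *r = acc * s;
}
```

A scalar reference like this is mainly useful for checking a vectorized kernel on head sizes that are not a multiple of 64, where the tail path is exercised.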

Multi-row dot product variants (hvx_dot_f32_f16_aa_rx2) process two result rows simultaneously for better HVX throughput. Scratchpad memory is used for intermediate results, and work is parallelized across HVX threads by distributing attention heads.
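One common way to distribute heads across threads is a contiguous ceil-divided split, sketched below. The type ref_range and the function ref_thread_range are hypothetical, and whether the file uses this exact partitioning (or splits at a finer granularity than whole heads) is an assumption.

```c
#include <assert.h>

/* Half-open range [begin, end) of head indices owned by one thread. */
typedef struct {
    unsigned int begin;
    unsigned int end;
} ref_range;

/* Hypothetical helper: contiguous split of n_heads over n_threads.
 * Threads past the end of the work receive an empty range. */
static ref_range ref_thread_range(unsigned int n_heads,
                                  unsigned int n_threads,
                                  unsigned int ith) {
    assert(n_threads > 0 && ith < n_threads);

    /* Ceil division so that every head is covered. */
    const unsigned int per_thread = (n_heads + n_threads - 1) / n_threads;

    unsigned int begin = per_thread * ith;
    unsigned int end   = begin + per_thread;
    if (begin > n_heads) begin = n_heads;
    if (end > n_heads)   end   = n_heads;

    return (ref_range){ begin, end };
}
```

With 10 heads and 4 threads this gives chunks of 3, 3, 3 and 1 heads. Each thread would then loop over its own range and write only to its own scratchpad region, so no synchronization is needed inside the loop.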

Usage

The operation is dispatched from the DSP-side message loop when the host sends a flash attention operation request.

Code Reference

Source Location

Repository	File	Lines
ggml-org/ggml	`src/ggml-hexagon/htp/flash-attn-ops.c`	684

Key Signatures

// FP32-to-FP16 conversion using HVX intrinsics
static inline HVX_Vector hvx_load_f32_to_f16(const HVX_Vector * restrict src, const HVX_Vector zero);

// Mixed-precision dot product (FP32 query x FP16 key/value)
static inline void hvx_dot_f32_f16_aa(float * restrict r, const void * restrict y,
    const void * restrict x, unsigned int n, float s);

// Dual-row dot product variant for higher throughput
static inline void hvx_dot_f32_f16_aa_rx2(float * restrict r, const void * restrict y,
    const void * restrict x0, const void * restrict x1, unsigned int n, float s);

I/O Contract

Inputs

Q tensor -- Query tensor (FP32)
K tensor -- Key tensor (FP16)
V tensor -- Value tensor (FP16)
Scale factor -- Attention scaling parameter

Outputs

O tensor -- Attention output with softmax-weighted value aggregation

Usage Examples

Internal flash attention computation:

// The attention kernel performs:
// 1. Q * K^T dot products (mixed FP32/FP16)
// 2. Online softmax normalization
// 3. Softmax(QK^T) * V accumulation
hvx_dot_f32_f16_aa(&score, query_row, key_row, head_dim, scale);
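
The three steps listed above are usually written as a single pass over the keys. The recurrence below is the standard online softmax formulation for one query row q against keys k_j and values v_j; the symbols m, ℓ and a (running maximum, running denominator, running weighted sum) are notation introduced here, and whether the file performs the updates in exactly this order is an assumption.

```latex
\begin{aligned}
s_j    &= \text{scale} \cdot (q \cdot k_j) \\
m_j    &= \max(m_{j-1},\, s_j) \\
\ell_j &= \ell_{j-1}\, e^{m_{j-1} - m_j} + e^{s_j - m_j} \\
a_j    &= a_{j-1}\, e^{m_{j-1} - m_j} + e^{s_j - m_j}\, v_j \\
o      &= a_N / \ell_N
\end{aligned}
\qquad m_0 = -\infty,\quad \ell_0 = 0,\quad a_0 = 0
```

Rescaling by e^{m_{j-1} - m_j} whenever the running maximum changes keeps every exponent non-positive, so the result matches a two-pass softmax without storing the full score row.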

Related Pages

Implements Principle

Principle:Ggml_org_Ggml_Hexagon_DSP_Computation

Related Implementations

Implementation:Ggml_org_Ggml_Hexagon_softmax_ops -- Softmax used in attention
Implementation:Ggml_org_Ggml_Hexagon_matmul_ops -- Matrix multiplication operations
Implementation:Ggml_org_Ggml_Hexagon_htp_main -- Message dispatcher
