Implementation:Ggml org Ggml Hexagon flash attn

From Leeroopedia


Implementation Metadata
File Name src/ggml-hexagon/htp/flash-attn-ops.c
Repository ggml-org/ggml
Lines 684
Language C
Domain Tags ML_Infrastructure, DSP_Computing, Attention_Mechanism
Status Active
Last Updated 2025-05-15 12:00 GMT
Knowledge Sources ggml-org/ggml repository

Overview

flash-attn-ops.c is the DSP-side implementation of the flash attention (extended) operation for the Hexagon HVX vector processor. It computes scaled dot-product attention with an online softmax. Attention is typically among the most compute-intensive operations in transformer inference, and this file provides the core attention kernel with mixed-precision (FP32/FP16) support.

Description

The file converts FP32 data to FP16 with the HVX helper hvx_load_f32_to_f16. The helper uses Q6_Vqf32_vsub_VsfVsf to bring the FP32 input into the intermediate qfloat format, then Q6_Vh_vdeal_Vh to pack the result as FP16.
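As a point of reference for what the conversion produces per element, here is a minimal scalar sketch in portable C. It is not the HVX code path: the name f32_to_f16_ref is hypothetical, and round-to-nearest-even is an assumption, so the qfloat-based routine may round differently in edge cases.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference: convert one FP32 value to IEEE binary16 bits.
 * Assumes round-to-nearest-even; the HVX routine may behave differently. */
static uint16_t f32_to_f16_ref(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);                 /* reinterpret the float bits */
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t exp  = (x >> 23) & 0xFFu;
    uint32_t mant = x & 0x7FFFFFu;

    if (exp == 0xFFu)                         /* infinity or NaN */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0u));

    int e = (int)exp - 127 + 15;              /* re-bias the exponent for FP16 */
    if (e >= 31)                              /* too large: becomes infinity */
        return (uint16_t)(sign | 0x7C00u);

    if (e <= 0) {                             /* FP16 subnormal or zero */
        if (e < -10)
            return (uint16_t)sign;            /* too small: becomes signed zero */
        mant |= 0x800000u;                    /* restore the implicit leading 1 */
        int shift = 14 - e;                   /* between 14 and 24 */
        uint32_t half    = mant >> shift;
        uint32_t rem     = mant & ((1u << shift) - 1u);
        uint32_t halfway = 1u << (shift - 1);
        if (rem > halfway || (rem == halfway && (half & 1u)))
            half++;
        return (uint16_t)(sign | half);
    }

    uint32_t half = ((uint32_t)e << 10) | (mant >> 13);
    uint32_t rem  = mant & 0x1FFFu;
    /* A carry out of the mantissa correctly bumps the exponent. */
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u)))
        half++;
    return (uint16_t)(sign | half);
}
```

A scalar routine like this is mainly useful for checking vectorized output lane by lane on the host.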

hvx_dot_f32_f16_aa computes a mixed-precision dot product: it loads FP32 queries and FP16 keys or values, multiplies in half precision, and accumulates in single precision. It processes full HVX vectors (64 FP16 elements each) in unrolled loops and handles any leftover elements with a vector predicate mask (Q6_Q_vsetq_R).
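The loop structure of this dot product can be sketched in scalar C: full 64-element blocks first, then the leftover elements. This is an illustrative reference rather than the HVX implementation. The names dot_f32_f16_ref and f16_to_f32_ref are hypothetical, the tail loop stands in for predicate masking, the sketch multiplies in FP32 rather than half precision, and applying the scale s to the final sum is an assumption based on the signature.

```c
#include <stdint.h>
#include <string.h>

#define F16_PER_VECTOR 64u   /* FP16 elements in one 128-byte HVX vector */

/* Hypothetical helper: expand IEEE binary16 bits to FP32 (exact). */
static float f16_to_f32_ref(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    float f;

    if (exp == 0u) {                          /* zero or subnormal */
        f = (float)mant * 0x1p-24f;
        return (h & 0x8000u) ? -f : f;
    }
    if (exp == 31u)                           /* infinity or NaN */
        bits = sign | 0x7F800000u | (mant << 13);
    else                                      /* normal: re-bias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Scalar sketch: r = s * sum(y[i] * x[i]), y in FP32, x in FP16 bits. */
static void dot_f32_f16_ref(float *r, const float *y, const uint16_t *x,
                            unsigned int n, float s) {
    float acc = 0.0f;                         /* FP32 accumulator */
    unsigned int nfull = n - (n % F16_PER_VECTOR);
    unsigned int i = 0;

    /* Full blocks: one block corresponds to one HVX vector of FP16 data. */
    for (; i < nfull; i += F16_PER_VECTOR)
        for (unsigned int j = 0; j < F16_PER_VECTOR; j++)
            acc += y[i + j] * f16_to_f32_ref(x[i + j]);

    /* Leftover elements: the vector code masks these lanes instead. */
    for (; i < n; i++)
        acc += y[i] * f16_to_f32_ref(x[i]);

    *r = acc * s;
}
```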

hvx_dot_f32_f16_aa_rx2 is a multi-row variant that computes two result rows per call for better HVX throughput. Intermediate results are kept in scratchpad memory, and work is parallelized across HVX threads by distributing attention heads among them.
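Two ideas from this paragraph can be illustrated with a short scalar sketch: sharing each loaded query element between two rows, and splitting heads across threads. Everything here is hypothetical. The names dot_rx2_ref, head_range, and heads_for_thread are invented for illustration, the rows are plain FP32 to keep the example short, writing the two results to r[0] and r[1] is an assumption, and the contiguous block split is only one possible way to distribute heads.

```c
/* Scalar sketch of a dual-row dot product: each y[i] is read once and
 * reused for both rows, which is where the throughput gain comes from. */
static void dot_rx2_ref(float *r, const float *y,
                        const float *x0, const float *x1,
                        unsigned int n, float s) {
    float acc0 = 0.0f, acc1 = 0.0f;
    for (unsigned int i = 0; i < n; i++) {
        float yi = y[i];
        acc0 += yi * x0[i];
        acc1 += yi * x1[i];
    }
    r[0] = acc0 * s;   /* assumed output layout: one result per row */
    r[1] = acc1 * s;
}

/* Half-open range [begin, end) of heads assigned to one thread. */
typedef struct {
    unsigned int begin;
    unsigned int end;
} head_range;

/* Contiguous block split of n_heads over n_threads (n_threads > 0).
 * Threads past the end of the work receive an empty range. */
static head_range heads_for_thread(unsigned int n_heads,
                                   unsigned int n_threads,
                                   unsigned int ith) {
    unsigned int per = (n_heads + n_threads - 1u) / n_threads;
    unsigned int begin = per * ith;
    if (begin > n_heads) begin = n_heads;
    unsigned int end = begin + per;
    if (end > n_heads) end = n_heads;
    head_range range = { begin, end };
    return range;
}
```

With 10 heads and 4 threads, this split gives the ranges [0, 3), [3, 6), [6, 9), and [9, 10).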

Usage

The kernel is dispatched from the DSP-side message loop when the host sends a flash attention operation request.

Code Reference

Source Location

Repository ggml-org/ggml
File src/ggml-hexagon/htp/flash-attn-ops.c
Lines 684

Key Signatures

// FP32-to-FP16 conversion using HVX intrinsics
static inline HVX_Vector hvx_load_f32_to_f16(const HVX_Vector * restrict src, const HVX_Vector zero);

// Mixed-precision dot product (FP32 query x FP16 key/value)
static inline void hvx_dot_f32_f16_aa(float * restrict r, const void * restrict y,
    const void * restrict x, unsigned int n, float s);

// Dual-row dot product variant for higher throughput
static inline void hvx_dot_f32_f16_aa_rx2(float * restrict r, const void * restrict y,
    const void * restrict x0, const void * restrict x1, unsigned int n, float s);

I/O Contract

Inputs

  • Q tensor -- Query tensor (FP32)
  • K tensor -- Key tensor (FP16)
  • V tensor -- Value tensor (FP16)
  • Scale factor -- Attention scaling parameter

Outputs

  • O tensor -- Attention output with softmax-weighted value aggregation

Usage Examples

Internal flash attention computation:

// The attention kernel performs:
// 1. Q * K^T dot products (mixed FP32/FP16)
// 2. Online softmax normalization
// 3. Softmax(QK^T) * V accumulation
hvx_dot_f32_f16_aa(&score, query_row, key_row, head_dim, scale);

Related Pages

Implements Principle

Related Implementations
