Principle:Mit han lab Llm awq Fused Attention Optimization

From Leeroopedia

Overview

A kernel-fusion technique that merges the Q, K, and V projections into a single quantized GEMM and uses separate, specialized attention kernels for the prefilling and decoding phases.

Description

Standard transformer attention involves separate Q, K, V linear projections followed by scaled dot-product attention. Fused attention optimization applies three key techniques:

  1. QKV Fusion: Merges three separate WQLinear layers (q_proj, k_proj, v_proj) into a single fused WQLinear that performs one GEMM instead of three
  2. FlashAttention for Prefilling: Uses FlashAttention for processing the full context during the prefilling phase
  3. FasterTransformer Decoding: Uses FasterTransformer-style CUDA kernels for single-token decoding with a pre-allocated KV cache

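Together, these techniques reduce kernel launch overhead and memory traffic: the fused projection launches one GEMM instead of three and reads the input activations once, FlashAttention avoids materializing the full attention score matrix, and the KV cache avoids recomputing keys and values for earlier tokens.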
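The split between the prefilling and decoding phases can be sketched in plain Python. This is a minimal single-head, unbatched illustration and not the TinyChat, FlashAttention, or FasterTransformer code: the names `KVCache`, `prefill`, `decode_step`, and `attend` are hypothetical, and the real kernels run on the GPU over quantized, batched, multi-head tensors.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


class KVCache:
    """Hypothetical fixed-capacity cache; storage is allocated once, up front."""

    def __init__(self, max_len, dim):
        self.k = [[0.0] * dim for _ in range(max_len)]
        self.v = [[0.0] * dim for _ in range(max_len)]
        self.length = 0

    def append(self, k, v):
        if self.length >= len(self.k):
            raise ValueError("KV cache is full")
        self.k[self.length][:] = k  # write in place, no reallocation
        self.v[self.length][:] = v
        self.length += 1


def prefill(cache, qs, ks, vs):
    """Prefilling: store the whole prompt, with causal attention per position."""
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        cache.append(k, v)
        outputs.append(attend(q, cache.k[:cache.length], cache.v[:cache.length]))
    return outputs


def decode_step(cache, q, k, v):
    """Decoding: append one token's K/V, then attend over the cache so far."""
    cache.append(k, v)
    return attend(q, cache.k[:cache.length], cache.v[:cache.length])


cache = KVCache(max_len=8, dim=2)
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy Q = K = V vectors
prefill_out = prefill(cache, prompt, prompt, prompt)
token = [0.5, -0.5]
decode_out = decode_step(cache, token, token, token)
```

In this sketch the prefill loop visits positions one at a time for readability; a FlashAttention-style kernel would process the whole prompt in one launch. The point illustrated is only that decoding reuses the cached keys and values and touches one new token per step.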

Usage

Apply this optimization to TinyChat models before running inference to maximize throughput.

Theoretical Basis

QKV fusion:

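[Q;K;V] = W_qkv @ x,  where W_qkv = [W_q; W_k; W_v]

Stacking the three weight matrices row-wise lets one GEMM replace three separate projections; the output is then split back into Q, K, and V.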
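The equivalence between the fused and unfused projections can be checked numerically. The sketch below uses ordinary floating-point weights in pure Python; it does not model the packed quantized weights of a WQLinear layer, and the helper names (`matvec`, `fuse_qkv`, `split_qkv`) and sizes are hypothetical.

```python
import random


def matvec(W, x):
    """Dense matrix-vector product; stands in for one GEMM call."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fuse_qkv(W_q, W_k, W_v):
    """Stack the three weight matrices row-wise into W_qkv."""
    return W_q + W_k + W_v


def split_qkv(qkv, d_out):
    """Split the fused output back into Q, K, and V (equal sizes assumed)."""
    return qkv[:d_out], qkv[d_out:2 * d_out], qkv[2 * d_out:]


random.seed(0)
d_model, d_out = 8, 4  # illustrative sizes only


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


W_q, W_k, W_v = (rand_matrix(d_out, d_model) for _ in range(3))
x = [random.uniform(-1, 1) for _ in range(d_model)]

# Unfused: three separate projections of the same input.
q, k, v = matvec(W_q, x), matvec(W_k, x), matvec(W_v, x)

# Fused: one projection with the stacked weights, then a split.
W_qkv = fuse_qkv(W_q, W_k, W_v)
q_f, k_f, v_f = split_qkv(matvec(W_qkv, x), d_out)

assert (q_f, k_f, v_f) == (q, k, v)
```

Each output row depends only on its own weight row, so stacking changes how many kernels are launched but not the values produced.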

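  • FlashAttention computes prefill attention in tiles, so its extra memory grows as O(N) in the sequence length N instead of the O(N^2) needed to materialize the full attention score matrix
  • FasterTransformer-style masked MHA decodes one token per step against a pre-allocated KV cache: each step appends a single key/value pair (O(1) new projection work and no reallocation), while the attention itself still scales with the number of cached tokens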
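The memory saving in tiled attention comes from an online softmax: scores are processed one block at a time while a running maximum, a running denominator, and a running weighted sum are kept. The following is a minimal single-query sketch of that idea under stated assumptions, not the FlashAttention kernel itself; the function names and the `block` parameter are hypothetical.

```python
import math


def attention_reference(q, keys, values):
    """Baseline: materializes every score before applying the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


def attention_tiled(q, keys, values, block=2):
    """Online softmax: only one block of scores is held at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this factor is exp(-inf) = 0.0.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.3, -1.2, 0.7]
keys = [[1.0, 0.0, 2.0], [0.5, 0.5, -1.0], [-2.0, 1.0, 0.0],
        [0.0, 3.0, 1.0], [1.5, -0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block=2)
assert all(abs(a - b) < 1e-12 for a, b in zip(ref, tiled))
```

For N queries, the baseline needs an N x N score matrix, whereas the tiled form keeps only one block of scores plus the running statistics per query, which is where the linear memory bound comes from.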

Domains

  • Inference
  • Optimization
