
Principle:Mit han lab Llm awq Fused Normalization

From Leeroopedia
Revision as of 17:39, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Mit_han_lab_Llm_awq_Fused_Normalization.md)

Overview

Kernel replacement technique that swaps PyTorch RMSNorm modules for a fused, CUDA-optimized implementation to reduce inference latency.

Description

A standard PyTorch RMSNorm runs as several GPU kernel launches (variance, normalization, scaling), each of which reads and writes intermediate tensors in GPU memory. The fused CUDA implementation, exposed through the awq_inference_engine extension, performs all of these operations in a single kernel call, reducing kernel launch overhead and memory round-trips. This matters most during autoregressive decoding, where normalization runs in every layer for every generated token.
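To give a rough sense of scale, the following sketch counts normalization kernel launches over a decoding run. All numbers are illustrative assumptions rather than measurements: 32 decoder layers, two RMSNorm calls per layer plus one final norm, 256 generated tokens, and three kernels per unfused norm (matching the variance, normalization, and scaling steps listed above; an eager PyTorch implementation may launch more).

```python
def norm_kernel_launches(tokens, layers, norms_per_layer, final_norms, kernels_per_norm):
    """Count normalization kernel launches over a whole decoding run."""
    norms_per_token = layers * norms_per_layer + final_norms
    return tokens * norms_per_token * kernels_per_norm


# Hypothetical configuration (assumed for illustration, not taken from the source).
config = dict(tokens=256, layers=32, norms_per_layer=2, final_norms=1)

unfused = norm_kernel_launches(**config, kernels_per_norm=3)  # variance, normalize, scale
fused = norm_kernel_launches(**config, kernels_per_norm=1)    # one fused kernel per norm

print(f"unfused launches: {unfused}")
print(f"fused launches:   {fused}")
print(f"launches saved:   {unfused - fused}")
```

Under these assumptions, fusion removes two of every three normalization launches. The actual latency gain depends on per-launch overhead and memory bandwidth on the target GPU.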

Usage

Applied to TinyChat models, alongside fused attention, as a model-preparation step before running inference.
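The replacement step can be pictured as a recursive walk over the model's module tree. The sketch below is only an illustration of that pattern in plain Python: `Module`, `RMSNorm`, `FusedRMSNorm`, and `replace_norms` are hypothetical stand-ins, not the TinyChat or PyTorch API, and no CUDA kernel is actually called.

```python
class Module:
    """Minimal stand-in for a framework module: holds named child modules."""

    def __init__(self, **children):
        self.children = dict(children)


class RMSNorm(Module):
    """Placeholder for the stock (unfused) RMSNorm layer."""

    def __init__(self, weight, eps):
        super().__init__()
        self.weight, self.eps = weight, eps


class FusedRMSNorm(Module):
    """Placeholder for a wrapper that would dispatch to a fused CUDA kernel."""

    def __init__(self, weight, eps):
        super().__init__()
        self.weight, self.eps = weight, eps


def replace_norms(module):
    """Swap every RMSNorm in the tree for a FusedRMSNorm that reuses its parameters.

    Returns the number of modules replaced.
    """
    replaced = 0
    for name, child in list(module.children.items()):
        if isinstance(child, RMSNorm):
            # Reuse the existing weight and epsilon so outputs stay equivalent.
            module.children[name] = FusedRMSNorm(child.weight, child.eps)
            replaced += 1
        else:
            replaced += replace_norms(child)
    return replaced


def make_layer():
    """Hypothetical decoder layer with two norms around attention and MLP blocks."""
    return Module(
        input_norm=RMSNorm([1.0] * 4, 1e-6),
        attn=Module(),
        post_attn_norm=RMSNorm([1.0] * 4, 1e-6),
        mlp=Module(),
    )


model = Module(layer0=make_layer(), layer1=make_layer(), final_norm=RMSNorm([1.0] * 4, 1e-6))
count = replace_norms(model)
print(f"replaced {count} norm modules")
```

Because the walk mutates the tree in place and carries over the learned weight and epsilon, it has to run after the weights are loaded and before the first forward pass, which is why it is a preparation step.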

Theoretical Basis

RMSNorm:

y = x / sqrt(mean(x^2) + epsilon) * gamma

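Here the mean is taken over the hidden dimension of x, epsilon is a small constant for numerical stability, and gamma is the learned per-channel scale. The whole computation is fused into a single CUDA kernel launch, eliminating intermediate memory reads and writes.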
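As a reference for the formula, here is a minimal pure-Python sketch that computes RMSNorm two ways: as three separate passes with a materialized intermediate (mirroring the unfused steps), and as a single pass over the input. The function names and the default `eps=1e-6` are assumptions for illustration. This models only the arithmetic, not the CUDA kernel.

```python
import math


def rms_norm_staged(x, gamma, eps=1e-6):
    """Three passes, mirroring the unfused variance / normalize / scale steps."""
    mean_sq = sum(v * v for v in x) / len(x)            # pass 1: mean(x^2)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normalized = [v * inv_rms for v in x]               # pass 2: intermediate buffer
    return [n * g for n, g in zip(normalized, gamma)]   # pass 3: scale by gamma


def rms_norm_fused(x, gamma, eps=1e-6):
    """Same arithmetic without materializing the normalized intermediate."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


x = [1.0, -2.0, 3.0, -4.0]       # mean(x^2) = (1 + 4 + 9 + 16) / 4 = 7.5
gamma = [1.0, 1.0, 0.5, 2.0]

y_staged = rms_norm_staged(x, gamma)
y_fused = rms_norm_fused(x, gamma)

# Both variants implement y = x / sqrt(mean(x^2) + epsilon) * gamma.
print(all(math.isclose(a, b) for a, b in zip(y_staged, y_fused)))
```

The two functions return the same values. What fusion changes is how often the data travels through memory, not the result.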

Related Pages

Knowledge Sources

Domains

  • Inference
  • Optimization
