Principle:Mit han lab Llm awq Fused Normalization

Overview

Fused normalization is a kernel replacement technique that swaps a model's PyTorch RMSNorm modules for a CUDA-optimized implementation to reduce inference latency.

Description

A standard PyTorch RMSNorm runs as several separate GPU kernels (variance computation, normalization, scaling), and each step writes its intermediate result to memory. The fused CUDA implementation, exposed through the awq_inference_engine extension, performs all three steps in a single kernel call, which reduces kernel launch overhead and memory round-trips. The saving matters most during autoregressive decoding, where every normalization layer runs again for each generated token.

Usage

Fused normalization is applied to TinyChat models, alongside fused attention, before inference is run.
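
The replacement step can be pictured as a walk over the model's module tree that swaps each RMSNorm for a fused counterpart. The following is a minimal sketch of that pattern, not the repository's code: ReferenceRMSNorm, FusedRMSNormStandIn, and replace_norms are hypothetical names, and the stand-in computes the result in plain PyTorch where a real fused module would call the CUDA extension.

```python
import torch
from torch import nn


class ReferenceRMSNorm(nn.Module):
    """Unfused RMSNorm built from separate PyTorch ops (stand-in for the model's original norm)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * x


class FusedRMSNormStandIn(nn.Module):
    """Hypothetical wrapper marking where a fused CUDA kernel would be called."""

    def __init__(self, weight: nn.Parameter, eps: float):
        super().__init__()
        self.weight = weight  # reuse the original learned weight
        self.variance_epsilon = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumption: a real implementation dispatches to one fused kernel here.
        # This stand-in computes the same values in PyTorch so the sketch runs on CPU.
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.variance_epsilon)
        return x / rms * self.weight


def replace_norms(model: nn.Module) -> int:
    """Swap every ReferenceRMSNorm submodule in place; return how many were replaced."""
    replaced = 0
    # list(...) snapshots the traversal so the tree can be modified safely.
    for name, module in list(model.named_modules()):
        if name and isinstance(module, ReferenceRMSNorm):
            parent_name, _, child_name = name.rpartition(".")
            parent = model.get_submodule(parent_name) if parent_name else model
            fused = FusedRMSNormStandIn(module.weight, module.variance_epsilon)
            setattr(parent, child_name, fused)
            replaced += 1
    return replaced


# Toy model: one linear layer followed by one norm.
model = nn.Sequential(nn.Linear(8, 8), ReferenceRMSNorm(8))
replaced = replace_norms(model)
```

Because the swap reuses the original weight and epsilon, the model's outputs should be unchanged up to floating-point rounding; only the way the normalization is executed differs.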

Theoretical Basis

RMSNorm:

y = x / sqrt(mean(x^2) + epsilon) * gamma

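Here x is the input hidden-state vector, gamma is the learned per-channel weight, epsilon is a small constant for numerical stability, and the mean is taken over the hidden dimension. The whole computation is fused into a single CUDA kernel launch, eliminating intermediate memory reads and writes.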
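
As a worked illustration of the formula, the sketch below computes RMSNorm two ways in dependency-free Python: once as three staged steps that each materialize an intermediate result, and once with normalization and scaling folded together. It only models the arithmetic, not GPU behavior, and the function names are illustrative.

```python
import math


def rmsnorm_staged(x, gamma, eps=1e-6):
    """Three separate steps, mirroring variance, normalization, and scaling."""
    mean_sq = sum(v * v for v in x) / len(x)                 # step 1: mean of squares
    normalized = [v / math.sqrt(mean_sq + eps) for v in x]   # step 2: intermediate list
    return [n * g for n, g in zip(normalized, gamma)]        # step 3: scale by gamma


def rmsnorm_fused_style(x, gamma, eps=1e-6):
    """Same math: one reduction, then normalize and scale together with no intermediate list."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


x = [1.0, 2.0, 3.0, 4.0]          # mean(x^2) = (1 + 4 + 9 + 16) / 4 = 7.5
gamma = [1.0, 1.0, 1.0, 1.0]
staged = rmsnorm_staged(x, gamma)
fused = rmsnorm_fused_style(x, gamma)
```

Both functions return the same values up to rounding. With gamma set to all ones, the output has a mean square very close to 1, which is the defining property of the normalization.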

Related Pages

Implementation:Mit_han_lab_Llm_awq_Make_quant_norm

Knowledge Sources

Repo: llm-awq, https://github.com/mit-han-lab/llm-awq

Domains

Inference
Optimization
