Principle:Mit han lab Llm awq Fused Normalization
Overview
Kernel replacement technique that substitutes PyTorch RMSNorm with CUDA-optimized implementations for reduced inference latency.
Description
Standard PyTorch RMSNorm involves multiple GPU kernel launches (variance, normalization, scaling). The fused CUDA implementation performs all operations in a single kernel call via the awq_inference_engine extension, reducing kernel launch overhead and memory round-trips. This is particularly impactful during autoregressive decoding where normalization is called once per token per layer.
Usage
Applied to TinyChat models alongside fused attention before running inference.
Theoretical Basis
RMSNorm:
y = x / sqrt(mean(x^2) + epsilon) * gamma
This is fused into a single CUDA kernel launch, eliminating intermediate memory reads and writes.
Related Pages
Knowledge Sources
- Repo|llm-awq|https://github.com/mit-han-lab/llm-awq
Domains
- Inference
- Optimization