Implementation:Turboderp org Exllamav2 Ext Norm
| Knowledge Sources | |
|---|---|
| Domains | Normalization, CUDA, C_Extension |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
C++ extension implementing GPU-accelerated normalization operations: RMS norm, layer norm, and head norm. All operations accept FP16 tensors; RMS norm additionally accepts FP32 inputs and outputs and has a tensor-parallel variant.
Description
ext_norm.cpp provides the pybind11-accessible wrappers around CUDA normalization kernels used throughout the ExLlamaV2 transformer pipeline. The file implements seven functions:
rms_norm(x, w, y, epsilon) applies Root Mean Square normalization. The input tensor (x) and output tensor (y) may each be FP16 or FP32, while the weight tensor (w) must be FP16. The function validates shape compatibility (x.size(1) == w.size(0), and x and y have the same shape) and dispatches to rms_norm_cuda on the appropriate CUDA stream.
rms_norm_tp(x, w, y, epsilon, tp_context) is the tensor-parallel variant that applies RMS norm independently across multiple devices. It iterates over the device list from the ExtTPContext, sets each device, and launches rms_norm_cuda on each device's dedicated stream.
rms_norm_(x, w, epsilon) is the in-place variant that calls rms_norm with the same tensor for both input and output.
layer_norm(x, w, b, y, epsilon) applies standard Layer Normalization with weight w and optional bias b. x, w, and y must be FP16, as must b when a bias is supplied. Passing a meta-device tensor as b means no bias (it is treated as NULL).
layer_norm_(x, w, b, epsilon) is the in-place variant of layer norm.
head_norm(x, w, b, y, epsilon, rms) applies per-head normalization, supporting both RMS and standard layer norm modes via the rms boolean flag. The tensor shapes are validated along the last two dimensions (-2 for num_heads, -1 for head_dim).
head_norm_(x, w, b, epsilon, rms) is the in-place variant of head norm.
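The three normalization modes described above can be sketched in plain Python as a reference for the arithmetic. This is a minimal sketch using the textbook definitions, not the CUDA kernels themselves: the `*_ref` names are hypothetical, nested lists stand in for tensors, `b=None` stands in for the meta-device "no bias" tensor, and applying the bias only in layer-norm mode of `head_norm_ref` is a simplification of this sketch. The kernels' exact reduction order and precision may differ.

```python
import math

def rms_norm_ref(x, w, eps=1e-6):
    """Row-wise RMS norm: y[i][j] = x[i][j] / sqrt(mean(x[i]^2) + eps) * w[j]."""
    out = []
    for row in x:
        assert len(row) == len(w), "x.size(1) must equal w.size(0)"
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * inv_rms * wj for v, wj in zip(row, w)])
    return out

def layer_norm_ref(x, w, b=None, eps=1e-5):
    """Row-wise layer norm: subtract the mean, divide by the std, scale by w, add b."""
    out = []
    for row in x:
        assert len(row) == len(w), "x.size(1) must equal w.size(0)"
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        inv_std = 1.0 / math.sqrt(var + eps)
        bias = b if b is not None else [0.0] * len(row)  # None means "no bias"
        out.append([(v - mean) * inv_std * wj + bj
                    for v, wj, bj in zip(row, w, bias)])
    return out

def head_norm_ref(x, w, b=None, eps=1e-6, rms=True):
    """Per-head norm: x[row][head] is normalized with that head's weights w[head]."""
    out = []
    for heads in x:
        assert len(heads) == len(w), "x.size(-2) must equal w.size(0)"
        row_out = []
        for h, vec in enumerate(heads):
            if rms:
                row_out.append(rms_norm_ref([vec], w[h], eps)[0])
            else:
                bh = None if b is None else b[h]
                row_out.append(layer_norm_ref([vec], w[h], bh, eps)[0])
        out.append(row_out)
    return out

# One row with two heads of dimension 2, unit weights.
x_heads = [[[3.0, 4.0], [1.0, 3.0]]]
w_heads = [[1.0, 1.0], [1.0, 1.0]]
y_rms = head_norm_ref(x_heads, w_heads, rms=True)
y_ln = head_norm_ref(x_heads, w_heads, rms=False)
```

The `rms` flag only switches which per-vector formula is applied; the per-head indexing along the last two dimensions is the same in both modes.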
Usage
These functions are called by the Python-side transformer layer modules during forward passes. rms_norm is used in attention and MLP pre-normalization for models like LLaMA. layer_norm is used for models with standard LayerNorm (e.g., GPT-2 style). head_norm is used for per-head query/key normalization in architectures that require it.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/exllamav2_ext/ext_norm.cpp
- Lines: 1-212
Signature
```cpp
void rms_norm(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor y,
    float epsilon
);

void rms_norm_tp(
    std::vector<torch::Tensor> x,
    std::vector<torch::Tensor> w,
    std::vector<torch::Tensor> y,
    float epsilon,
    uintptr_t tp_context
);

void rms_norm_(
    torch::Tensor x,
    torch::Tensor w,
    float epsilon
);

void layer_norm(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    torch::Tensor y,
    float epsilon
);

void layer_norm_(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    float epsilon
);

void head_norm(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    torch::Tensor y,
    float epsilon,
    bool rms
);

void head_norm_(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    float epsilon,
    bool rms
);
```
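Every function in this file returns void and writes its result into a caller-supplied output tensor. The calling convention can be illustrated with a plain-Python sketch, assuming nested lists as stand-ins for tensors. The function names mirror the signatures for readability, but the bodies are illustrative only: the `tp_context` argument is omitted, and a simple loop stands in for the per-device streams.

```python
import math

def rms_norm(x, w, y, epsilon):
    """Out-parameter style: the result is written into y and nothing is returned."""
    for i, row in enumerate(x):
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + epsilon)
        # The right-hand side is fully evaluated before the write,
        # so y may safely alias x.
        y[i][:] = [v * inv_rms * wj for v, wj in zip(row, w)]

def rms_norm_(x, w, epsilon):
    """In-place variant: the input doubles as the output."""
    rms_norm(x, w, x, epsilon)

def rms_norm_tp(xs, ws, ys, epsilon):
    """Tensor-parallel shape of the call: one (x, w, y) triple per device."""
    for x, w, y in zip(xs, ws, ys):
        rms_norm(x, w, y, epsilon)

x = [[3.0, 4.0]]
w = [1.0, 1.0]
y = [[0.0, 0.0]]
rms_norm(x, w, y, 0.0)   # y now holds the normalized row; x is unchanged
rms_norm_(x, w, 0.0)     # x is overwritten with the same values
```

The in-place variants save an allocation when the un-normalized input is no longer needed.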
Import
```python
from exllamav2 import exllamav2_ext as ext_c

ext_c.rms_norm(x, w, y, epsilon)
ext_c.layer_norm(x, w, b, y, epsilon)
ext_c.head_norm(x, w, b, y, epsilon, rms)
```
I/O Contract
| Function | Parameter | Type | Direction | Description |
|---|---|---|---|---|
| rms_norm | x | Tensor (FP16 or FP32) | in | Input hidden states, shape (rows, dim) |
| rms_norm | w | Tensor (FP16) | in | Normalization weights, shape (dim,) |
| rms_norm | y | Tensor (FP16 or FP32) | out | Output tensor, same shape as x |
| rms_norm | epsilon | float | in | Small constant for numerical stability (typically 1e-6) |
| layer_norm | x | Tensor (FP16) | in | Input hidden states, shape (rows, dim) |
| layer_norm | w | Tensor (FP16) | in | Normalization weights, shape (dim,) |
| layer_norm | b | Tensor (FP16) or meta | in | Optional bias, shape (dim,); meta tensor means no bias |
| layer_norm | y | Tensor (FP16) | out | Output tensor, same shape as x |
| head_norm | x | Tensor (FP16) | in | Input states, shape (..., num_heads, head_dim), e.g. (batch, num_heads, head_dim); only the last two dimensions are validated against w |
| head_norm | w | Tensor (FP16) | in | Per-head weights, shape (num_heads, head_dim) |
| head_norm | b | Tensor (FP16) or meta | in | Optional per-head bias |
| head_norm | rms | bool | in | If true, use RMS norm; if false, use layer norm |
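The contract in the table can be expressed as pre-flight checks. The sketch below is a hypothetical helper, not part of the extension: `TensorSpec` is a made-up stand-in for tensor metadata, the `check_*` functions are illustrative names, and the checks cover only the constraints stated on this page (dtypes, the meta-device bias convention, and shape agreement).

```python
from dataclasses import dataclass

@dataclass
class TensorSpec:
    """Hypothetical stand-in for a tensor's metadata (not a torch type)."""
    shape: tuple
    dtype: str = "float16"
    device: str = "cuda"

def _require(cond, msg):
    if not cond:
        raise ValueError(msg)

def has_bias(b):
    # A meta-device tensor is the "no bias" marker.
    return b.device != "meta"

def check_rms_norm(x, w, y):
    _require(x.dtype in ("float16", "float32"), "x must be FP16 or FP32")
    _require(y.dtype in ("float16", "float32"), "y must be FP16 or FP32")
    _require(w.dtype == "float16", "w must be FP16")
    _require(len(x.shape) == 2 and len(w.shape) == 1, "expected x (rows, dim), w (dim,)")
    _require(x.shape[1] == w.shape[0], "x.size(1) must equal w.size(0)")
    _require(x.shape == y.shape, "y must have the same shape as x")

def check_layer_norm(x, w, b, y):
    for name, t in (("x", x), ("w", w), ("y", y)):
        _require(t.dtype == "float16", name + " must be FP16")
    if has_bias(b):
        _require(b.dtype == "float16", "b must be FP16 when present")
        _require(b.shape == w.shape, "b must have shape (dim,)")
    _require(x.shape[1] == w.shape[0], "x.size(1) must equal w.size(0)")
    _require(x.shape == y.shape, "y must have the same shape as x")

def check_head_norm(x, w):
    _require(x.dtype == "float16" and w.dtype == "float16", "x and w must be FP16")
    _require(len(x.shape) >= 2 and len(w.shape) == 2, "expected w (num_heads, head_dim)")
    _require(x.shape[-2] == w.shape[0], "num_heads mismatch")
    _require(x.shape[-1] == w.shape[1], "head_dim mismatch")

# FP32 activations with FP16 weights are valid for rms_norm only.
check_rms_norm(TensorSpec((4, 4096), "float32"), TensorSpec((4096,)), TensorSpec((4, 4096), "float32"))
check_layer_norm(TensorSpec((4, 4096)), TensorSpec((4096,)),
                 TensorSpec((0,), "float32", "meta"), TensorSpec((4, 4096)))
check_head_norm(TensorSpec((4, 32, 128)), TensorSpec((32, 128)))
```

Running checks like these on the Python side gives clearer error messages than a failed assertion inside the extension, at the cost of duplicating the validation.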
Usage Examples
```python
import torch
from exllamav2 import exllamav2_ext as ext_c

# RMS Normalization
x = torch.randn(4, 4096, dtype=torch.float16, device="cuda")
w = torch.ones(4096, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
ext_c.rms_norm(x, w, y, 1e-6)

# In-place RMS Normalization (overwrites x)
ext_c.rms_norm_(x, w, 1e-6)

# Layer Normalization with bias
b = torch.zeros(4096, dtype=torch.float16, device="cuda")
ext_c.layer_norm(x, w, b, y, 1e-5)

# Head Normalization (RMS mode)
x_heads = torch.randn(4, 32, 128, dtype=torch.float16, device="cuda")
w_heads = torch.ones(32, 128, dtype=torch.float16, device="cuda")
b_heads = torch.empty(0, device="meta")  # meta tensor means no bias
y_heads = torch.empty_like(x_heads)
ext_c.head_norm(x_heads, w_heads, b_heads, y_heads, 1e-6, True)
```
Related Pages
- Turboderp_org_Exllamav2_Ext_QAttn_H -- Quantized attention that uses normalization in its forward pass
- Turboderp_org_Exllamav2_Ext_QMLP_H -- Quantized MLP that uses normalization in its forward pass
- Turboderp_org_Exllamav2_Ext_TP_H -- Tensor parallelism context used by rms_norm_tp