Implementation:Turboderp org Exllamav2 Ext Norm

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Normalization, CUDA, C_Extension
Last Updated	2026-02-15 00:00 GMT

Overview

C++ extension implementing GPU-accelerated normalization operations: RMS norm, layer norm, and head norm. RMS norm accepts FP16 or FP32 inputs and has a tensor-parallel variant; layer norm and head norm operate on FP16 tensors.

Description

ext_norm.cpp provides the pybind11-accessible wrappers around CUDA normalization kernels used throughout the ExLlamaV2 transformer pipeline. The file implements seven functions:

rms_norm(x, w, y, epsilon) applies Root Mean Square normalization. The input tensor x and the output tensor y may be FP16 or FP32, while the weight tensor w must be FP16. The function checks that the second dimension of x matches the length of w (x.size(1) == w.size(0)) and that x and y have the same shape, then dispatches to rms_norm_cuda on the appropriate CUDA stream.
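As a point of reference, the computation can be sketched in plain Python. This is a minimal sketch of the standard RMS norm definition, not the CUDA kernel: rms_norm_ref is a hypothetical helper, rows are plain lists of floats, and the kernel's accumulation precision and FP16 rounding are not modelled.

```python
import math

def rms_norm_ref(x, w, epsilon=1e-6):
    """Hypothetical reference: RMS-normalize each row of x and scale by w.

    x is a list of rows (each a list of floats); w is a list of floats.
    """
    dim = len(w)
    y = []
    for row in x:
        # Mirrors the documented shape check: row width must match the weight length.
        if len(row) != dim:
            raise ValueError("row length must equal weight length")
        mean_sq = sum(v * v for v in row) / dim
        inv_rms = 1.0 / math.sqrt(mean_sq + epsilon)
        y.append([v * inv_rms * wi for v, wi in zip(row, w)])
    return y

x = [[3.0, 4.0], [1.0, -1.0]]
w = [1.0, 1.0]
y = rms_norm_ref(x, w)
```

With unit weights, each output row has a mean square close to 1, which is a convenient sanity check when comparing against the extension's output.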

rms_norm_tp(x, w, y, epsilon, tp_context) is the tensor-parallel variant that applies RMS norm independently across multiple devices. It iterates over the device list from the ExtTPContext, sets each device, and launches rms_norm_cuda on each device's dedicated stream.
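The per-device loop can be illustrated on the CPU. This is a minimal sketch, assuming each list entry stands in for one device's tensor; rms_norm_tp_ref and _rms_norm_rows are hypothetical names, and the device switching and per-device streams taken from the ExtTPContext are not modelled.

```python
import math

def _rms_norm_rows(x, w, epsilon):
    """Standard RMS norm over each row of x, scaled elementwise by w."""
    out = []
    for row in x:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + epsilon)
        out.append([v * inv_rms * wi for v, wi in zip(row, w)])
    return out

def rms_norm_tp_ref(xs, ws, ys, epsilon):
    """Hypothetical CPU stand-in: normalize each device's tensor independently.

    xs, ws, ys are parallel lists with one entry per simulated device.
    Results are written into ys, matching the out-parameter style of the extension.
    """
    if not (len(xs) == len(ws) == len(ys)):
        raise ValueError("xs, ws and ys need one entry per device")
    for i, (x, w) in enumerate(zip(xs, ws)):
        # In the real extension, this is where the device would be selected
        # and the kernel launched on that device's stream.
        ys[i][:] = _rms_norm_rows(x, w, epsilon)

xs = [[[3.0, 4.0]], [[1.0, 1.0]]]   # two simulated devices, one row each
ws = [[1.0, 1.0], [2.0, 2.0]]       # each device holds its own weights
ys = [[], []]
rms_norm_tp_ref(xs, ws, ys, 0.0)
```

No values are exchanged between list entries, which reflects the description of the norm being applied independently on each device.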

rms_norm_(x, w, epsilon) is the in-place variant that calls rms_norm with the same tensor for both input and output.

layer_norm(x, w, b, y, epsilon) applies standard Layer Normalization with weight w and optional bias b. All tensors must be FP16. The bias can be a meta-device tensor (treated as NULL/no bias).
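The role of the optional bias can be sketched in plain Python. This is a minimal sketch of textbook layer normalization, assuming a biased variance estimate; layer_norm_ref is a hypothetical helper, and b=None stands in for the meta-device tensor that signals "no bias" to the extension.

```python
import math

def layer_norm_ref(x, w, b=None, epsilon=1e-5):
    """Hypothetical reference: standard layer norm per row; b=None means no bias."""
    dim = len(w)
    y = []
    for row in x:
        mean = sum(row) / dim
        var = sum((v - mean) ** 2 for v in row) / dim  # biased variance
        inv_std = 1.0 / math.sqrt(var + epsilon)
        normed = [(v - mean) * inv_std * wi for v, wi in zip(row, w)]
        if b is not None:
            normed = [n + bi for n, bi in zip(normed, b)]
        y.append(normed)
    return y

x = [[1.0, 2.0, 3.0, 4.0]]
w = [1.0, 1.0, 1.0, 1.0]
y_no_bias = layer_norm_ref(x, w)
y_bias = layer_norm_ref(x, w, b=[0.5, 0.5, 0.5, 0.5])
```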

layer_norm_(x, w, b, epsilon) is the in-place variant of layer norm.

head_norm(x, w, b, y, epsilon, rms) applies per-head normalization, supporting both RMS and standard layer norm modes via the rms boolean flag. The tensor shapes are validated along the last two dimensions (-2 for num_heads, -1 for head_dim).
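The effect of the rms flag can be sketched in plain Python. This is a minimal sketch under stated assumptions: head_norm_ref is a hypothetical helper, inputs are nested lists shaped [batch][num_heads][head_dim], and the bias is added in both modes when provided (whether the kernel applies b in RMS mode is not stated here).

```python
import math

def head_norm_ref(x, w, b=None, epsilon=1e-6, rms=True):
    """Hypothetical reference: normalize each head vector over head_dim.

    x: [batch][num_heads][head_dim]; w and b: [num_heads][head_dim]; b=None means no bias.
    """
    y = []
    for item in x:
        if len(item) != len(w):
            raise ValueError("num_heads of x and w must match")
        heads_out = []
        for h, vec in enumerate(item):
            dim = len(vec)
            # RMS mode skips mean-centering; layer-norm mode subtracts the mean.
            mean = 0.0 if rms else sum(vec) / dim
            centered = [v - mean for v in vec]
            inv = 1.0 / math.sqrt(sum(c * c for c in centered) / dim + epsilon)
            out = [c * inv * wi for c, wi in zip(centered, w[h])]
            if b is not None:
                out = [o + bi for o, bi in zip(out, b[h])]
            heads_out.append(out)
        y.append(heads_out)
    return y

x = [[[3.0, 4.0], [1.0, 3.0]]]      # batch=1, num_heads=2, head_dim=2
w = [[1.0, 1.0], [2.0, 2.0]]        # separate weights per head
y_rms = head_norm_ref(x, w, rms=True)
y_ln = head_norm_ref(x, w, rms=False)
```

Each head is normalized on its own and uses its own row of w, so statistics are never shared across heads.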

head_norm_(x, w, b, epsilon, rms) is the in-place variant of head norm.

Usage

These functions are called by the Python-side transformer layer modules during forward passes. rms_norm is used in attention and MLP pre-normalization for models like LLaMA. layer_norm is used for models with standard LayerNorm (e.g., GPT-2 style). head_norm is used for per-head query/key normalization in architectures that require it.

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: exllamav2/exllamav2_ext/ext_norm.cpp
Lines: 1-212

Signature

void rms_norm(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor y,
    float epsilon
);

void rms_norm_tp(
    std::vector<torch::Tensor> x,
    std::vector<torch::Tensor> w,
    std::vector<torch::Tensor> y,
    float epsilon,
    uintptr_t tp_context
);

void rms_norm_(
    torch::Tensor x,
    torch::Tensor w,
    float epsilon
);

void layer_norm(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    torch::Tensor y,
    float epsilon
);

void layer_norm_(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    float epsilon
);

void head_norm(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    torch::Tensor y,
    float epsilon,
    bool rms
);

void head_norm_(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    float epsilon,
    bool rms
);

Import

from exllamav2 import exllamav2_ext as ext_c
ext_c.rms_norm(x, w, y, epsilon)
ext_c.layer_norm(x, w, b, y, epsilon)
ext_c.head_norm(x, w, b, y, epsilon, rms)

I/O Contract

Function	Parameter	Type	Direction	Description
rms_norm	x	Tensor (FP16 or FP32)	in	Input hidden states, shape (rows, dim)
rms_norm	w	Tensor (FP16)	in	Normalization weights, shape (dim,)
rms_norm	y	Tensor (FP16 or FP32)	out	Output tensor, same shape as x
rms_norm	epsilon	float	in	Small constant for numerical stability (typically 1e-6)
layer_norm	x	Tensor (FP16)	in	Input hidden states, shape (rows, dim)
layer_norm	w	Tensor (FP16)	in	Normalization weights, shape (dim,)
layer_norm	b	Tensor (FP16) or meta	in	Optional bias, shape (dim,); meta tensor means no bias
layer_norm	y	Tensor (FP16)	out	Output tensor, same shape as x
layer_norm	epsilon	float	in	Small constant for numerical stability
head_norm	x	Tensor (FP16)	in	Input states, shape (batch, num_heads, head_dim)
head_norm	w	Tensor (FP16)	in	Per-head weights, shape (num_heads, head_dim)
head_norm	b	Tensor (FP16) or meta	in	Optional per-head bias
head_norm	y	Tensor (FP16)	out	Output tensor, same shape as x
head_norm	epsilon	float	in	Small constant for numerical stability
head_norm	rms	bool	in	If true, use RMS norm; if false, use layer norm

Usage Examples

import torch
from exllamav2 import exllamav2_ext as ext_c

# RMS Normalization
x = torch.randn(4, 4096, dtype=torch.float16, device="cuda")
w = torch.ones(4096, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
ext_c.rms_norm(x, w, y, 1e-6)

# In-place RMS Normalization (x is overwritten with the normalized values)
ext_c.rms_norm_(x, w, 1e-6)

# Layer Normalization with bias
b = torch.zeros(4096, dtype=torch.float16, device="cuda")
ext_c.layer_norm(x, w, b, y, 1e-5)

# Head Normalization (RMS mode)
x_heads = torch.randn(4, 32, 128, dtype=torch.float16, device="cuda")
w_heads = torch.ones(32, 128, dtype=torch.float16, device="cuda")
b_heads = torch.empty(0, device="meta")  # no bias
y_heads = torch.empty_like(x_heads)
ext_c.head_norm(x_heads, w_heads, b_heads, y_heads, 1e-6, True)

Related Pages

Turboderp_org_Exllamav2_Ext_QAttn_H -- Quantized attention that uses normalization in its forward pass
Turboderp_org_Exllamav2_Ext_QMLP_H -- Quantized MLP that uses normalization in its forward pass
Turboderp_org_Exllamav2_Ext_TP_H -- Tensor parallelism context used by rms_norm_tp
