Implementation:Turboderp org Exllamav2 Ext Norm

From Leeroopedia
Revision as of 14:01, 16 February 2026 by Admin (Auto-imported from implementations/Turboderp_org_Exllamav2_Ext_Norm.md)
Knowledge Sources
Domains: Normalization, CUDA, C_Extension
Last Updated: 2026-02-15 00:00 GMT

Overview

C++ extension implementing GPU-accelerated normalization operations including RMS norm, layer norm, and head norm, with support for FP16/FP32 inputs and tensor-parallel execution.

Description

ext_norm.cpp provides the pybind11-accessible wrappers around CUDA normalization kernels used throughout the ExLlamaV2 transformer pipeline. The file implements seven functions:

rms_norm(x, w, y, epsilon) applies Root Mean Square normalization. It supports both FP16 and FP32 input tensors (x) and output tensors (y), while the weight tensor (w) must be FP16. The function validates shape compatibility (the size of x along dimension 1 must equal the size of w along dimension 0, and x and y must have the same shape) and dispatches to rms_norm_cuda on the appropriate CUDA stream.
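As a rough illustration of the arithmetic, the following is a minimal pure-Python sketch of the standard RMSNorm formula together with the shape check described above. The name rms_norm_ref is hypothetical, plain nested lists stand in for tensors, and the kernel's FP16 rounding and accumulation order are not reproduced; it is not the extension's code.

```python
import math

def rms_norm_ref(x, w, epsilon=1e-6):
    """Reference RMS norm over the last dimension of a 2-D input.

    x: list of rows, each a list of `dim` floats (stands in for a (rows, dim) tensor)
    w: list of `dim` floats (stands in for the weight vector)
    """
    out = []
    for row in x:
        # Mirrors the documented check: x's size along dim 1 equals w's size along dim 0.
        if len(row) != len(w):
            raise ValueError("x.shape[1] must equal w.shape[0]")
        mean_sq = sum(v * v for v in row) / len(row)
        scale = 1.0 / math.sqrt(mean_sq + epsilon)
        out.append([v * scale * wi for v, wi in zip(row, w)])
    return out

# With unit weights and epsilon = 0, each output row has a mean square of 1.
y = rms_norm_ref([[3.0, 4.0]], [1.0, 1.0], epsilon=0.0)
```

A reference of this kind can be handy for spot-checking kernel output on small inputs, allowing for FP16 tolerance.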

rms_norm_tp(x, w, y, epsilon, tp_context) is the tensor-parallel variant that applies RMS norm independently across multiple devices. It iterates over the device list from the ExtTPContext, sets each device, and launches rms_norm_cuda on each device's dedicated stream.
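The per-device loop can be pictured with the sketch below. It is only an illustration under stated assumptions: list index i stands in for device i, the helper names rms_norm_rows and rms_norm_tp_ref are hypothetical, and the device selection and stream handling done through ExtTPContext are reduced to a comment. The point it shows is that each device's tensors are normalized independently, with no reduction across devices.

```python
import math

def rms_norm_rows(x, w, epsilon):
    """RMS-normalize each row of x (a list of rows) and scale by w."""
    out = []
    for row in x:
        scale = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + epsilon)
        out.append([v * scale * wi for v, wi in zip(row, w)])
    return out

def rms_norm_tp_ref(xs, ws, ys, epsilon):
    """Apply RMS norm entry by entry, writing each result into ys.

    xs, ws and ys hold one entry per device; index i stands in for device i.
    """
    if not (len(xs) == len(ws) == len(ys)):
        raise ValueError("expected one x, w and y per device")
    for i, (x, w) in enumerate(zip(xs, ws)):
        # The real extension would select device i and launch on its stream here.
        ys[i] = rms_norm_rows(x, w, epsilon)

xs = [[[3.0, 4.0]], [[1.0, 1.0]]]   # one (rows, dim) input per device
ws = [[1.0, 1.0], [2.0, 2.0]]       # one weight vector per device
ys = [None, None]                   # output slots, one per device
rms_norm_tp_ref(xs, ws, ys, 0.0)
```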

rms_norm_(x, w, epsilon) is the in-place variant that calls rms_norm with the same tensor for both input and output.

layer_norm(x, w, b, y, epsilon) applies standard Layer Normalization with weight w and optional bias b. All tensors must be FP16. The bias can be a meta-device tensor (treated as NULL/no bias).

layer_norm_(x, w, b, epsilon) is the in-place variant of layer norm.
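The layer norm computation can be sketched in pure Python as follows. This is a reference for the textbook LayerNorm formula (biased variance over the last dimension), not the kernel itself. The name layer_norm_ref is hypothetical, nested lists stand in for FP16 tensors, and None plays the role that a meta-device bias tensor plays in the extension.

```python
import math

def layer_norm_ref(x, w, b=None, epsilon=1e-5):
    """Reference layer norm over the last dimension of a 2-D input.

    x: list of rows, each a list of `dim` floats
    w: list of `dim` floats
    b: list of `dim` floats, or None for "no bias" (the meta-tensor case)
    """
    out = []
    for row in x:
        n = len(row)
        mean = sum(row) / n
        var = sum((v - mean) ** 2 for v in row) / n  # biased variance
        scale = 1.0 / math.sqrt(var + epsilon)
        normed = [(v - mean) * scale * wi for v, wi in zip(row, w)]
        if b is not None:
            normed = [v + bi for v, bi in zip(normed, b)]
        out.append(normed)
    return out

x = [[1.0, 2.0, 3.0, 4.0]]
w = [1.0] * 4
no_bias = layer_norm_ref(x, w)                  # output row is centered on 0
with_bias = layer_norm_ref(x, w, b=[1.0] * 4)   # output row is centered on 1
```

The in-place variant corresponds to writing the result back into x instead of returning a new list.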

head_norm(x, w, b, y, epsilon, rms) applies per-head normalization, supporting both RMS and standard layer norm modes via the rms boolean flag. The tensor shapes are validated along the last two dimensions (-2 for num_heads, -1 for head_dim).

head_norm_(x, w, b, epsilon, rms) is the in-place variant of head norm.
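A pure-Python sketch of per-head normalization is given below, assuming the input is laid out as (rows, num_heads, head_dim) and the weights as (num_heads, head_dim). The name head_norm_ref is hypothetical and nested lists stand in for tensors. Whether the kernel applies the bias in RMS mode is not stated here, so this sketch simply adds a bias whenever one is supplied.

```python
import math

def head_norm_ref(x, w, b=None, epsilon=1e-6, rms=True):
    """Normalize each head of each row independently.

    x: nested list shaped (rows, num_heads, head_dim)
    w: nested list shaped (num_heads, head_dim)
    b: same shape as w, or None for "no bias" (the meta-tensor case)
    rms: True for RMS norm, False for standard layer norm
    """
    out = []
    for token in x:
        if len(token) != len(w):
            raise ValueError("num_heads mismatch between x and w")
        heads = []
        for h, head in enumerate(token):
            n = len(head)
            if n != len(w[h]):
                raise ValueError("head_dim mismatch between x and w")
            # RMS mode skips mean subtraction, so `var` becomes the mean square.
            mean = 0.0 if rms else sum(head) / n
            var = sum((v - mean) ** 2 for v in head) / n
            scale = 1.0 / math.sqrt(var + epsilon)
            normed = [(v - mean) * scale * w[h][i] for i, v in enumerate(head)]
            if b is not None:
                normed = [v + b[h][i] for i, v in enumerate(normed)]
            heads.append(normed)
        out.append(heads)
    return out

x = [[[3.0, 4.0], [1.0, 3.0]]]      # 1 row, 2 heads, head_dim 2
w = [[1.0, 1.0], [1.0, 1.0]]
rms_out = head_norm_ref(x, w, None, 0.0, True)
ln_out = head_norm_ref(x, w, None, 0.0, False)
```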

Usage

These functions are called by the Python-side transformer layer modules during forward passes. rms_norm is used in attention and MLP pre-normalization for models like LLaMA. layer_norm is used for models with standard LayerNorm (e.g., GPT-2 style). head_norm is used for per-head query/key normalization in architectures that require it.

Code Reference

Source Location

Signature

void rms_norm(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor y,
    float epsilon
);

void rms_norm_tp(
    std::vector<torch::Tensor> x,
    std::vector<torch::Tensor> w,
    std::vector<torch::Tensor> y,
    float epsilon,
    uintptr_t tp_context
);

void rms_norm_(
    torch::Tensor x,
    torch::Tensor w,
    float epsilon
);

void layer_norm(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    torch::Tensor y,
    float epsilon
);

void layer_norm_(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    float epsilon
);

void head_norm(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    torch::Tensor y,
    float epsilon,
    bool rms
);

void head_norm_(
    torch::Tensor x,
    torch::Tensor w,
    torch::Tensor b,
    float epsilon,
    bool rms
);

Import

from exllamav2.ext import exllamav2_ext as ext_c
ext_c.rms_norm(x, w, y, epsilon)
ext_c.layer_norm(x, w, b, y, epsilon)
ext_c.head_norm(x, w, b, y, epsilon, rms)

I/O Contract

Function | Parameter | Type | Direction | Description
rms_norm | x | Tensor (FP16 or FP32) | in | Input hidden states, shape (rows, dim)
rms_norm | w | Tensor (FP16) | in | Normalization weights, shape (dim,)
rms_norm | y | Tensor (FP16 or FP32) | out | Output tensor, same shape as x
rms_norm | epsilon | float | in | Small constant for numerical stability (typically 1e-6)
layer_norm | x | Tensor (FP16) | in | Input hidden states, shape (rows, dim)
layer_norm | w | Tensor (FP16) | in | Normalization weights, shape (dim,)
layer_norm | b | Tensor (FP16) or meta | in | Optional bias, shape (dim,); meta tensor means no bias
layer_norm | y | Tensor (FP16) | out | Output tensor, same shape as x
head_norm | x | Tensor (FP16) | in | Input states, shape (batch, num_heads, head_dim)
head_norm | w | Tensor (FP16) | in | Per-head weights, shape (num_heads, head_dim)
head_norm | b | Tensor (FP16) or meta | in | Optional per-head bias
head_norm | rms | bool | in | If true, use RMS norm; if false, use layer norm

Usage Examples

import torch
from exllamav2.ext import exllamav2_ext as ext_c

# RMS Normalization
x = torch.randn(4, 4096, dtype=torch.float16, device="cuda")
w = torch.ones(4096, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
ext_c.rms_norm(x, w, y, 1e-6)

# In-place RMS Normalization
ext_c.rms_norm_(x, w, 1e-6)

# Layer Normalization with bias
b = torch.zeros(4096, dtype=torch.float16, device="cuda")
ext_c.layer_norm(x, w, b, y, 1e-5)

# Head Normalization (RMS mode)
x_heads = torch.randn(4, 32, 128, dtype=torch.float16, device="cuda")
w_heads = torch.ones(32, 128, dtype=torch.float16, device="cuda")
b_heads = torch.empty(0, device="meta")  # no bias
y_heads = torch.empty_like(x_heads)
ext_c.head_norm(x_heads, w_heads, b_heads, y_heads, 1e-6, True)
