Implementation:Turboderp org Exllamav2 Ext QMLP
| Knowledge Sources | |
|---|---|
| Domains | MLP, Quantization, CUDA, Mixture_of_Experts |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
C++ extension implementing quantized MLP (gate/up/down projections with fused LayerNorm and activation) and Mixture-of-Experts MLP layers, including a tensor-parallel variant for multi-GPU inference.
Description
ext_qmlp.cpp provides the feed-forward network computation for ExLlamaV2 transformer layers, supporting both standard gated MLP and MoE architectures.
Standard MLP
- make_q_mlp -- Constructs a QMLP object from layernorm weights, three quantized projection matrices (gate, up, down), temporary buffers, and configuration flags. Supports SiLU (default) or GeLU activation via act_gelu. The residual connection can operate in FP32 mode when residual_fp32=true. Returns an opaque handle.
- q_mlp_forward_ -- Executes the full MLP forward pass in-place on tensor x: (1) apply LayerNorm, (2) project through the gate and up matrices, (3) apply the activation and multiply elementwise, (4) project through the down matrix, (5) add the residual. Validates that the dimensions of x match the up projection height and that the row count does not exceed max_rows.
- q_mlp_set_loras -- Configures LoRA adapter matrices for gate, up, and down projections. Returns the maximum rank across all adapters for buffer allocation.
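The five forward steps above can be checked against a plain-NumPy reference sketch (unquantized, RMSNorm variant; the function names rms_norm, silu, and gated_mlp_forward are illustrative and not part of the extension's API):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale each row to unit RMS, then apply the learned weight.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def silu(v):
    # SiLU(v) = v * sigmoid(v), the default activation (act_gelu=False).
    return v / (1.0 + np.exp(-v))

def gated_mlp_forward(x, ln_w, w_gate, w_up, w_down, eps=1e-6):
    """Reference version of the five steps:
    norm -> gate/up projections -> activation * multiply -> down -> residual."""
    h = rms_norm(x, ln_w, eps)
    a = h @ w_gate            # (2) gate projection
    b = h @ w_up              # (2) up projection
    h = silu(a) * b           # (3) activation, elementwise multiply
    return x + h @ w_down     # (4) down projection, (5) add residual

rng = np.random.default_rng(0)
d, f, n = 8, 16, 4            # hidden dim, intermediate dim, rows
x = rng.standard_normal((n, d))
out = gated_mlp_forward(x, np.ones(d), rng.standard_normal((d, f)),
                        rng.standard_normal((d, f)), rng.standard_normal((f, d)))
print(out.shape)  # (4, 8)
```

The quantized path computes the same arithmetic, but the gate/up/down matmuls run through QMatrix CUDA kernels and x is modified in-place.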
Mixture-of-Experts MLP
- make_q_moe_mlp -- Constructs a QMoEMLP object for MoE layers. Takes a router gate matrix (transposed FP16), per-expert quantized w1/w2/w3 matrices (as vectors of QMatrix handles), and configuration parameters including num_experts and num_experts_per_token. The gate matrix dimensions are validated against the layernorm and expert count.
- q_moe_mlp_forward_ -- Executes the MoE forward pass: (1) apply LayerNorm, (2) compute router logits to select top-k experts, (3) gather tokens per expert, (4) run each expert's gated MLP, (5) scatter and sum expert outputs.
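Steps (2)-(5) of the MoE pass can be sketched in NumPy. This is a simplified illustration, not the extension's implementation: LayerNorm is omitted, simple linear matrices stand in for the gated expert MLPs, and the per-token loop replaces the gather/scatter kernels:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route each token to its top-k experts and combine expert outputs
    with softmax-normalized router weights."""
    logits = x @ gate_w.T                            # (2) router logits [tokens, experts]
    topk = np.argsort(-logits, axis=-1)[:, :top_k]   # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                      # (3) per-token expert assignment
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                 # softmax over the selected experts
        for weight, e in zip(w, topk[t]):
            out[t] += weight * (x[t] @ experts[e])   # (4)+(5) run expert, scatter-sum
    return out

rng = np.random.default_rng(0)
d, n, n_experts = 8, 4, 4
x = rng.standard_normal((n, d))
gate_w = rng.standard_normal((n_experts, d))         # router matrix [num_experts, hidden_dim]
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
out = moe_forward(x, gate_w, experts, top_k=2)
print(out.shape)  # (4, 8)
```

Because the router weights are normalized over the selected experts, identical experts reduce the layer to a single expert's output, which is a useful sanity check.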
Tensor-Parallel MLP
- tp_mlp_forward_ -- Tensor-parallel version of the standard MLP. Broadcasts hidden states to all devices, applies per-device layernorm, computes gate/up projections in parallel, applies activation/multiply, gathers intermediate results, projects through down, adds residual, and gathers final output. Uses barrier synchronization for device coordination and supports multithreaded execution via the TP context's thread pool.
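The arithmetic behind the parallel split, with each device holding a column slice of the gate/up matrices and the matching row slice of the down matrix, so that partial down outputs sum to the full result, can be verified with a small NumPy sketch (variable names are illustrative, not the extension's API):

```python
import numpy as np

rng = np.random.default_rng(1)
d, f, n, devices = 8, 16, 4, 2
x = rng.standard_normal((n, d))
w_gate = rng.standard_normal((d, f))
w_up = rng.standard_normal((d, f))
w_down = rng.standard_normal((f, d))

silu = lambda v: v / (1.0 + np.exp(-v))   # v * sigmoid(v)

# Unsharded reference (layernorm and residual omitted for brevity).
ref = (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Sharded: split the intermediate dimension f across "devices".
chunk = f // devices
partials = []
for dev in range(devices):
    s = slice(dev * chunk, (dev + 1) * chunk)
    h = silu(x @ w_gate[:, s]) * (x @ w_up[:, s])  # per-device gate/up + activation
    partials.append(h @ w_down[s, :])              # per-device partial down projection
out = sum(partials)                                # reduce/gather across devices
print(np.allclose(out, ref))  # True
```

The real kernel adds the broadcast of hidden states, per-device layernorm, residual addition, and barrier synchronization around the reduce step.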
Usage
Use make_q_mlp during model initialization for standard transformer MLP layers, and make_q_moe_mlp for Mixture-of-Experts layers (e.g., Mixtral). Call the corresponding forward function during inference. Use tp_mlp_forward_ for tensor-parallel multi-GPU configurations.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/exllamav2_ext/ext_qmlp.cpp
- Lines: 1-473
Signature
uintptr_t make_q_mlp(
torch::Tensor layernorm,
torch::Tensor layernorm_bias,
bool layernorm_is_rms,
float norm_epsilon,
uintptr_t q_gate,
uintptr_t q_up,
uintptr_t q_down,
torch::Tensor temp_state,
torch::Tensor temp_a,
torch::Tensor temp_b,
torch::Tensor temp_dq,
int max_rows,
bool act_gelu,
bool has_residual,
torch::Tensor post_layernorm,
torch::Tensor post_layernorm_bias,
bool residual_fp32,
bool use_graphs
);
void q_mlp_forward_(
uintptr_t q_mlp,
torch::Tensor x,
const std::vector<uintptr_t>& loras,
torch::Tensor loras_temp
);
int q_mlp_set_loras(
uintptr_t q_mlp,
std::unordered_map<uintptr_t, torch::Tensor>& gate_proj_lora_a,
std::unordered_map<uintptr_t, torch::Tensor>& gate_proj_lora_b,
std::unordered_map<uintptr_t, torch::Tensor>& up_proj_lora_a,
std::unordered_map<uintptr_t, torch::Tensor>& up_proj_lora_b,
std::unordered_map<uintptr_t, torch::Tensor>& down_proj_lora_a,
std::unordered_map<uintptr_t, torch::Tensor>& down_proj_lora_b
);
uintptr_t make_q_moe_mlp(
torch::Tensor layernorm,
torch::Tensor layernorm_bias,
bool layernorm_is_rms,
float norm_epsilon,
torch::Tensor gate,
int num_experts,
int num_experts_per_token,
const std::vector<uintptr_t>& w1,
const std::vector<uintptr_t>& w2,
const std::vector<uintptr_t>& w3,
torch::Tensor temp_state,
torch::Tensor temp_gathered_state,
torch::Tensor temp_a,
torch::Tensor temp_b,
torch::Tensor temp_logits,
torch::Tensor temp_dq,
int max_rows,
bool act_gelu
);
void q_moe_mlp_forward_(
uintptr_t q_moe_mlp,
torch::Tensor x
);
void tp_mlp_forward_(
uintptr_t tp_context,
torch::Tensor hidden_states,
const std::vector<torch::Tensor> &temp_bc0_,
const std::vector<torch::Tensor> &temp_bc1_,
const std::vector<torch::Tensor> &temp_bc2_,
const std::vector<torch::Tensor> &temp_gate_,
const std::vector<torch::Tensor> &temp_up_,
const std::vector<torch::Tensor> &temp_down_,
const std::vector<torch::Tensor> &pre_layernorm,
float norm_epsilon,
const std::vector<uintptr_t> &gate,
const std::vector<uintptr_t> &up,
const std::vector<uintptr_t> &down,
bool act_gelu
);
Import
from exllamav2.ext import exllamav2_ext as ext_c
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| layernorm | torch.Tensor (kHalf) | Pre-MLP layer norm weights |
| q_gate, q_up, q_down | uintptr_t | QMatrix handles for gate, up, and down projections |
| x | torch.Tensor (kHalf or kFloat) | Input hidden states; modified in-place |
| loras | std::vector<uintptr_t> | Active LoRA adapter handles |
| gate (MoE) | torch.Tensor (kHalf) | Router gate matrix, shape [num_experts, hidden_dim] (transposed) |
| num_experts | int | Total number of experts in the MoE layer |
| num_experts_per_token | int | Top-k experts selected per token |
| w1, w2, w3 | std::vector<uintptr_t> | Per-expert QMatrix handles for MoE projections |
| tp_context | uintptr_t | Tensor-parallelism context (TP variant) |
Outputs
| Function | Return | Description |
|---|---|---|
| make_q_mlp | uintptr_t | Opaque handle to the QMLP object |
| q_mlp_forward_ | void | Modifies x in-place with MLP output + residual |
| q_mlp_set_loras | int | Maximum LoRA rank across all projections |
| make_q_moe_mlp | uintptr_t | Opaque handle to the QMoEMLP object |
| q_moe_mlp_forward_ | void | Modifies x in-place with MoE MLP output |
| tp_mlp_forward_ | void | Writes final output into the gathered result tensors |
Usage Examples
from exllamav2.ext import exllamav2_ext as ext_c
# Create standard gated MLP layer
mlp_handle = ext_c.make_q_mlp(
    layernorm, layernorm_bias,
    True,                 # layernorm_is_rms
    1e-6,                 # norm_epsilon
    gate_handle, up_handle, down_handle,
    temp_state, temp_a, temp_b, temp_dq,
    2048,                 # max_rows
    False,                # act_gelu
    True,                 # has_residual
    post_layernorm, post_layernorm_bias,
    False,                # residual_fp32
    False                 # use_graphs
)
# Forward pass (in-place on hidden_states)
ext_c.q_mlp_forward_(mlp_handle, hidden_states, loras, loras_temp)
# Create MoE MLP layer (e.g., Mixtral with 8 experts, top-2)
moe_handle = ext_c.make_q_moe_mlp(
    layernorm, layernorm_bias,
    True,                 # layernorm_is_rms
    1e-6,                 # norm_epsilon
    gate_weight,
    8,                    # num_experts
    2,                    # num_experts_per_token
    w1_handles, w2_handles, w3_handles,
    temp_state, temp_gathered, temp_a, temp_b, temp_logits, temp_dq,
    2048,                 # max_rows
    False                 # act_gelu
)
# MoE forward pass
ext_c.q_moe_mlp_forward_(moe_handle, hidden_states)