Implementation:Turboderp org Exllamav2 Ext QMLP
| Knowledge Sources | |
|---|---|
| Domains | MLP, Quantization, CUDA, Mixture_of_Experts |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
C++ extension implementing quantized MLP (gate/up/down projections with fused LayerNorm and activation) and Mixture-of-Experts MLP layers, including a tensor-parallel variant for multi-GPU inference.
Description
ext_qmlp.cpp provides the feed-forward network computation for ExLlamaV2 transformer layers, supporting both standard gated MLP and MoE architectures.
Standard MLP
- make_q_mlp -- Constructs a QMLP object from layernorm weights, three quantized projection matrices (gate, up, down), temporary buffers, and configuration flags. Supports SiLU (default) or GeLU activation via act_gelu. The residual connection can operate in FP32 mode when residual_fp32=true. Returns an opaque handle.
- q_mlp_forward_ -- Executes the full MLP forward pass in-place on tensor x: (1) apply LayerNorm, (2) project through the gate and up matrices, (3) apply the activation and multiply elementwise, (4) project through the down matrix, (5) add the residual. Validates that the dimensions of x match the up projection height and that the row count does not exceed max_rows.
- q_mlp_set_loras -- Configures LoRA adapter matrices for gate, up, and down projections. Returns the maximum rank across all adapters for buffer allocation.
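The five forward steps above can be checked against a plain-NumPy reference sketch (unquantized, RMSNorm variant; the function names rms_norm, silu, and gated_mlp_forward are illustrative and not part of the extension's API):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale each row to unit RMS, then apply the learned weight.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def silu(v):
    # SiLU(v) = v * sigmoid(v), the default activation (act_gelu=False).
    return v / (1.0 + np.exp(-v))

def gated_mlp_forward(x, ln_w, w_gate, w_up, w_down, eps=1e-6):
    """Reference version of the five steps:
    norm -> gate/up projections -> activation * multiply -> down -> residual."""
    h = rms_norm(x, ln_w, eps)
    a = h @ w_gate            # (2) gate projection
    b = h @ w_up              # (2) up projection
    h = silu(a) * b           # (3) activation, elementwise multiply
    return x + h @ w_down     # (4) down projection, (5) add residual

rng = np.random.default_rng(0)
d, f, n = 8, 16, 4            # hidden dim, intermediate dim, rows
x = rng.standard_normal((n, d))
out = gated_mlp_forward(x, np.ones(d), rng.standard_normal((d, f)),
                        rng.standard_normal((d, f)), rng.standard_normal((f, d)))
print(out.shape)  # (4, 8)
```

The quantized path computes the same arithmetic, but the gate/up/down matmuls run through QMatrix CUDA kernels and x is modified in-place.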
Mixture-of-Experts MLP
- make_q_moe_mlp -- Constructs a QMoEMLP object for MoE layers. Takes a router gate matrix (transposed FP16), per-expert quantized w1/w2/w3 matrices (as vectors of QMatrix handles), and configuration parameters including num_experts and num_experts_per_token. The gate matrix dimensions are validated against the layernorm and expert count.
- q_moe_mlp_forward_ -- Executes the MoE forward pass: (1) apply LayerNorm, (2) compute router logits to select top-k experts, (3) gather tokens per expert, (4) run each expert's gated MLP, (5) scatter and sum expert outputs.
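Steps (2)-(5) of the MoE pass can be sketched in NumPy. This is a simplified illustration, not the extension's implementation: LayerNorm is omitted, simple linear matrices stand in for the gated expert MLPs, and the per-token loop replaces the gather/scatter kernels:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route each token to its top-k experts and combine expert outputs
    with softmax-normalized router weights."""
    logits = x @ gate_w.T                            # (2) router logits [tokens, experts]
    topk = np.argsort(-logits, axis=-1)[:, :top_k]   # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                      # (3) per-token expert assignment
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                 # softmax over the selected experts
        for weight, e in zip(w, topk[t]):
            out[t] += weight * (x[t] @ experts[e])   # (4)+(5) run expert, scatter-sum
    return out

rng = np.random.default_rng(0)
d, n, n_experts = 8, 4, 4
x = rng.standard_normal((n, d))
gate_w = rng.standard_normal((n_experts, d))         # router matrix [num_experts, hidden_dim]
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
out = moe_forward(x, gate_w, experts, top_k=2)
print(out.shape)  # (4, 8)
```

Because the router weights are normalized over the selected experts, identical experts reduce the layer to a single expert's output, which is a useful sanity check.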
Tensor-Parallel MLP
- tp_mlp_forward_ -- Tensor-parallel version of the standard MLP. Broadcasts hidden states to all devices, applies per-device layernorm, computes gate/up projections in parallel, applies activation/multiply, gathers intermediate results, projects through down, adds residual, and gathers final output. Uses barrier synchronization for device coordination and supports multithreaded execution via the TP context's thread pool.
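The arithmetic behind the parallel split, with each device holding a column slice of the gate/up matrices and the matching row slice of the down matrix, so that partial down outputs sum to the full result, can be verified with a small NumPy sketch (variable names are illustrative, not the extension's API):

```python
import numpy as np

rng = np.random.default_rng(1)
d, f, n, devices = 8, 16, 4, 2
x = rng.standard_normal((n, d))
w_gate = rng.standard_normal((d, f))
w_up = rng.standard_normal((d, f))
w_down = rng.standard_normal((f, d))

silu = lambda v: v / (1.0 + np.exp(-v))   # v * sigmoid(v)

# Unsharded reference (layernorm and residual omitted for brevity).
ref = (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Sharded: split the intermediate dimension f across "devices".
chunk = f // devices
partials = []
for dev in range(devices):
    s = slice(dev * chunk, (dev + 1) * chunk)
    h = silu(x @ w_gate[:, s]) * (x @ w_up[:, s])  # per-device gate/up + activation
    partials.append(h @ w_down[s, :])              # per-device partial down projection
out = sum(partials)                                # reduce/gather across devices
print(np.allclose(out, ref))  # True
```

The real kernel adds the broadcast of hidden states, per-device layernorm, residual addition, and barrier synchronization around the reduce step.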
Usage
Use make_q_mlp during model initialization for standard transformer MLP layers, and make_q_moe_mlp for Mixture-of-Experts layers (e.g., Mixtral). Call the corresponding forward function during inference. Use tp_mlp_forward_ for tensor-parallel multi-GPU configurations.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/exllamav2_ext/ext_qmlp.cpp
- Lines: 1-473
Signature
uintptr_t make_q_mlp(
torch::Tensor layernorm,
torch::Tensor layernorm_bias,
bool layernorm_is_rms,
float norm_epsilon,
uintptr_t q_gate,
uintptr_t q_up,
uintptr_t q_down,
torch::Tensor temp_state,
torch::Tensor temp_a,
torch::Tensor temp_b,
torch::Tensor temp_dq,
int max_rows,
bool act_gelu,
bool has_residual,
torch::Tensor post_layernorm,
torch::Tensor post_layernorm_bias,
bool residual_fp32,
bool use_graphs
);
void q_mlp_forward_(
uintptr_t q_mlp,
torch::Tensor x,
const std::vector<uintptr_t>& loras,
torch::Tensor loras_temp
);
int q_mlp_set_loras(
uintptr_t q_mlp,
std::unordered_map<uintptr_t, torch::Tensor>& gate_proj_lora_a,
std::unordered_map<uintptr_t, torch::Tensor>& gate_proj_lora_b,
std::unordered_map<uintptr_t, torch::Tensor>& up_proj_lora_a,
std::unordered_map<uintptr_t, torch::Tensor>& up_proj_lora_b,
std::unordered_map<uintptr_t, torch::Tensor>& down_proj_lora_a,
std::unordered_map<uintptr_t, torch::Tensor>& down_proj_lora_b
);
uintptr_t make_q_moe_mlp(
torch::Tensor layernorm,
torch::Tensor layernorm_bias,
bool layernorm_is_rms,
float norm_epsilon,
torch::Tensor gate,
int num_experts,
int num_experts_per_token,
const std::vector<uintptr_t>& w1,
const std::vector<uintptr_t>& w2,
const std::vector<uintptr_t>& w3,
torch::Tensor temp_state,
torch::Tensor temp_gathered_state,
torch::Tensor temp_a,
torch::Tensor temp_b,
torch::Tensor temp_logits,
torch::Tensor temp_dq,
int max_rows,
bool act_gelu
);
void q_moe_mlp_forward_(
uintptr_t q_moe_mlp,
torch::Tensor x
);
void tp_mlp_forward_(
uintptr_t tp_context,
torch::Tensor hidden_states,
const std::vector<torch::Tensor> &temp_bc0_,
const std::vector<torch::Tensor> &temp_bc1_,
const std::vector<torch::Tensor> &temp_bc2_,
const std::vector<torch::Tensor> &temp_gate_,
const std::vector<torch::Tensor> &temp_up_,
const std::vector<torch::Tensor> &temp_down_,
const std::vector<torch::Tensor> &pre_layernorm,
float norm_epsilon,
const std::vector<uintptr_t> &gate,
const std::vector<uintptr_t> &up,
const std::vector<uintptr_t> &down,
bool act_gelu
);
Import
from exllamav2.ext import exllamav2_ext as ext_c
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| layernorm | torch.Tensor (kHalf) | Pre-MLP layer norm weights |
| q_gate, q_up, q_down | uintptr_t | QMatrix handles for gate, up, and down projections |
| x | torch.Tensor (kHalf or kFloat) | Input hidden states; modified in-place |
| loras | std::vector<uintptr_t> | Active LoRA adapter handles |
| gate (MoE) | torch.Tensor (kHalf) | Router gate matrix, shape [num_experts, hidden_dim] (transposed) |
| num_experts | int | Total number of experts in the MoE layer |
| num_experts_per_token | int | Top-k experts selected per token |
| w1, w2, w3 | std::vector<uintptr_t> | Per-expert QMatrix handles for MoE projections |
| tp_context | uintptr_t | Tensor-parallelism context (TP variant) |
Outputs
| Function | Return | Description |
|---|---|---|
| make_q_mlp | uintptr_t | Opaque handle to the QMLP object |
| q_mlp_forward_ | void | Modifies x in-place with MLP output + residual |
| q_mlp_set_loras | int | Maximum LoRA rank across all projections |
| make_q_moe_mlp | uintptr_t | Opaque handle to the QMoEMLP object |
| q_moe_mlp_forward_ | void | Modifies x in-place with MoE MLP output |
| tp_mlp_forward_ | void | Writes final output into the gathered result tensors |
Usage Examples
from exllamav2.ext import exllamav2_ext as ext_c
# Create standard gated MLP layer
mlp_handle = ext_c.make_q_mlp(
    layernorm, layernorm_bias,
    True,                 # layernorm_is_rms
    1e-6,                 # norm_epsilon
    gate_handle, up_handle, down_handle,
    temp_state, temp_a, temp_b, temp_dq,
    2048,                 # max_rows
    False,                # act_gelu
    True,                 # has_residual
    post_layernorm, post_layernorm_bias,
    False,                # residual_fp32
    False                 # use_graphs
)
# Forward pass (in-place on hidden_states)
ext_c.q_mlp_forward_(mlp_handle, hidden_states, loras, loras_temp)
# Create MoE MLP layer (e.g., Mixtral with 8 experts, top-2)
moe_handle = ext_c.make_q_moe_mlp(
    layernorm, layernorm_bias,
    True,                 # layernorm_is_rms
    1e-6,                 # norm_epsilon
    gate_weight,
    8,                    # num_experts
    2,                    # num_experts_per_token
    w1_handles, w2_handles, w3_handles,
    temp_state, temp_gathered, temp_a, temp_b, temp_logits, temp_dq,
    2048,                 # max_rows
    False                 # act_gelu
)
# MoE forward pass
ext_c.q_moe_mlp_forward_(moe_handle, hidden_states)