Implementation:Turboderp org Exllamav2 Softmax AVX2
| Knowledge Sources | |
|---|---|
| Domains | Sampling, SIMD, Performance_Optimization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
AVX2-optimized implementation of the softmax function that converts raw logits into a probability distribution using SIMD vectorized operations for high-throughput CPU-side sampling.
Description
softmax_cpu_avx2 provides a performance-critical softmax implementation that uses Intel AVX2 (256-bit SIMD) intrinsics to process 8 single-precision float values per instruction. The function rounds the vocabulary size up to the next multiple of 32 elements so that the vector loops always operate on whole 8-float groups.
The implementation handles three distinct code paths based on the exponent parameter:
- exponent == 2.0f (fast path): Uses a squared subtraction approach where logit differences from the maximum are squared, negated via XOR with a sign mask, and then multiplied by the inverse temperature before exponentiation. This avoids the expensive powf call entirely by leveraging SIMD multiply and XOR operations.
- exponent == 1.0f (standard path): The classic softmax with temperature. If temperature is exactly 1.0, the inverse-temperature multiply is skipped as an additional optimization. Uses exp256_ps (vectorized exp from avx_mathfun.h) for SIMD exponentiation.
- exponent != 1.0f and != 2.0f (fallback path): Falls back to scalar powf and expf calls per element, as arbitrary exponents cannot be efficiently vectorized.
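The three paths can be summarized with a scalar reference sketch. This is not the library's code: `softmax_reference` and `negate_via_xor` are hypothetical names, the filter mask is omitted, and the exact formula used by the fallback path (negated `pow` of the absolute difference) is an assumption made to stay consistent with the quadratic path. The XOR helper is the scalar analogue of XOR-ing a SIMD register with a sign mask.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Flip the IEEE-754 sign bit with XOR (scalar analogue of a SIMD sign-mask XOR).
static float negate_via_xor(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits ^= 0x80000000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Hypothetical scalar reference. Returns the index of the largest logit and
// writes normalized probabilities to `out`.
int softmax_reference(const std::vector<float>& logits, float temperature,
                      float exponent, std::vector<float>& out)
{
    assert(!logits.empty() && temperature > 0.0f);
    const float itemp = 1.0f / temperature;

    int max_idx = 0;
    for (int i = 1; i < (int)logits.size(); ++i)
        if (logits[i] > logits[max_idx]) max_idx = i;
    const float maxl = logits[max_idx];

    out.resize(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i)
    {
        const float d = logits[i] - maxl;               // d <= 0
        float x;
        if (exponent == 2.0f)
            x = negate_via_xor(d * d) * itemp;          // -(d^2) / T, no powf
        else if (exponent == 1.0f)
            x = (temperature == 1.0f) ? d : d * itemp;  // skip multiply when T == 1
        else
            x = -std::pow(std::fabs(d), exponent) * itemp;  // ASSUMED fallback form
        out[i] = std::exp(x);
        sum += out[i];
    }
    for (float& p : out) p /= sum;
    return max_idx;
}
```

Subtracting the maximum first keeps every exponent argument at or below zero, so `exp` cannot overflow regardless of the logit scale.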
The normalization phase accumulates the exponential sum across 8 SIMD lanes, reduces it to a scalar, and then divides all probabilities by the sum using vectorized multiply with the reciprocal.
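A minimal sketch of this normalization step, assuming GCC or Clang on x86 and a buffer whose length is a multiple of 8 with zero padding. The names `hsum256`, `normalize_avx2`, `normalize_scalar`, and `normalize_probs` are hypothetical, and the runtime check with a scalar fallback is added here only so the sketch runs on any machine; it is not taken from the library.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_X86_AVX2 1
#else
#define SKETCH_X86_AVX2 0
#endif

#if SKETCH_X86_AVX2
// Reduce the 8 lanes of a 256-bit register to one scalar sum.
__attribute__((target("avx2")))
static float hsum256(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s = _mm_add_ps(lo, hi);                  // 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         // 2 partial sums
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));  // lane 0 + lane 1
    return _mm_cvtss_f32(s);
}

__attribute__((target("avx2")))
static void normalize_avx2(float* p, int n_aligned)
{
    // Accumulate the sum across 8 lanes, 8 floats per iteration.
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n_aligned; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));

    // Multiply by the reciprocal instead of dividing each element.
    const __m256 inv = _mm256_set1_ps(1.0f / hsum256(acc));
    for (int i = 0; i < n_aligned; i += 8)
        _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), inv));
}
#endif

static void normalize_scalar(float* p, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += p[i];
    const float inv = 1.0f / sum;
    for (int i = 0; i < n; ++i) p[i] *= inv;
}

// Normalize p[0..n_aligned) in place; padding entries are assumed to be 0.
void normalize_probs(float* p, int n_aligned)
{
    assert(n_aligned > 0 && n_aligned % 8 == 0);
#if SKETCH_X86_AVX2
    if (__builtin_cpu_supports("avx2")) { normalize_avx2(p, n_aligned); return; }
#endif
    normalize_scalar(p, n_aligned);
}
```

Unaligned loads and stores are used so the sketch accepts any pointer; with a 32-byte aligned buffer the aligned variants would also be valid.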
On non-x86 platforms (e.g., aarch64), a dummy fallback function is compiled that returns 0, ensuring the build does not fail.
Usage
This function is called as a drop-in replacement for the scalar softmax_cpu when the build detects AVX2 support (USE_AVX2 preprocessor macro). It is used in the sampling pipeline to convert model logits into probabilities before top-K, top-P, and other filtering stages are applied.
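The build-time selection can be pictured as a preprocessor switch. The sketch below is illustrative only: both functions are hypothetical stand-ins with a reduced signature (they only return the index of the largest logit), and the wrapper name `softmax_dispatch` is not part of the library. Only the `USE_AVX2` macro name comes from the description above.

```cpp
#include <cassert>

// Hypothetical stand-in for the scalar entry point.
static int softmax_scalar_standin(int vocab_size, const float* logits)
{
    assert(vocab_size > 0);
    int best = 0;
    for (int i = 1; i < vocab_size; ++i)
        if (logits[i] > logits[best]) best = i;
    return best;
}

#if defined(USE_AVX2)
// Hypothetical stand-in for the vectorized entry point. A real build would
// place the AVX2 implementation here; this one just forwards.
static int softmax_avx2_standin(int vocab_size, const float* logits)
{
    return softmax_scalar_standin(vocab_size, logits);
}
#endif

// Callers use one name; the macro decides which implementation is compiled in.
int softmax_dispatch(int vocab_size, const float* logits)
{
#if defined(USE_AVX2)
    return softmax_avx2_standin(vocab_size, logits);
#else
    return softmax_scalar_standin(vocab_size, logits);
#endif
}
```

Because the choice is made at compile time, the call site carries no runtime branch, but a binary built with the macro defined must run on a CPU that supports AVX2.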
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/exllamav2_ext/cpp/sampling_avx2.cpp
- Lines: 1-166
Signature
AVX2_TARGET
int softmax_cpu_avx2(
const int vocab_size,
const float temperature,
const float* logits,
const bool* logits_filter,
const float exponent,
float* output
);
Import
#include "sampling_avx2.h"
I/O Contract
| Parameter | Type | Direction | Description |
|---|---|---|---|
| vocab_size | const int | in | Size of the vocabulary (number of logits) |
| temperature | const float | in | Softmax temperature; higher values produce more uniform distributions |
| logits | const float* | in | Raw logit values from the model, length = vocab_size |
| logits_filter | const bool* | in | Optional filter mask; NULL means all tokens allowed, true = allowed |
| exponent | const float | in | Exponent applied to logit differences (1.0 = standard, 2.0 = quadratic fast path) |
| output | float* | out | Probability distribution; the buffer length must be vocab_size rounded up to a multiple of 32 floats (vocab_size_aligned) |
| Return | Type | Description |
|---|---|---|
| max logit index | int | Index of the token with the highest raw logit value |
Usage Examples
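#include <cstdlib>  // std::aligned_alloc, std::free
#include "sampling_avx2.h"
// `logits` is assumed to point to vocab_size floats produced by the model
int vocab_size = 32000;
// Pad the output length to a multiple of 32 floats; the allocation is also 32-byte aligned
int aligned_size = ((vocab_size + 31) / 32) * 32;
float* output = (float*)std::aligned_alloc(32, aligned_size * sizeof(float));
// Standard softmax with temperature=0.8
int max_idx = softmax_cpu_avx2(vocab_size, 0.8f, logits, nullptr, 1.0f, output);
// Quadratic softmax (exponent=2.0) with logit filtering (true = token allowed)
bool logit_filter[32000];
// ... set filter values ...
int max_idx2 = softmax_cpu_avx2(vocab_size, 1.0f, logits, logit_filter, 2.0f, output);
// Release the buffer once the probabilities are no longer needed
std::free(output);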
Related Pages
- Turboderp_org_Exllamav2_Sampling_H -- Header declaring all sampling function signatures
- Turboderp_org_Exllamav2_Ext_Norm -- GPU normalization operations