Implementation:Turboderp org Exllamav2 Softmax AVX2

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Sampling, SIMD, Performance_Optimization
Last Updated	2026-02-15 00:00 GMT

Overview

AVX2-optimized implementation of the softmax function that converts raw logits into a probability distribution using SIMD vectorized operations for high-throughput CPU-side sampling.

Description

softmax_cpu_avx2 is a performance-critical softmax implementation that uses Intel AVX2 (256-bit SIMD) intrinsics to process 8 float values per instruction. The function rounds the vocabulary size up to a multiple of 32 elements (vocab_size_aligned), which divides evenly into 8-float vectors.
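The rounding to a 32-element boundary can be illustrated with a small sketch. The helper names align_up_32 and vectors_of_8 are hypothetical and do not come from the source file; only the rounding formula itself matches the one used in the usage example on this page.

```cpp
#include <cassert>

// Hypothetical helper: round n up to the next multiple of 32 elements.
inline int align_up_32(int n) {
    assert(n >= 0);
    return ((n + 31) / 32) * 32;
}

// Number of 8-float (256-bit) vectors needed to cover the padded length.
inline int vectors_of_8(int n) {
    return align_up_32(n) / 8;
}
```

For a vocabulary of 50257 tokens, align_up_32 gives 50272, which corresponds to 6284 vectors of 8 floats. A size that is already a multiple of 32, such as 32000, is left unchanged.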

The implementation handles three distinct code paths based on the exponent parameter:

exponent == 2.0f (fast path): Uses a squared subtraction approach where logit differences from the maximum are squared, negated via XOR with a sign mask, and then multiplied by the inverse temperature before exponentiation. This avoids the expensive powf call entirely by leveraging SIMD multiply and XOR operations.
exponent == 1.0f (standard path): The classic softmax with temperature. If temperature is exactly 1.0, the inverse-temperature multiply is skipped as an additional optimization. Uses exp256_ps (vectorized exp from avx_mathfun.h) for SIMD exponentiation.
exponent != 1.0f and != 2.0f (fallback path): Falls back to scalar powf and expf calls per element, as arbitrary exponents cannot be efficiently vectorized.
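The three paths can be summarized in a scalar reference sketch. This is not the AVX2 code: it is a plain C++ illustration, and the names negate_via_xor, exp_argument, and softmax_reference are hypothetical. The formula used for the fallback path (the negated absolute difference raised to the exponent) is an assumption, since the description above only states that powf and expf are called per element. Logit filtering is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Flip the IEEE-754 sign bit with XOR: the scalar analogue of XOR-ing a
// SIMD register with a sign mask.
inline float negate_via_xor(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits ^= 0x80000000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Value passed to exp() for one logit (hypothetical helper).
inline float exp_argument(float logit, float max_logit,
                          float inv_temp, float exponent) {
    const float d = logit - max_logit;  // always <= 0
    if (exponent == 2.0f)               // fast path: -(d^2) * (1/T)
        return negate_via_xor(d * d) * inv_temp;
    if (exponent == 1.0f)               // standard path: d * (1/T)
        return d * inv_temp;
    // ASSUMPTION: the fallback path computes -|d|^exponent * (1/T).
    return -std::pow(std::fabs(d), exponent) * inv_temp;
}

// Scalar softmax over all logits, without filtering (illustration only).
std::vector<float> softmax_reference(const std::vector<float>& logits,
                                     float temperature, float exponent) {
    assert(!logits.empty() && temperature > 0.0f);
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    const float inv_temp = 1.0f / temperature;
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(exp_argument(logits[i], max_logit, inv_temp, exponent));
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;    // normalize to a distribution
    return probs;
}
```

Calling softmax_reference({1.0f, 2.0f, 3.0f}, 1.0f, 1.0f) yields the classic softmax, while passing 2.0f as the exponent penalizes tokens by the squared distance from the maximum logit, which concentrates more probability on the top token.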

The normalization phase accumulates the exponential sum across 8 SIMD lanes, reduces it to a scalar, and then divides all probabilities by the sum using vectorized multiply with the reciprocal.
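The normalization step can be sketched portably by emulating the 8 SIMD lanes with a small array. This is an illustration of the accumulate, reduce, and scale pattern only; the real code would keep the partial sums in a 256-bit register. The function name normalize_8_lanes is hypothetical, and the sketch assumes the buffer length is a multiple of 8, that padding entries hold 0, and that the sum is positive.

```cpp
#include <cassert>
#include <cstddef>

// Portable emulation of 8-lane accumulation followed by a horizontal
// reduction and a multiply by the reciprocal. Returns the sum.
inline float normalize_8_lanes(float* values, std::size_t n) {
    assert(n % 8 == 0);  // ASSUMPTION: buffer is padded to whole vectors

    // One partial sum per lane, as a 256-bit accumulator would hold.
    float lanes[8] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; i += 8)
        for (std::size_t k = 0; k < 8; ++k)
            lanes[k] += values[i + k];

    // Horizontal reduction of the 8 lanes to a single scalar.
    const float sum = ((lanes[0] + lanes[4]) + (lanes[1] + lanes[5])) +
                      ((lanes[2] + lanes[6]) + (lanes[3] + lanes[7]));
    assert(sum > 0.0f);

    // Multiply by the reciprocal instead of dividing each element.
    const float inv_sum = 1.0f / sum;
    for (std::size_t i = 0; i < n; ++i)
        values[i] *= inv_sum;
    return sum;
}
```

Computing the reciprocal once and multiplying replaces one division per element with one multiplication per element, which is the usual reason for this pattern in vectorized code.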

On non-x86 platforms (e.g., aarch64), a dummy fallback function is compiled that returns 0, ensuring the build does not fail.

Usage

This function is called as a drop-in replacement for the scalar softmax_cpu when the build detects AVX2 support (USE_AVX2 preprocessor macro). It is used in the sampling pipeline to convert model logits into probabilities before top-K, top-P, and other filtering stages are applied.

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: exllamav2/exllamav2_ext/cpp/sampling_avx2.cpp
Lines: 1-166

Signature

AVX2_TARGET
int softmax_cpu_avx2(
    const int vocab_size,
    const float temperature,
    const float* logits,
    const bool* logits_filter,
    const float exponent,
    float* output
);

Import

#include "sampling_avx2.h"

I/O Contract

Parameter	Type	Direction	Description
vocab_size	const int	in	Size of the vocabulary (number of logits)
temperature	const float	in	Softmax temperature; higher values produce more uniform distributions
logits	const float*	in	Raw logit values from the model, length = vocab_size
logits_filter	const bool*	in	Optional filter mask; NULL means all tokens allowed, true = allowed
exponent	const float	in	Exponent applied to logit differences (1.0 = standard, 2.0 = quadratic fast path)
output	float*	out	Probability distribution; the buffer must hold vocab_size rounded up to a multiple of 32 floats (vocab_size_aligned)

Return	Type	Description
max logit index	int	Index of the token with the highest raw logit value

Usage Examples

#include <cstdlib>   // aligned_alloc, free
#include "sampling_avx2.h"

// logits: const float* with vocab_size entries, produced by the model (not shown)

// Allocate an output buffer padded to a multiple of 32 floats, 32-byte aligned
int vocab_size = 32000;
int aligned_size = ((vocab_size + 31) / 32) * 32;
float* output = (float*)aligned_alloc(32, aligned_size * sizeof(float));

// Standard softmax with temperature=0.8
int max_idx = softmax_cpu_avx2(vocab_size, 0.8f, logits, nullptr, 1.0f, output);

// Quadratic softmax (exponent=2.0) with logit filtering
bool logit_filter[32000];
// ... set filter values (true = token allowed) ...
int max_idx2 = softmax_cpu_avx2(vocab_size, 1.0f, logits, logit_filter, 2.0f, output);

// Release the buffer once the probabilities are no longer needed
free(output);

Related Pages

Turboderp_org_Exllamav2_Sampling_H -- Header declaring all sampling function signatures
Turboderp_org_Exllamav2_Ext_Norm -- GPU normalization operations
