Implementation:Turboderp org Exllamav2 Softmax AVX2

From Leeroopedia
Revision as of 14:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Turboderp_org_Exllamav2_Softmax_AVX2.md)
Knowledge Sources
Domains Sampling, SIMD, Performance_Optimization
Last Updated 2026-02-15 00:00 GMT

Overview

AVX2-optimized implementation of the softmax function that converts raw logits into a probability distribution using SIMD vectorized operations for high-throughput CPU-side sampling.

Description

softmax_cpu_avx2 is a performance-critical softmax implementation that uses Intel AVX2 (256-bit SIMD) intrinsics to process 8 float values per instruction. The function rounds the vocabulary size up to a multiple of 32 elements (vocab_size_aligned) so that the vector loops never have to handle a partial vector.
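The rounding step can be sketched as follows. The helper name align_up_32 is hypothetical; the arithmetic is the same expression the usage example on this page uses to size the output buffer.

```cpp
#include <cassert>

// Hypothetical helper: round n up to the next multiple of 32.
// 32 floats correspond to four 8-float AVX2 vectors.
inline int align_up_32(int n) {
    assert(n >= 0);
    return ((n + 31) / 32) * 32;
}
```

A vocabulary of 32000 tokens is already a multiple of 32 and is left unchanged, while 32001 tokens would need a buffer of 32032 floats.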

The implementation handles three distinct code paths based on the exponent parameter:

  • exponent == 2.0f (fast path): The difference between each logit and the maximum is squared, negated via XOR with a sign mask, and then multiplied by the inverse temperature before exponentiation. This avoids the expensive powf call entirely by using only SIMD multiply and XOR operations.
  • exponent == 1.0f (standard path): The classic softmax with temperature. If temperature is exactly 1.0, the inverse-temperature multiply is skipped as an additional optimization. Uses exp256_ps (vectorized exp from avx_mathfun.h) for SIMD exponentiation.
  • exponent != 1.0f and != 2.0f (fallback path): Falls back to scalar powf and expf calls per element, as arbitrary exponents cannot be efficiently vectorized.
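The three paths can be written out as a scalar reference. This is a sketch based on the description above, not the library's code: the names negate_via_xor and softmax_weights_ref are hypothetical, the logit filter is omitted, and the general-exponent formula exp(-(max - logit)^exponent / temperature) is an assumption chosen so that it reduces to the two special cases when the exponent is 1 or 2.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Flip the IEEE-754 sign bit with XOR: the scalar analogue of XOR-ing a
// SIMD vector with a sign mask.
inline float negate_via_xor(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits ^= 0x80000000u;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Hypothetical scalar reference: returns the unnormalized weights
// exp(arg_i) for each logit, following the three paths described above.
inline std::vector<float> softmax_weights_ref(const std::vector<float>& logits,
                                              float temperature,
                                              float exponent) {
    assert(!logits.empty() && temperature > 0.0f);
    float maxl = logits[0];
    for (float l : logits) maxl = std::fmax(maxl, l);

    const float itemp = 1.0f / temperature;
    std::vector<float> w(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const float d = logits[i] - maxl;  // always <= 0
        float arg;
        if (exponent == 2.0f) {
            // Fast path: square, flip the sign, scale by 1/temperature.
            arg = negate_via_xor(d * d) * itemp;
        } else if (exponent == 1.0f) {
            // Standard path: skip the multiply when temperature is exactly 1.
            arg = (temperature == 1.0f) ? d : d * itemp;
        } else {
            // Fallback path: arbitrary exponent via pow (assumed formula).
            arg = -std::pow(-d, exponent) * itemp;
        }
        w[i] = std::exp(arg);
    }
    return w;
}
```

The largest logit always receives weight 1 because its difference from the maximum is zero on every path; the weights still need to be divided by their sum to form a probability distribution.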

The normalization phase accumulates the exponential sum across 8 SIMD lanes, reduces it to a scalar, and then divides all probabilities by the sum using vectorized multiply with the reciprocal.
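The normalization phase can be illustrated with a small sketch. The function name normalize_sum8 is hypothetical, and the code is an illustration of the technique rather than the library's implementation: it assumes the length is a multiple of 8 and that any padding entries are already zero, uses unaligned loads so it works on any buffer, and falls back to a scalar emulation of the 8 lanes when the compiler is not targeting AVX2.

```cpp
#include <cassert>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical sketch: scale n floats in place so they sum to 1 and
// return the sum that was divided out. n must be a multiple of 8.
inline float normalize_sum8(float* p, int n) {
    assert(n > 0 && n % 8 == 0);
    float sum;
#if defined(__AVX2__)
    // Accumulate 8 partial sums, one per SIMD lane.
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));

    // Horizontal reduction: 8 lanes -> 4 -> 2 -> 1.
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));
    sum = _mm_cvtss_f32(s);

    // Multiply by the reciprocal instead of dividing each element.
    const __m256 inv = _mm256_set1_ps(1.0f / sum);
    for (int i = 0; i < n; i += 8)
        _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), inv));
#else
    // Scalar emulation of the same 8-lane accumulation.
    float lane[8] = {0.0f};
    for (int i = 0; i < n; i += 8)
        for (int k = 0; k < 8; ++k) lane[k] += p[i + k];
    sum = 0.0f;
    for (int k = 0; k < 8; ++k) sum += lane[k];

    const float inv = 1.0f / sum;
    for (int i = 0; i < n; ++i) p[i] *= inv;
#endif
    return sum;
}
```

Computing the reciprocal once and multiplying replaces one division per element with one division in total, since a vector multiply is cheaper than a vector divide.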

On non-x86 platforms (e.g., aarch64), a dummy fallback function is compiled that returns 0, ensuring the build does not fail.

Usage

This function is called as a drop-in replacement for the scalar softmax_cpu when the build detects AVX2 support (USE_AVX2 preprocessor macro). It is used in the sampling pipeline to convert model logits into probabilities before top-K, top-P, and other filtering stages are applied.

Code Reference

Source Location

Signature

AVX2_TARGET
int softmax_cpu_avx2(
    const int vocab_size,
    const float temperature,
    const float* logits,
    const bool* logits_filter,
    const float exponent,
    float* output
);

Import

#include "sampling_avx2.h"

I/O Contract

Parameter | Type | Direction | Description
vocab_size | const int | in | Size of the vocabulary (number of logits)
temperature | const float | in | Softmax temperature; higher values produce more uniform distributions
logits | const float* | in | Raw logit values from the model, length = vocab_size
logits_filter | const bool* | in | Optional filter mask; NULL means all tokens allowed, true = allowed
exponent | const float | in | Exponent applied to logit differences (1.0 = standard, 2.0 = quadratic fast path)
output | float* | out | Probability distribution; the buffer length must be vocab_size rounded up to a multiple of 32 floats (vocab_size_aligned)

Return | Type | Description
max logit index | int | Index of the token with the highest raw logit value

Usage Examples

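#include <cstdlib>  // std::aligned_alloc, std::free
#include "sampling_avx2.h"

// `logits` is assumed to point to vocab_size floats produced by the model.

// Allocate a 32-byte-aligned output buffer, padded to a multiple of 32 floats
const int vocab_size = 32000;
const int aligned_size = ((vocab_size + 31) / 32) * 32;
float* output = (float*)std::aligned_alloc(32, aligned_size * sizeof(float));

// Standard softmax with temperature=0.8
int max_idx = softmax_cpu_avx2(vocab_size, 0.8f, logits, nullptr, 1.0f, output);

// Quadratic softmax (exponent=2.0) with logit filtering
bool logit_filter[32000] = {};  // zero-initialized: every token filtered out until set
// ... set filter values (true = allowed) ...
int max_idx2 = softmax_cpu_avx2(vocab_size, 1.0f, logits, logit_filter, 2.0f, output);

// Release the buffer once the probabilities are no longer needed
std::free(output);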
