
Implementation:Sgl project Sglang CPU Activation

From Leeroopedia


Knowledge Sources

Domains: CPU Inference, Activation Functions
Last Updated: 2026-02-10 00:00 GMT

Overview

CPU-optimized fused activation-and-multiply functions (SiLU, GELU-tanh, GELU) using SIMD vectorization for LLM inference.

Description

activation.cpp implements three fused gated activation functions that are fundamental building blocks of modern LLM architectures (such as LLaMA FFN layers). The core is a templated act_and_mul_kernel_impl function that splits an input tensor of shape [num_tokens, 2*d] into two halves along the last dimension, applies an activation function to the first half, multiplies it element-wise with the second half, and writes a result tensor of shape [num_tokens, d].

The implementation uses ATen vectorized operations (at::vec::Vectorized) for SIMD acceleration on bfloat16/float16 data with float32 intermediate computation. It parallelizes across tokens via at::parallel_for and uses #pragma GCC unroll 4 for loop unrolling. A scalar fallback handles the tail elements that do not fill a complete SIMD vector.

Three public functions are exposed:

  • silu_and_mul_cpu: gated multiplication with SiLU (x * sigmoid(x))
  • gelu_tanh_and_mul_cpu: gated multiplication with the tanh approximation of GELU
  • gelu_and_mul_cpu: gated multiplication with standard GELU (computed via erf)

All three are dispatched for reduced floating-point types via AT_DISPATCH_REDUCED_FLOATING_TYPES.

Usage

Use these functions as drop-in replacements for GPU activation kernels when running LLM inference on CPU. They fuse the activation and multiplication into a single vectorized kernel, avoiding the overhead of separate PyTorch operations.

Code Reference

Source Location

Signature

// Internal template (anonymous namespace)
template <typename scalar_t, typename func_t, typename vec_func_t>
void act_and_mul_kernel_impl(
    scalar_t* __restrict__ output,
    const scalar_t* __restrict__ input,
    int64_t num_tokens,
    int64_t dim,
    const func_t& f,
    const vec_func_t& vf);

// Public API
at::Tensor silu_and_mul_cpu(at::Tensor& input);
at::Tensor gelu_tanh_and_mul_cpu(const at::Tensor& input);
at::Tensor gelu_and_mul_cpu(const at::Tensor& input);

Import

#include "common.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
input at::Tensor Yes Input tensor of shape [num_tokens, 2*d] with bfloat16 or float16 dtype

Outputs

Name Type Description
output at::Tensor Result tensor of shape [num_tokens, d] with same dtype as input

Usage Examples

// Called from PyTorch C++ extension:
// SiLU gated activation
at::Tensor input = /* shape [batch, 2 * hidden_dim] */;
at::Tensor output = silu_and_mul_cpu(input);
// output shape: [batch, hidden_dim]

// GELU-tanh gated activation
at::Tensor output_gelu = gelu_tanh_and_mul_cpu(input);

// Standard GELU gated activation
at::Tensor output_gelu_std = gelu_and_mul_cpu(input);
