Implementation: Sgl_project_Sglang CPU Activation
| Knowledge Sources | Details |
|---|---|
| Domains | CPU Inference, Activation Functions |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
CPU-optimized fused activation-and-multiply functions (SiLU, GELU-tanh, GELU) using SIMD vectorization for LLM inference.
Description
activation.cpp implements three fused gated activation functions that are fundamental building blocks of modern LLM architectures (for example, LLaMA-style gated FFN layers). The core is a templated act_and_mul_kernel_impl function that splits an input tensor of shape [num_tokens, 2*d] into two halves along the last dimension, applies an activation function to the first half, multiplies it element-wise with the second half, and writes a result of shape [num_tokens, d].
The implementation uses ATen vectorized operations (at::vec::Vectorized) for SIMD acceleration on bfloat16/float16 data with float32 intermediate computation. It parallelizes across tokens via at::parallel_for and uses #pragma GCC unroll 4 for loop unrolling. A scalar fallback handles the tail elements that do not fill a complete SIMD vector.
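The split-apply-multiply pattern described above can be sketched in plain C++. This is a minimal scalar reference, not the actual kernel: it omits the ATen vectorization, the at::parallel_for partitioning, and the bfloat16/float16 handling, and the name silu_and_mul_ref is illustrative rather than taken from the source.

```cpp
#include <cmath>
#include <cstdint>

// SiLU: x * sigmoid(x)
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Scalar reference for the fused pattern: input is laid out as
// [num_tokens, 2*d]; the first d columns pass through the activation,
// the second d columns are the gate that is multiplied in element-wise.
// The real kernel additionally vectorizes the inner loop with
// at::vec::Vectorized and keeps a scalar tail for leftover elements.
void silu_and_mul_ref(float* out, const float* in,
                      int64_t num_tokens, int64_t d) {
  for (int64_t t = 0; t < num_tokens; ++t) {
    const float* x    = in + t * 2 * d;      // activation half
    const float* gate = in + t * 2 * d + d;  // multiplicative half
    float* y = out + t * d;
    #pragma GCC unroll 4
    for (int64_t i = 0; i < d; ++i) {
      y[i] = silu(x[i]) * gate[i];
    }
  }
}
```

The gain from fusion is that each output element is produced in one pass over the input, instead of materializing an intermediate activation tensor and then multiplying.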
Three public functions are exposed:
- silu_and_mul_cpu: Implements SiLU (x * sigmoid(x)) gated multiplication
- gelu_tanh_and_mul_cpu: Implements GELU with tanh approximation gated multiplication
- gelu_and_mul_cpu: Implements standard GELU (using erf) gated multiplication
All three are dispatched for reduced floating-point types via AT_DISPATCH_REDUCED_FLOATING_TYPES.
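For reference, the three activations reduce to the following standard scalar formulas. This is a sketch of the math only; the actual kernel evaluates vectorized equivalents with float32 intermediates.

```cpp
#include <cmath>

// SiLU: x * sigmoid(x)
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// GELU, tanh approximation:
// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline float gelu_tanh(float x) {
  const float k = 0.7978845608f;  // sqrt(2/pi)
  return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// Exact GELU via the error function: 0.5 * x * (1 + erf(x / sqrt(2)))
inline float gelu_erf(float x) {
  return 0.5f * x * (1.0f + std::erf(x * 0.7071067812f));  // 1/sqrt(2)
}
```

The tanh variant is a close approximation of the erf variant; architectures differ in which one they specify, which is why both are exposed.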
Usage
Use these functions as drop-in replacements for GPU activation kernels when running LLM inference on CPU. They fuse the activation and the element-wise multiplication into a single vectorized kernel, avoiding the intermediate tensor and dispatch overhead of separate PyTorch operations.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/activation.cpp
- Lines: 1-135
Signature
// Internal template (anonymous namespace)
template <typename scalar_t, typename func_t, typename vec_func_t>
void act_and_mul_kernel_impl(
    scalar_t* __restrict__ output,
    const scalar_t* __restrict__ input,
    int64_t num_tokens,
    int64_t dim,
    const func_t& f,
    const vec_func_t& vf);
// Public API
at::Tensor silu_and_mul_cpu(at::Tensor& input);
at::Tensor gelu_tanh_and_mul_cpu(const at::Tensor& input);
at::Tensor gelu_and_mul_cpu(const at::Tensor& input);
Import
#include "common.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | at::Tensor | Yes | Input tensor of shape [num_tokens, 2*d] with bfloat16 or float16 dtype |
Outputs
| Name | Type | Description |
|---|---|---|
| output | at::Tensor | Result tensor of shape [num_tokens, d] with same dtype as input |
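The shape relationship in the tables above can be made concrete with a small helper. This is illustrative only; act_and_mul_out_shape is not part of the source API, and the real wrappers derive the output shape directly from the input tensor's sizes.

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>

// Illustrative contract check: the last dimension of the input must be
// even, since it packs [activation half | gate half]. Returns the
// {num_tokens, d} shape of the output tensor.
std::pair<int64_t, int64_t> act_and_mul_out_shape(int64_t num_tokens,
                                                  int64_t packed_dim) {
  if (packed_dim % 2 != 0) {
    throw std::invalid_argument("input last dim must be 2*d");
  }
  return {num_tokens, packed_dim / 2};
}
```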
Usage Examples
// Called from PyTorch C++ extension:
// SiLU gated activation
at::Tensor input = /* shape [batch, 2 * hidden_dim] */;
at::Tensor output = silu_and_mul_cpu(input);
// output shape: [batch, hidden_dim]
// GELU-tanh gated activation
at::Tensor output_gelu = gelu_tanh_and_mul_cpu(input);
// Standard GELU gated activation
at::Tensor output_gelu_std = gelu_and_mul_cpu(input);