# Implementation: Sgl_project_Sglang CPU Vec SIMD
| Knowledge Sources | Value |
|---|---|
| Domains | CPU_Inference, SIMD_Vectorization, Type_Conversion |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview
SIMD-optimized vectorized utility header providing type conversion helpers and AVX-512 intrinsics for BFloat16, Float16, and FP8 (E4M3) data types used by CPU kernels.
## Description
The vec.h header builds on PyTorch's at::vec::Vectorized abstraction to provide specialized conversion routines for reduced-precision data types. It defines the CPU_CAPABILITY_AVX512 macro when AVX-512F, AVX-512BF16, and AMX-BF16 are available. The file provides three categories of functionality:
Generic conversion helpers: convert_from_float_ext converts float vectors back to reduced-precision types (BF16/FP16), and load_float_vec2 loads reduced-precision or float data into pairs of float vectors.
AVX-512 native conversions: A template specialization of convert_from_float_ext for at::BFloat16 uses the native _mm512_cvtne2ps_pbh instruction for efficient FP32-to-BF16 conversion. Macros CVT_BF16_TO_FP32 and CVT_FP16_TO_FP32 provide single-instruction conversions.
FP8 (E4M3) to BF16 converters: Three strategies are provided -- cvt_e4m3_bf16_intrinsic_no_nan (fast path without NaN handling), cvt_e4m3_bf16_intrinsic_without_denorm (handles NaN but not denormals), and cvt_e4m3_bf16_intrinsic_with_denorm (full accuracy including denormalized values). Each uses low-level AVX-512 bit manipulation with shifts, masks, and blends to reinterpret the 8-bit floating-point encoding into BF16 format.
## Usage
Include this header in any CPU kernel implementation that needs efficient data type conversion between reduced-precision formats (BF16, FP16, FP8) and float32. It is a foundational utility used across all CPU kernel files in the sgl-kernel library.
## Code Reference
### Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/vec.h
- Lines: 1-378
### Signature
```cpp
// Generic float-to-reduced-precision conversion
template <typename scalar_t,
          typename std::enable_if_t<is_reduced_floating_point_v<scalar_t>, int> = 0>
inline Vectorized<scalar_t> convert_from_float_ext(
    const Vectorized<float>& a, const Vectorized<float>& b);

// Load reduced-precision data as a pair of float vectors
template <typename scalar_t,
          typename std::enable_if_t<is_reduced_floating_point_v<scalar_t>, int> = 1>
inline std::tuple<Vectorized<float>, Vectorized<float>>
load_float_vec2(const scalar_t* __restrict__ data);

// AVX-512 BF16 specialization
template <>
inline Vectorized<at::BFloat16> convert_from_float_ext<at::BFloat16>(
    const Vectorized<float>& a, const Vectorized<float>& b);

// FP8 E4M3 to BF16 converters (AVX-512)
inline __m512bh cvt_e4m3_bf16_intrinsic_no_nan(__m256i fp8_vec);
inline __m512bh cvt_e4m3_bf16_intrinsic_without_denorm(__m256i fp8_vec);
inline __m512bh cvt_e4m3_bf16_intrinsic_with_denorm(__m256i fp8_vec);
```
### Import
```cpp
#include "vec.h"

// Underlying dependencies:
#include <ATen/cpu/vec/functional.h>
#include <ATen/cpu/vec/vec.h>
```
## I/O Contract
### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| a, b | Vectorized<float> | Yes | Float vector pair to convert to reduced precision |
| data | const scalar_t* | Yes | Pointer to reduced-precision data to load as float vectors |
| fp8_vec | __m256i | Yes | 32 packed FP8 E4M3 values for conversion to BF16 |
### Outputs
| Name | Type | Description |
|---|---|---|
| result | Vectorized<scalar_t> | Converted reduced-precision vector (BF16 or FP16) |
| (x0, x1) | std::tuple<Vectorized<float>, Vectorized<float>> | Pair of float vectors loaded from reduced-precision data |
| bf16_vec | __m512bh | 32 BF16 values converted from FP8 E4M3 input |
## Usage Examples
```cpp
// Load BF16 data as a float vector pair
const at::BFloat16* bf16_data = ...;
auto [vec0, vec1] = load_float_vec2(bf16_data);

// Convert the float vectors back to BF16
auto bf16_result = convert_from_float_ext<at::BFloat16>(vec0, vec1);

// Convert 32 packed FP8 E4M3 values to BF16 using AVX-512 (fast path, no NaN handling)
__m256i fp8_packed = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(fp8_data));
__m512bh bf16_converted = cvt_e4m3_bf16_intrinsic_no_nan(fp8_packed);
```