# Implementation: Sgl_project_Sglang CPU Vec SIMD
| Knowledge Sources | Value |
|---|---|
| Domains | CPU_Inference, SIMD_Vectorization, Type_Conversion |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview
SIMD-optimized vectorized utility header providing type conversion helpers and AVX-512 intrinsics for BFloat16, Float16, and FP8 (E4M3) data types used by CPU kernels.
## Description
The vec.h header builds on PyTorch's at::vec::Vectorized abstraction to provide specialized conversion routines for reduced-precision data types. It defines the CPU_CAPABILITY_AVX512 macro when AVX-512F, AVX-512BF16, and AMX-BF16 are available. The file provides three categories of functionality:
Generic conversion helpers: convert_from_float_ext converts float vectors back to reduced-precision types (BF16/FP16), and load_float_vec2 loads reduced-precision or float data into pairs of float vectors.
AVX-512 native conversions: A template specialization of convert_from_float_ext for at::BFloat16 uses the native _mm512_cvtne2ps_pbh instruction for efficient FP32-to-BF16 conversion. Macros CVT_BF16_TO_FP32 and CVT_FP16_TO_FP32 provide single-instruction conversions.
FP8 (E4M3) to BF16 converters: Three strategies are provided -- cvt_e4m3_bf16_intrinsic_no_nan (fast path without NaN handling), cvt_e4m3_bf16_intrinsic_without_denorm (handles NaN but not denormals), and cvt_e4m3_bf16_intrinsic_with_denorm (full accuracy including denormalized values). Each uses low-level AVX-512 bit manipulation with shifts, masks, and blends to reinterpret the 8-bit floating-point encoding into BF16 format.
## Usage
Include this header in any CPU kernel implementation that needs efficient data type conversion between reduced-precision formats (BF16, FP16, FP8) and float32. It is a foundational utility used across all CPU kernel files in the sgl-kernel library.
## Code Reference
### Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/vec.h
- Lines: 1-378
### Signature
```cpp
// Generic float-to-reduced-precision conversion
template <typename scalar_t,
          typename std::enable_if_t<is_reduced_floating_point_v<scalar_t>, int> = 0>
inline Vectorized<scalar_t> convert_from_float_ext(
    const Vectorized<float>& a, const Vectorized<float>& b);

// Load reduced-precision data as a pair of float vectors
template <typename scalar_t,
          typename std::enable_if_t<is_reduced_floating_point_v<scalar_t>, int> = 1>
inline std::tuple<Vectorized<float>, Vectorized<float>>
load_float_vec2(const scalar_t* __restrict__ data);

// AVX-512 BF16 specialization
template <>
inline Vectorized<at::BFloat16> convert_from_float_ext<at::BFloat16>(
    const Vectorized<float>& a, const Vectorized<float>& b);

// FP8 E4M3 to BF16 converters (AVX-512)
inline __m512bh cvt_e4m3_bf16_intrinsic_no_nan(__m256i fp8_vec);
inline __m512bh cvt_e4m3_bf16_intrinsic_without_denorm(__m256i fp8_vec);
inline __m512bh cvt_e4m3_bf16_intrinsic_with_denorm(__m256i fp8_vec);
```
### Import
```cpp
#include "vec.h"

// Underlying dependencies:
#include <ATen/cpu/vec/functional.h>
#include <ATen/cpu/vec/vec.h>
```
## I/O Contract
### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| a, b | Vectorized<float> | Yes | Float vector pair to convert to reduced precision |
| data | const scalar_t* | Yes | Pointer to reduced-precision data to load as float vectors |
| fp8_vec | __m256i | Yes | 32 packed FP8 E4M3 values for conversion to BF16 |
### Outputs
| Name | Type | Description |
|---|---|---|
| result | Vectorized<scalar_t> | Converted reduced-precision vector (BF16 or FP16) |
| (x0, x1) | std::tuple<Vectorized<float>, Vectorized<float>> | Pair of float vectors loaded from reduced-precision data |
| bf16_vec | __m512bh | 32 BF16 values converted from FP8 E4M3 input |
## Usage Examples
```cpp
// Load BF16 data as a float vector pair
const at::BFloat16* bf16_data = ...;
auto [vec0, vec1] = load_float_vec2(bf16_data);

// Convert the float vectors back to BF16
auto bf16_result = convert_from_float_ext<at::BFloat16>(vec0, vec1);

// Convert 32 packed FP8 E4M3 values to BF16 using AVX-512 (fast path, no NaN handling)
__m256i fp8_packed = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(fp8_data));
__m512bh bf16_converted = cvt_e4m3_bf16_intrinsic_no_nan(fp8_packed);
```