
Implementation:Vllm project Vllm SGL Vec

From Leeroopedia


Knowledge Sources
Domains SIMD Vectorization, Data Type Conversion
Last Updated 2026-02-08 00:00 GMT

Overview

Defines SGLang vector type conversion utilities, including converters between FP32 and the reduced-precision FP8, FP16, and BF16 formats, implemented with AVX512 intrinsics for CPU-optimized kernel operations.

Description

This header provides low-level SIMD conversion routines critical for quantized inference on CPU. It includes convert_from_float_ext for efficient float-to-reduced-precision conversion using native AVX512-BF16 instructions; three FP8 (E4M3) to BF16 conversion variants (cvt_e4m3_bf16_intrinsic_no_nan, cvt_e4m3_bf16_intrinsic_with_denorm, cvt_e4m3_bf16_intrinsic_without_denorm) that, as their names suggest, differ in how NaN and denormal inputs are handled; and horizontal reduction functions (vec_reduce_sum, vec_reduce_max) that collapse a float vector to a single scalar. It also provides quantize_row_int8 for dynamic per-row INT8 quantization used in the w8a8 MoE kernels.
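As background for the conversion variants above, the E4M3 encoding itself can be illustrated with a hypothetical scalar decoder (this sketch is not taken from the header; the AVX512 intrinsics perform the equivalent bit manipulation on 32 lanes at once):

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical scalar reference for FP8 E4M3 (fn variant) -> float decoding.
// Layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
inline float e4m3_to_float(uint8_t v) {
  const float sign = (v & 0x80) ? -1.0f : 1.0f;
  const int exp = (v >> 3) & 0xF;
  const int man = v & 0x7;
  if (exp == 0xF && man == 0x7)  // all-ones is NaN; e4m3fn has no infinities
    return std::numeric_limits<float>::quiet_NaN();
  if (exp == 0)                  // denormal: man * 2^(1-7) / 8 = man * 2^-9
    return sign * static_cast<float>(man) * 0x1p-9f;
  return sign * (1.0f + man / 8.0f) * std::ldexp(1.0f, exp - 7);
}
```

The NaN and denormal branches are exactly the cases the _no_nan / _with_denorm / _without_denorm variants choose to handle or skip.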

Usage

This header is included by the SGL-kernels MoE implementations (moe_fp8.cpp and moe_int8.cpp). It is compiled when building the vLLM CPU extension with AVX512 support enabled.

Code Reference

Source Location

Signature

// Float to reduced precision conversion (with AVX512-BF16 specialization)
template <typename scalar_t>
inline Vectorized<scalar_t> convert_from_float_ext(
    const Vectorized<float>& a, const Vectorized<float>& b);

// FP8 E4M3 to BF16 conversion (multiple variants)
inline __m512bh cvt_e4m3_bf16_intrinsic_no_nan(__m256i fp8_vec);
inline __m512bh cvt_e4m3_bf16_intrinsic_without_denorm(__m256i fp8_vec);
inline __m512bh cvt_e4m3_bf16_intrinsic_with_denorm(__m256i fp8_vec);
inline __m512bh CVT_FP8_TO_BF16(__m256i a);

// Horizontal reductions (vector to scalar)
inline float vec_reduce_sum(const Vectorized<float>& a);
inline float vec_reduce_max(const Vectorized<float>& a);

// Dynamic INT8 row quantization
template <typename scalar_t>
inline void quantize_row_int8(uint8_t* __restrict__ Aq, float& As,
    const scalar_t* __restrict__ A, int64_t K, float eps = 1e-7);

Import

#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
fp8_vec __m256i Yes 32 packed FP8 E4M3 values for conversion to BF16
a, b Vectorized<float> Yes Two float vectors to convert to reduced precision scalar_t
A const scalar_t* Yes Input row data for INT8 quantization
K int64_t Yes Number of elements in the row to quantize

Outputs

Name Type Description
(return) __m512bh 32 BF16 values converted from FP8 input
(return) Vectorized<scalar_t> Reduced precision vector converted from float pairs
Aq uint8_t* Quantized INT8 output row
As float& Computed quantization scale for the row
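To make the Aq/As contract above concrete, here is a hedged scalar sketch of dynamic symmetric per-row quantization using an amax/127 scale with an eps floor. It illustrates the assumed scheme only, not the kernel's AVX512 code, and it writes signed int8 values where the real signature takes a uint8_t buffer:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative scalar version of dynamic per-row INT8 quantization
// (assumed symmetric amax/127 scheme; the header's AVX512 kernel may differ).
inline void quantize_row_int8_ref(int8_t* q, float& scale,
                                  const float* a, int64_t k,
                                  float eps = 1e-7f) {
  float amax = 0.0f;                     // per-row absolute maximum
  for (int64_t i = 0; i < k; ++i)
    amax = std::max(amax, std::fabs(a[i]));
  scale = std::max(amax, eps) / 127.0f;  // eps floor guards all-zero rows
  const float inv = 1.0f / scale;
  for (int64_t i = 0; i < k; ++i)        // round-to-nearest quantization
    q[i] = static_cast<int8_t>(std::lround(a[i] * inv));
}
```

The out-parameter scale corresponds to As in the table: the dequantized value is recovered as q[i] * scale.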

Usage Examples

// Convert 32 packed FP8 weights to BF16 for GEMM computation
__m256i fp8_data = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(fp8_ptr));
__m512bh bf16_data = CVT_FP8_TO_BF16(fp8_data);

// Convert a pair of float vectors back to BFloat16
Vectorized<float> f0, f1;  // assumed filled by earlier computation
auto bf16_vec = convert_from_float_ext<at::BFloat16>(f0, f1);

// Quantize a K-element row to INT8 with a dynamically computed scale
float scale;
quantize_row_int8<at::BFloat16>(quant_buf, scale, input_row, K);
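On the float-to-BF16 side, the AVX512-BF16 path presumably relies on the round-to-nearest-even conversion instructions (e.g. _mm512_cvtne2ps_pbh). A scalar sketch of that per-lane rounding, for illustration only, with NaN handling omitted:

```cpp
#include <cstdint>
#include <cstring>

// Scalar sketch of float -> BF16 with round-to-nearest-even, the per-lane
// behavior of the AVX512-BF16 conversion instructions (NaN inputs would need
// extra handling that is omitted here for brevity).
inline uint16_t float_to_bf16_rne(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));  // type-pun without UB
  // Add 0x7FFF plus the lowest kept bit so ties round to an even result.
  const uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<uint16_t>((bits + rounding) >> 16);
}
```

BF16 simply keeps the top 16 bits of the FP32 encoding, which is why the conversion reduces to a rounding add and a shift.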
