
Implementation:Vllm project Vllm SGL GEMM FP8

From Leeroopedia


Knowledge Sources
Domains: CPU_Inference, GEMM, Quantization, FP8
Last Updated: 2026-02-08 00:00 GMT

Overview

Implements FP8 E4M3 to BF16 online dequantization and weight-only FP8 GEMM (w8a16) using AVX-512 intrinsics for CPU inference.

Description

This file provides the FP8 weight-only inference pathway where weights are stored in Float8_e4m3fn format and activations remain in BFloat16 or Half. The unpack_B function performs online dequantization of FP8 weights to BF16 with per-block scaling, using AVX-512 exponent/mantissa manipulation via the CVT_FP8_TO_BF16 macro. The tinygemm_kernel_nn template then computes the actual GEMM using AMX BF16 dot-product instructions on the dequantized tiles.
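For intuition, the exponent/mantissa manipulation can be sketched in scalar C++; the kernel's CVT_FP8_TO_BF16 macro performs the equivalent transform 32 lanes at a time with AVX-512 intrinsics. The function names below are illustrative, not taken from the kernel.

```cpp
#include <cstdint>
#include <cstring>
#include <cmath>

// Scalar sketch of FP8 E4M3fn -> float decoding (names hypothetical).
// E4M3fn layout: 1 sign | 4 exponent (bias 7) | 3 mantissa; no infinities,
// exponent=0b1111 with mantissa=0b111 is NaN.
float fp8_e4m3_to_float(uint8_t v) {
    int sign = (v >> 7) & 1;
    int exp  = (v >> 3) & 0xF;
    int mant = v & 0x7;
    float out;
    if (exp == 0xF && mant == 0x7) {          // NaN (e4m3fn has no inf)
        out = NAN;
    } else if (exp == 0) {                    // subnormal: mant * 2^-9
        out = std::ldexp((float)mant, -9);
    } else {                                  // normal: (1 + mant/8) * 2^(exp - 7)
        out = std::ldexp(1.0f + mant / 8.0f, exp - 7);
    }
    return sign ? -out : out;
}

// Narrow a float to BF16 storage bits (round-to-nearest-even on the
// discarded low 16 bits; finite inputs only).
uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t rounded = bits + 0x7FFF + ((bits >> 16) & 1);
    return (uint16_t)(rounded >> 16);
}
```

Because every E4M3fn value is exactly representable in BF16's wider exponent range, the vectorized path can move the FP8 exponent and mantissa bits directly into BF16 positions with a bias adjustment, avoiding a full float round-trip.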

The fp8_scaled_mm_cpu public API supports block-wise quantization scales (with configurable block_size_N and block_size_K), optional bias, and automatic VNNI weight packing. It allocates per-thread temporary buffers for the dequantized weight tiles and FP32 accumulators.
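The block-wise scale indexing can be sketched as a naive reference, assuming row-major buffers and weights already decoded from FP8 to float for brevity; the real kernel instead dequantizes FP8 tiles into the Btmp scratch buffer and runs AMX BF16 dot products, accumulating in the FP32 Ctmp buffer.

```cpp
#include <vector>
#include <cstdint>

// Hypothetical reference for the block-scaled w8a16 semantics:
// out[m][n] = bias[n] + sum_k A[m][k] * (B[n][k] * scale[n/block_N][k/block_K])
std::vector<float> ref_scaled_mm(
    const std::vector<float>& A,      // [M, K] activations
    const std::vector<float>& B,      // [N, K] decoded FP8 weight codes
    const std::vector<float>& scales, // [N/block_N, K/block_K] row-major
    const float* bias,                // [N], or nullptr for no bias
    int64_t M, int64_t N, int64_t K,
    int64_t block_N, int64_t block_K) {
  int64_t k_blocks = K / block_K;
  std::vector<float> out(M * N, 0.0f);
  for (int64_t m = 0; m < M; ++m) {
    for (int64_t n = 0; n < N; ++n) {
      float acc = bias ? bias[n] : 0.0f;
      for (int64_t k = 0; k < K; ++k) {
        // One scale per (block_N x block_K) tile of the weight matrix.
        float s = scales[(n / block_N) * k_blocks + (k / block_K)];
        acc += A[m * K + k] * (B[n * K + k] * s);
      }
      out[m * N + n] = acc;
    }
  }
  return out;
}
```

Grouping the K loop by block, as the kernel does, lets the scale be applied once per dequantized tile rather than once per element.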

Usage

This code is compiled as part of the vLLM SGL-kernels CPU extension. It is invoked for FP8 quantized model inference on CPU, reducing memory bandwidth requirements by storing weights in 8-bit format while computing in BF16 precision.

Code Reference

Source Location

Signature

inline void unpack_B(
    at::BFloat16* Btmp,
    const at::Float8_e4m3fn* packed_B,
    int N, int K, int ldb, int ldb_tmp, float scale);

at::Tensor fp8_scaled_mm_cpu(
    at::Tensor& mat1,
    at::Tensor& mat2,
    at::Tensor& scales2,
    std::vector<int64_t> block_size,
    std::optional<at::Tensor>& bias,
    at::ScalarType out_dtype,
    bool is_vnni);

template <typename scalar_t>
void tinygemm_kernel(
    const scalar_t* A,
    const at::Float8_e4m3fn* B,
    scalar_t* C,
    scalar_t* Btmp,
    float* Ctmp,
    const float* scale,
    int64_t M, int64_t N, int64_t K,
    int64_t lda, int64_t ldb, int64_t ldc,
    bool brg, int64_t block_size_K);

Import

#include "common.h"
#include "vec.h"
#include "gemm.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| mat1 | at::Tensor [M, K] | Yes | Activation matrix in BFloat16 or Half |
| mat2 | at::Tensor [N, K] | Yes | Weight matrix in Float8_e4m3fn format |
| scales2 | at::Tensor [N/block_size_N, K/block_size_K] | Yes | Per-block quantization scales for the FP8 weights (float32) |
| block_size | std::vector<int64_t> {block_size_N, block_size_K} | Yes | Block dimensions for blockwise quantization; block_size_K must equal BLOCK_K |
| bias | at::Tensor [N] (float32) | No | Optional bias vector added after the scaled matmul |
| out_dtype | at::ScalarType | Yes | Output data type (must match mat1 dtype) |
| is_vnni | bool | Yes | Whether mat2 is already in VNNI-packed format |
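As a rough illustration of what is_vnni refers to, the sketch below shows a 2-wide VNNI interleave for 16-bit operands: pairs of adjacent K elements are stored contiguously so one dot-product instruction can read both at once. This is an assumption-laden sketch of the general layout idea, not the kernel's exact packing routine.

```cpp
#include <vector>
#include <cstdint>

// Hypothetical 2-wide VNNI re-layout for 16-bit operands (illustrative only;
// the kernel's actual packed layout for FP8 weights may differ).
// Input  B     : [N, K] row-major; K assumed even.
// Output B_vnni: [K/2, N, 2] row-major.
std::vector<uint16_t> pack_vnni2(const std::vector<uint16_t>& B,
                                 int64_t N, int64_t K) {
  std::vector<uint16_t> out(N * K);
  for (int64_t k = 0; k < K; k += 2) {
    for (int64_t n = 0; n < N; ++n) {
      out[(k / 2) * N * 2 + n * 2 + 0] = B[n * K + k];
      out[(k / 2) * N * 2 + n * 2 + 1] = B[n * K + k + 1];
    }
  }
  return out;
}
```

With is_vnni=false, fp8_scaled_mm_cpu performs the packing itself before dispatching to the GEMM; passing pre-packed weights skips that step on every call.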

Outputs

| Name | Type | Description |
|---|---|---|
| out | at::Tensor [M, N] | Result of the FP8 scaled matrix multiplication in out_dtype |

Usage Examples

// FP8 weight-only scaled matmul
at::Tensor output = fp8_scaled_mm_cpu(
    activations,     // [M, K] BFloat16
    fp8_weights,     // [N, K] Float8_e4m3fn
    scales,          // [N/block_N, K/block_K] float32
    {128, 128},      // block_size = {block_size_N, block_size_K}
    bias,            // optional [N] float32
    at::kBFloat16,   // output dtype
    /*is_vnni=*/false);
