Implementation: vLLM SGL-Kernels FP8 GEMM
| Knowledge Sources | |
|---|---|
| Domains | CPU_Inference, GEMM, Quantization, FP8 |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements FP8 E4M3 to BF16 online dequantization and weight-only FP8 GEMM (w8a16) using AVX-512 intrinsics for CPU inference.
Description
This file provides the FP8 weight-only inference pathway where weights are stored in Float8_e4m3fn format and activations remain in BFloat16 or Half. The unpack_B function performs online dequantization of FP8 weights to BF16 with per-block scaling, using AVX-512 exponent/mantissa manipulation via the CVT_FP8_TO_BF16 macro. The tinygemm_kernel_nn template then computes the actual GEMM using AMX BF16 dot-product instructions on the dequantized tiles.
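The exponent/mantissa manipulation that CVT_FP8_TO_BF16 vectorizes can be illustrated with a scalar sketch. The helper below is illustrative only (it is not taken from gemm_fp8.cpp) and assumes the standard Float8_e4m3fn encoding: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, and a single NaN pattern.
// Scalar reference for decoding Float8_e4m3fn to float (sign 1 | exponent 4, bias 7 | mantissa 3).
// Illustrative sketch of what the vectorized CVT_FP8_TO_BF16 path computes; not the actual
// AVX-512 implementation in gemm_fp8.cpp.
#include <cmath>
#include <cstdint>
#include <limits>

inline float fp8_e4m3fn_to_float(uint8_t v) {
    const int sign = (v >> 7) & 0x1;
    const int exp  = (v >> 3) & 0xF;
    const int man  =  v       & 0x7;
    float r;
    if (exp == 0xF && man == 0x7) {
        r = std::numeric_limits<float>::quiet_NaN();          // e4m3fn: NaN encoding, no infinities
    } else if (exp == 0) {
        r = std::ldexp(static_cast<float>(man) / 8.0f, -6);   // subnormal: 2^-6 * (m / 8)
    } else {
        r = std::ldexp(1.0f + static_cast<float>(man) / 8.0f, exp - 7);  // normal: 2^(e-7) * (1 + m/8)
    }
    return sign ? -r : r;
}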
The fp8_scaled_mm_cpu public API supports block-wise quantization scales (with configurable block_size_N and block_size_K), optional bias, and automatic VNNI weight packing. It allocates per-thread temporary buffers for the dequantized weight tiles and FP32 accumulators.
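The blockwise-scale semantics can be summarized with a naive reference loop: every [block_size_N x block_size_K] tile of the weight matrix shares one scale, which multiplies the dequantized FP8 value before it enters the dot product. The function below is a plain-C++ sketch of those semantics only (it reuses the fp8_e4m3fn_to_float helper above and assumes N and K are multiples of the block sizes); the real kernel instead dequantizes whole tiles into a BF16 scratch buffer and runs the AMX/brgemm path on them.
// Naive reference for the blockwise w8a16 matmul semantics (sketch only, not the kernel).
#include <cstdint>
#include <vector>

void fp8_scaled_mm_ref(const std::vector<float>& A,      // [M, K] activations (fp32 here for clarity)
                       const std::vector<uint8_t>& B,    // [N, K] raw Float8_e4m3fn bytes
                       const std::vector<float>& S,      // [N / block_N, K / block_K] scales
                       std::vector<float>& C,            // [M, N] output
                       int64_t M, int64_t N, int64_t K,
                       int64_t block_N, int64_t block_K) {
    const int64_t k_blocks = K / block_K;
    for (int64_t m = 0; m < M; ++m) {
        for (int64_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int64_t k = 0; k < K; ++k) {
                const float scale = S[(n / block_N) * k_blocks + (k / block_K)];
                const float w = fp8_e4m3fn_to_float(B[n * K + k]) * scale;  // online dequantization
                acc += A[m * K + k] * w;
            }
            C[m * N + n] = acc;
        }
    }
}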
Usage
This code is compiled as part of the vLLM SGL-kernels CPU extension. It is invoked for FP8 quantized model inference on CPU, reducing memory bandwidth requirements by storing weights in 8-bit format while computing in BF16 precision.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/sgl-kernels/gemm_fp8.cpp
- Lines: 1-530
Signature
inline void unpack_B(
at::BFloat16* Btmp,
const at::Float8_e4m3fn* packed_B,
int N, int K, int ldb, int ldb_tmp, float scale);
at::Tensor fp8_scaled_mm_cpu(
at::Tensor& mat1,
at::Tensor& mat2,
at::Tensor& scales2,
std::vector<int64_t> block_size,
std::optional<at::Tensor>& bias,
at::ScalarType out_dtype,
bool is_vnni);
template <typename scalar_t>
void tinygemm_kernel(
const scalar_t* A,
const at::Float8_e4m3fn* B,
scalar_t* C,
scalar_t* Btmp,
float* Ctmp,
const float* scale,
int64_t M, int64_t N, int64_t K,
int64_t lda, int64_t ldb, int64_t ldc,
bool brg, int64_t block_size_K);
Import
#include "common.h"
#include "vec.h"
#include "gemm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mat1 | at::Tensor [M, K] | Yes | Activation matrix in BFloat16 or Half |
| mat2 | at::Tensor [N, K] | Yes | Weight matrix in Float8_e4m3fn format |
| scales2 | at::Tensor [N/block_size_N, K/block_size_K] | Yes | Per-block quantization scales for the FP8 weights (float32) |
| block_size | std::vector<int64_t> {block_size_N, block_size_K} | Yes | Block dimensions for blockwise quantization; block_size_K must equal BLOCK_K |
| bias | at::Tensor [N] (float) | No | Optional bias vector added after the scaled matmul |
| out_dtype | at::ScalarType | Yes | Output data type (must match mat1 dtype) |
| is_vnni | bool | Yes | Whether mat2 is already in VNNI-packed format |
Outputs
| Name | Type | Description |
|---|---|---|
| out | at::Tensor [M, N] | Result of the FP8 scaled matrix multiplication in out_dtype |
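For concreteness, the shapes above fit together as follows for one hypothetical layer size (the numbers are illustrative, not taken from the source); the scale grid has one entry per [block_size_N x block_size_K] weight tile.
// Illustrative shape bookkeeping only (sizes are made up).
// mat1    : [M, K]                                e.g. [   4, 7168]  BFloat16
// mat2    : [N, K]                                e.g. [4096, 7168]  Float8_e4m3fn
// block   : {block_size_N, block_size_K}          e.g. { 128,  128}
// scales2 : [N / block_size_N, K / block_size_K]  e.g. [  32,   56]  Float32
// out     : [M, N]                                e.g. [   4, 4096]  out_dtype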
Usage Examples
// FP8 weight-only scaled matmul
std::optional<at::Tensor> bias = std::nullopt;  // or an [N] float32 tensor
at::Tensor output = fp8_scaled_mm_cpu(
    activations,      // [M, K] BFloat16
    fp8_weights,      // [N, K] Float8_e4m3fn
    scales,           // [N/block_N, K/block_K] float32
    {128, 128},       // block_size = {block_size_N, block_size_K}
    bias,             // optional bias, passed as std::optional<at::Tensor>
    at::kBFloat16,    // output dtype (must match activation dtype)
    /*is_vnni=*/false);
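Weights and scales in the layout expected above are typically produced offline. The snippet below is a hedged sketch using the ATen C++ API; the helper name quantize_fp8_blockwise and the absmax-to-448 scaling rule are illustrative assumptions, not part of gemm_fp8.cpp.
// Hypothetical weight-prep sketch (not part of this file): one absmax-derived scale per
// [block_N, block_K] tile, chosen so the tile's largest magnitude maps near the e4m3fn max (448).
#include <ATen/ATen.h>
#include <utility>

std::pair<at::Tensor, at::Tensor> quantize_fp8_blockwise(
    const at::Tensor& weight, int64_t block_N, int64_t block_K) {
  const int64_t N = weight.size(0), K = weight.size(1);
  auto w = weight.to(at::kFloat).reshape({N / block_N, block_N, K / block_K, block_K});
  auto amax = w.abs().amax({1, 3}, /*keepdim=*/true);             // [N/bN, 1, K/bK, 1]
  auto scales = (amax / 448.0).clamp_min(1e-12);                  // dequant rule: w ~ fp8 * scale
  auto w_fp8 = (w / scales).clamp(-448.0, 448.0).to(at::kFloat8_e4m3fn);
  return {w_fp8.reshape({N, K}).contiguous(),
          scales.reshape({N / block_N, K / block_K}).to(at::kFloat).contiguous()};
}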