Principle:Bitsandbytes foundation Bitsandbytes FP8 Simulated Quantization Matmul

Knowledge Sources	FP8 Formats for Deep Learning; 8-bit Optimizers via Block-wise Quantization
Domains	Quantization, Research, Training
Last Updated	2026-02-07 13:31 GMT

Overview

A simulated FP8 training technique that quantizes activations and weights to 8-bit floating point representations before performing matrix multiplication, enabling research into low-precision training without dedicated hardware support.

Description

FP8 (8-bit floating point) formats are emerging as a key enabler of efficient deep learning training. Two standard formats exist: E4M3 (1 sign bit, 4 exponent bits, 3 mantissa bits), which offers higher precision and is typically used for the forward pass, and E5M2 (1 sign bit, 5 exponent bits, 2 mantissa bits), which offers wider dynamic range and is typically used for the backward pass. This principle simulates FP8 training by quantizing tensors to FP8 with lookup-table-based codebooks, then immediately dequantizing them back to higher precision before performing a standard floating-point matmul. The matmul itself therefore runs in higher precision, but its inputs carry FP8 rounding error, which lets researchers study the numerical effects of FP8 quantization without hardware FP8 tensor cores.
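To make the codebook idea concrete, here is a minimal NumPy sketch of a lookup-table FP8 round trip. It is an illustration under stated assumptions, not the library's implementation: the helper names (`fp8_codebook`, `quantize`, `dequantize`, `dynamic_range`) are hypothetical, every bit pattern is treated as a finite number (the inf/NaN encodings of the real formats are ignored), and the table is normalized to [-1, 1] so that it can be combined with an absmax scale.

```python
import numpy as np

def fp8_codebook(exp_bits, man_bits):
    """Enumerate a sign/exponent/mantissa 8-bit float grid, scaled to [-1, 1].

    Assumption: all bit patterns are finite values; the special inf/NaN
    encodings of the real E4M3/E5M2 formats are ignored in this sketch.
    """
    assert 1 + exp_bits + man_bits == 8
    bias = 2 ** (exp_bits - 1) - 1
    mags = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:   # subnormal: no implicit leading 1
                mags.append(frac * 2.0 ** (1 - bias))
            else:        # normal: implicit leading 1
                mags.append((1.0 + frac) * 2.0 ** (e - bias))
    mags = np.array(mags)
    code = np.unique(np.concatenate([-mags, mags]))  # sorted, +0 and -0 merged
    return code / code.max()

def quantize(x, code):
    """Map each element to the index of its nearest codebook entry."""
    scale = float(np.abs(x).max()) or 1.0  # global absmax scale
    idx = np.abs((x / scale)[..., None] - code).argmin(axis=-1)
    return idx.astype(np.uint8), scale

def dequantize(idx, scale, code):
    """Look the indices back up in the table and undo the scaling."""
    return code[idx] * scale

def dynamic_range(code):
    """Ratio of the largest to the smallest positive representable value."""
    pos = code[code > 0]
    return pos.max() / pos.min()

e4m3 = fp8_codebook(4, 3)  # more mantissa bits: finer steps
e5m2 = fp8_codebook(5, 2)  # more exponent bits: wider range

x = np.random.default_rng(0).normal(size=1000)
idx, scale = quantize(x, e4m3)
x_hat = dequantize(idx, scale, e4m3)

print("codebook sizes:", len(e4m3), len(e5m2))
print("dynamic range (E4M3, E5M2):", dynamic_range(e4m3), dynamic_range(e5m2))
print("max abs round-trip error:", np.abs(x - x_hat).max())
```

In this sketch the E5M2 table spans a much larger ratio between its largest and smallest positive entries, while the E4M3 table has smaller gaps between neighbouring entries, which matches the precision versus range trade-off described above.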

Two quantization granularity strategies are studied: mixed quantization (blockwise for activations, global for weights) and global quantization (single scaling factor per tensor for both operands).
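The difference between the two granularities can be sketched in a few lines of NumPy. This is a hedged illustration, not the library's code: `fp8_code`, `fake_quant`, `fake_quant_global`, and `fake_quant_blockwise` are hypothetical helpers, the codebook ignores inf/NaN encodings, and the block size of 64 and the injected outlier are arbitrary choices made for the demonstration.

```python
import numpy as np

def fp8_code(exp_bits=4, man_bits=3):
    """Simplified FP8 value table scaled to [-1, 1] (inf/NaN encodings ignored)."""
    bias = 2 ** (exp_bits - 1) - 1
    e = np.arange(2 ** exp_bits)[:, None]
    m = np.arange(2 ** man_bits)[None, :] / 2 ** man_bits
    mag = np.where(e == 0, m * 2.0 ** (1 - bias), (1 + m) * 2.0 ** (e - bias)).ravel()
    code = np.unique(np.concatenate([-mag, mag]))
    return code / code.max()

def fake_quant(groups, code):
    """Quantize then dequantize; each row of `groups` shares one absmax scale."""
    scale = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(scale == 0, 1.0, scale)
    idx = np.abs((groups / scale)[..., None] - code).argmin(axis=-1)
    return code[idx] * scale

def fake_quant_global(x, code):
    """One scaling factor for the whole tensor."""
    return fake_quant(x.reshape(1, -1), code).reshape(x.shape)

def fake_quant_blockwise(x, code, blocksize):
    """One scaling factor per contiguous block of `blocksize` elements."""
    assert x.size % blocksize == 0, "sketch assumes size divisible by blocksize"
    return fake_quant(x.reshape(-1, blocksize), code).reshape(x.shape)

code = fp8_code()
x = np.random.default_rng(0).normal(size=4096)
x[0] = 1e5  # a single large outlier (arbitrary, for illustration)

err_global = np.abs(x - fake_quant_global(x, code)).mean()
err_block = np.abs(x - fake_quant_blockwise(x, code, blocksize=64)).mean()
print("mean abs error, global   :", err_global)
print("mean abs error, blockwise:", err_block)
```

With a global scale the outlier pushes every other value toward the coarse bottom of the table, whereas with blockwise scaling only the block that contains the outlier is affected.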

Usage

Apply this principle when investigating FP8 training numerics on hardware that lacks native FP8 support, or when comparing quantization granularity strategies (blockwise vs global) for their impact on model convergence and accuracy.

Theoretical Basis

The FP8 simulation follows a quantize-dequantize-compute pattern:

Forward pass:

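# Pseudo-code for mixed FP8 matmul (fp8_code is the forward-pass codebook)
A_fp8 = dequantize(quantize_blockwise(A, fp8_code, blocksize), blocksize)  # activations: one scale per block
B_fp8 = dequantize(quantize_global(B, fp8_code))  # weights: one scale for the whole tensor
output = matmul(A_fp8, B_fp8)  # ordinary higher-precision matmul on FP8-rounded inputs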
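The forward pseudo-code above can be turned into a small runnable NumPy sketch. It is only an approximation of the idea under stated assumptions: `fp8_code`, `fake_quant`, and `simulated_fp8_matmul` are hypothetical names, the codebook is a simplified E4M3 table without inf/NaN encodings, and the shapes and block size are arbitrary.

```python
import numpy as np

def fp8_code(exp_bits=4, man_bits=3):
    """Simplified FP8 value table scaled to [-1, 1] (inf/NaN encodings ignored)."""
    bias = 2 ** (exp_bits - 1) - 1
    e = np.arange(2 ** exp_bits)[:, None]
    m = np.arange(2 ** man_bits)[None, :] / 2 ** man_bits
    mag = np.where(e == 0, m * 2.0 ** (1 - bias), (1 + m) * 2.0 ** (e - bias)).ravel()
    code = np.unique(np.concatenate([-mag, mag]))
    return code / code.max()

def fake_quant(x, code, blocksize=None):
    """Quantize-dequantize x with absmax scaling.

    blocksize=None uses one global scale; otherwise one scale per
    contiguous block of `blocksize` elements.
    """
    if blocksize is None:
        groups = x.reshape(1, -1)
    else:
        assert x.size % blocksize == 0, "sketch assumes size divisible by blocksize"
        groups = x.reshape(-1, blocksize)
    scale = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(scale == 0, 1.0, scale)
    idx = np.abs((groups / scale)[..., None] - code).argmin(axis=-1)
    return (code[idx] * scale).reshape(x.shape)

def simulated_fp8_matmul(A, B, code, blocksize):
    """Mixed strategy: blockwise for activations A, global for weights B."""
    A_fp8 = fake_quant(A, code, blocksize)  # FP8-rounded, stored in float64
    B_fp8 = fake_quant(B, code)             # FP8-rounded, stored in float64
    return A_fp8 @ B_fp8                    # ordinary floating-point matmul

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 64))   # activations
B = rng.normal(size=(64, 16))   # weights
code = fp8_code()

out = simulated_fp8_matmul(A, B, code, blocksize=64)
exact = A @ B
rel_err = np.linalg.norm(out - exact) / np.linalg.norm(exact)
print("relative error vs. exact matmul:", rel_err)
```

Because both operands are dequantized before the product, the only difference from the exact result is the rounding error introduced by the two quantize-dequantize steps.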

Backward pass:

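# grad_output is quantized through the backward-pass FP8 codebook (bw_code)
grad_fp8 = dequantize(quantize(grad_output, bw_code))
grad_A = matmul(grad_fp8, B.T)  # activation gradient uses the FP8-rounded grad_output
grad_B = matmul(A.T, grad_output)  # weight gradient, as written, uses the unquantized grad_output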
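A NumPy sketch of the backward pseudo-code follows, written to mirror it line by line. The assumptions are the same as before and are worth stating plainly: `fp8_code`, `fake_quant_global`, and `simulated_fp8_backward` are hypothetical names, the backward codebook is a simplified E5M2 table without inf/NaN encodings, and, as in the pseudo-code, only the activation gradient sees the FP8-rounded `grad_output`.

```python
import numpy as np

def fp8_code(exp_bits, man_bits):
    """Simplified FP8 value table scaled to [-1, 1] (inf/NaN encodings ignored)."""
    bias = 2 ** (exp_bits - 1) - 1
    e = np.arange(2 ** exp_bits)[:, None]
    m = np.arange(2 ** man_bits)[None, :] / 2 ** man_bits
    mag = np.where(e == 0, m * 2.0 ** (1 - bias), (1 + m) * 2.0 ** (e - bias)).ravel()
    code = np.unique(np.concatenate([-mag, mag]))
    return code / code.max()

def fake_quant_global(x, code):
    """Quantize-dequantize x with a single absmax scale for the whole tensor."""
    scale = float(np.abs(x).max()) or 1.0
    idx = np.abs((x / scale)[..., None] - code).argmin(axis=-1)
    return code[idx] * scale

def simulated_fp8_backward(A, B, grad_output, bw_code):
    """Gradients of output = A @ B, following the pseudo-code above."""
    grad_fp8 = fake_quant_global(grad_output, bw_code)
    grad_A = grad_fp8 @ B.T      # uses the FP8-rounded gradient
    grad_B = A.T @ grad_output   # uses the unquantized gradient, as written
    return grad_A, grad_B

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 64))
B = rng.normal(size=(64, 16))
grad_output = rng.normal(size=(32, 16))
bw_code = fp8_code(5, 2)  # E5M2-style table for the backward pass

grad_A, grad_B = simulated_fp8_backward(A, B, grad_output, bw_code)
exact_A = grad_output @ B.T
rel_err_A = np.linalg.norm(grad_A - exact_A) / np.linalg.norm(exact_A)
print("relative error of grad_A:", rel_err_A)
print("grad_B matches exact gradient:", np.allclose(grad_B, A.T @ grad_output))
```

In this sketch `grad_B` is exact by construction, so any FP8 effect on the weight update enters only through the rest of the network's backward pass, not through this product.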

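The key insight is that blockwise quantization preserves local dynamic range better than global quantization: each block is scaled by its own maximum magnitude, so an outlier degrades precision only within its block rather than across the whole tensor. The cost is one additional scaling factor per block that must be computed and stored.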
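The cost of the extra scaling factors can be estimated with simple arithmetic. As a rough sketch, assume a tensor of $n$ elements, a block size $b$, and scaling factors stored in $s$ bits each (the symbols and the value $s = 32$ are assumptions for illustration, not taken from the source):

```latex
\text{bits per element (blockwise)} = 8 + \frac{s}{b},
\qquad
\text{bits per element (global)} = 8 + \frac{s}{n}

% Example with s = 32:
%   b = 64    ->  8 + 32/64   = 8.5      bits per element
%   b = 4096  ->  8 + 32/4096 \approx 8.008 bits per element
```

Smaller blocks track local dynamic range more closely but raise the per-element overhead; larger blocks approach the global case.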

Related Pages

Implementation:Bitsandbytes_foundation_Bitsandbytes_Research_FP8_Matmul
