Principle:Bitsandbytes foundation Bitsandbytes FP8 Simulated Quantization Matmul
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Research, Training |
| Last Updated | 2026-02-07 13:31 GMT |
Overview
A simulated FP8 training technique that quantizes activations and weights to 8-bit floating-point values and dequantizes them again before matrix multiplication, enabling research into low-precision training without dedicated hardware support.
Description
FP8 (8-bit floating-point) formats are emerging as a key enabler for efficient deep learning training. Two standard formats exist, each with one sign bit: E4M3 (4-bit exponent, 3-bit mantissa), which offers higher precision and is used for the forward pass, and E5M2 (5-bit exponent, 2-bit mantissa), which offers wider dynamic range and is used for the backward pass. This principle simulates FP8 training by quantizing tensors to FP8 using lookup-table-based codebooks, then immediately dequantizing them back to higher precision before performing a standard floating-point matmul. The matmul itself therefore runs in higher precision, but its inputs carry FP8 rounding error, which lets researchers study the numerical effects of FP8 quantization without requiring hardware FP8 tensor cores.
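To make the precision versus range trade-off concrete, the following sketch enumerates every value of a generic signed minifloat and normalizes the result into a lookup-table codebook in [-1, 1]. The helper names `minifloat_codebook` and `top_binade`, the normalization step, and the simplified layout (subnormals included, no bit patterns reserved for infinity or NaN) are illustrative assumptions, so the value counts differ slightly from the hardware FP8 formats.

```python
def minifloat_codebook(exp_bits, man_bits):
    """Return (sorted codebook normalized to [-1, 1], max / min-positive ratio).

    Assumption: IEEE-like layout with subnormals and no inf/NaN patterns.
    """
    bias = 2 ** (exp_bits - 1) - 1
    magnitudes = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                magnitudes.add(frac * 2.0 ** (1 - bias))          # zero and subnormals
            else:
                magnitudes.add((1.0 + frac) * 2.0 ** (e - bias))  # normal values
    vmax = max(magnitudes)
    vmin = min(v for v in magnitudes if v > 0)
    code = sorted({s * v / vmax for v in magnitudes for s in (1.0, -1.0)})
    return code, vmax / vmin


def top_binade(code):
    """Count code points in (0.5, 1.0], i.e. the resolution of the top binade."""
    return sum(1 for v in code if 0.5 < v <= 1.0)


e4m3, range_e4m3 = minifloat_codebook(4, 3)
e5m2, range_e5m2 = minifloat_codebook(5, 2)

print("code points:", len(e4m3), len(e5m2))
print("points in top binade:", top_binade(e4m3), top_binade(e5m2))
print("dynamic range ratio:", range_e4m3, range_e5m2)
```

Under these assumptions E4M3 places twice as many code points in each binade as E5M2, while E5M2 spans a much larger ratio between its largest and smallest positive values.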
Two quantization granularity strategies are studied: mixed quantization (blockwise for activations, global for weights) and global quantization (single scaling factor per tensor for both operands).
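The difference between the two granularities comes down to how many scaling factors are computed. The sketch below assumes absmax scaling (each factor is the largest absolute value in its group), which the text above does not specify; the function names are hypothetical.

```python
def global_scale(values):
    """One absmax scaling factor for the whole tensor."""
    return max(abs(v) for v in values)


def blockwise_scales(values, blocksize):
    """One absmax scaling factor per contiguous block of `blocksize` elements."""
    return [max(abs(v) for v in values[i:i + blocksize])
            for i in range(0, len(values), blocksize)]


x = [0.01, -0.02, 0.03, 0.015, 5.0, -4.0, 3.0, 2.5]

print(global_scale(x))          # a single factor, set by the largest element
print(blockwise_scales(x, 4))   # one factor per block of 4 elements
```

With a global factor, the small values in the first half of `x` are scaled by the large elements in the second half; with blockwise factors, each block is normalized by its own maximum.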
Usage
Apply this principle when investigating FP8 training numerics on hardware that lacks native FP8 support, or when comparing quantization granularity strategies (blockwise vs global) for their impact on model convergence and accuracy.
Theoretical Basis
The FP8 simulation follows a quantize-dequantize-compute pattern:
Forward pass:
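# Pseudo-code for mixed FP8 matmul
# Activations: blockwise quantization, one scaling factor per block of `blocksize` elements
A_fp8 = dequantize(quantize_blockwise(A, fp8_code, blocksize))
# Weights: global quantization, one scaling factor for the whole tensor
B_fp8 = dequantize(quantize_global(B, fp8_code))
output = matmul(A_fp8, B_fp8)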
Backward pass:
# The incoming gradient is quantized with the backward codebook (bw_code)
grad_fp8 = dequantize(quantize(grad_output, bw_code))
grad_A = matmul(grad_fp8, B.T)
# As written here, grad_B uses the unquantized grad_output
grad_B = matmul(A.T, grad_output)
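The forward and backward pseudo-code above can be sketched in dependency-free Python for small matrices. This is a minimal illustration under stated assumptions, not the library's implementation: the codebook is a simplified minifloat enumeration normalized to [-1, 1], scaling is absmax, and all helper names (`codebook`, `nearest`, `fake_quant`, `fake_quant_matrix`) are hypothetical.

```python
import bisect


def codebook(exp_bits, man_bits):
    """Normalized signed minifloat codebook (simplified: no inf/NaN patterns)."""
    bias = 2 ** (exp_bits - 1) - 1
    mags = {(m / 2 ** man_bits) * 2.0 ** (1 - bias) for m in range(2 ** man_bits)}
    mags |= {(1 + m / 2 ** man_bits) * 2.0 ** (e - bias)
             for e in range(1, 2 ** exp_bits) for m in range(2 ** man_bits)}
    top = max(mags)
    return sorted({s * v / top for v in mags for s in (1.0, -1.0)})


def nearest(code, x):
    """Snap x (expected in [-1, 1]) to the closest codebook entry."""
    i = bisect.bisect_left(code, x)
    return min(code[max(i - 1, 0):i + 1], key=lambda c: abs(c - x))


def fake_quant(flat, code, blocksize=None):
    """Quantize then dequantize a flat list; blocksize=None uses one global scale."""
    blocksize = blocksize or len(flat)
    out = []
    for start in range(0, len(flat), blocksize):
        block = flat[start:start + blocksize]
        scale = max(abs(v) for v in block) or 1.0  # all-zero block: avoid dividing by 0
        out += [nearest(code, v / scale) * scale for v in block]
    return out


def fake_quant_matrix(M, code, blocksize=None):
    """Apply fake_quant to a row-major matrix given as a list of rows."""
    cols = len(M[0])
    flat = fake_quant([v for row in M for v in row], code, blocksize)
    return [flat[r * cols:(r + 1) * cols] for r in range(len(M))]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(M):
    return [list(col) for col in zip(*M)]


FW_CODE = codebook(4, 3)  # E4M3-like codebook for the forward pass
BW_CODE = codebook(5, 2)  # E5M2-like codebook for the backward pass


def forward(A, B, blocksize=4):
    A_fp8 = fake_quant_matrix(A, FW_CODE, blocksize)  # blockwise activations
    B_fp8 = fake_quant_matrix(B, FW_CODE)             # global weights
    return matmul(A_fp8, B_fp8)


def backward(A, B, grad_output):
    grad_fp8 = fake_quant_matrix(grad_output, BW_CODE)
    grad_A = matmul(grad_fp8, transpose(B))
    grad_B = matmul(transpose(A), grad_output)  # unquantized, following the pseudo-code
    return grad_A, grad_B


A = [[0.5, -1.0, 0.33, 2.0], [1.5, 0.62, -0.5, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
out = forward(A, B)
grad_A, grad_B = backward(A, B, [[1.0, 0.0], [0.0, 1.0]])
print(out)
print(matmul(A, B))  # full-precision reference for comparison
```

Comparing `out` with the full-precision reference shows the rounding that the simulated FP8 path introduces, even though both products are computed with ordinary floating-point arithmetic.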
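The key insight is that blockwise quantization preserves local dynamic range better than global quantization: an outlier only affects the scaling factor of its own block instead of the whole tensor. The cost is storing one scaling factor per block rather than one per tensor.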
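The effect of an outlier can be illustrated with a small experiment. The sketch below assumes absmax scaling and a simplified E4M3-like codebook normalized to [-1, 1] (no NaN pattern reserved); the helper names and the sample data are made up for illustration.

```python
def e4m3_like_code():
    """Simplified E4M3-like codebook: bias 7, subnormals, no reserved patterns."""
    mags = {0.0} | {(m / 8) * 2.0 ** -6 for m in range(1, 8)}                  # subnormals
    mags |= {(1 + m / 8) * 2.0 ** e for e in range(-6, 9) for m in range(8)}   # normals
    top = max(mags)
    return sorted({s * v / top for v in mags for s in (1.0, -1.0)})


def fake_quant(values, code, blocksize):
    """Quantize then dequantize with one absmax scale per block."""
    out = []
    for i in range(0, len(values), blocksize):
        block = values[i:i + blocksize]
        scale = max(abs(v) for v in block) or 1.0
        out += [min(code, key=lambda c: abs(c - v / scale)) * scale for v in block]
    return out


def max_rel_error(reference, approx):
    return max(abs(a - r) / abs(r) for r, a in zip(reference, approx))


code = e4m3_like_code()
# Four small values followed by a block that contains a large outlier.
x = [1e-3, -2.3e-3, 3e-3, 1.7e-3, 1e4, -2.0, 3.0, 1.0]

x_global = fake_quant(x, code, blocksize=len(x))  # one scale for the whole tensor
x_block = fake_quant(x, code, blocksize=4)        # one scale per block of 4

print("global, first block:   ", x_global[:4])
print("blockwise, first block:", x_block[:4])
print("max relative error:", max_rel_error(x, x_global), max_rel_error(x, x_block))
```

In this example the global scale is set by the outlier, so the small values fall below the smallest nonzero code point and are rounded to zero. With blockwise scaling the first block is normalized by its own maximum and keeps its values to within the codebook's rounding error.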