Principle:Bitsandbytes foundation Bitsandbytes FP8 Linear Layer
| Knowledge Sources | |
|---|---|
| Domains | Research, Quantization, Neural_Network_Modules |
| Last Updated | 2026-02-07 13:31 GMT |
Overview
A drop-in linear layer module that simulates FP8 precision training by quantizing activations and weights to 8-bit floating point before matrix multiplication.
Description
Traditional neural network linear layers operate in FP16 or FP32. FP8 linear layers reduce memory bandwidth, and potentially computation cost, by quantizing both operands to 8-bit floating point before the matrix multiplication. Two FP8 formats are used: E4M3 (4 exponent bits and 3 mantissa bits; higher precision, used for the forward pass) and E5M2 (5 exponent bits and 2 mantissa bits; wider dynamic range, used for the backward pass). The layer handles codebook initialization, block size selection, and integration with the autograd system transparently. Because the FP8 arithmetic is simulated, this is a research technique for studying the feasibility of FP8 training on hardware without native FP8 support.
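The precision versus range trade-off between the two formats can be illustrated by enumerating the values of a simplified minifloat. This is a sketch under stated assumptions, not the codebook construction bitsandbytes uses: `minifloat_values` and `step_above_one` are hypothetical helpers, the encoding is IEEE-style with subnormals, and no codes are reserved for infinity or NaN, so the absolute maxima differ from the standardized FP8 formats.

```python
def minifloat_values(exp_bits, man_bits):
    """Return the sorted non-negative values of a simplified minifloat.

    Assumption: IEEE-style bias and subnormals, with every exponent code
    treated as finite (no inf/NaN codes).
    """
    bias = 2 ** (exp_bits - 1) - 1
    values = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                # Subnormal: no implicit leading 1.
                values.add(frac * 2.0 ** (1 - bias))
            else:
                # Normal: implicit leading 1.
                values.add((1 + frac) * 2.0 ** (e - bias))
    return sorted(values)


def step_above_one(values):
    """Gap between 1.0 and the next representable value."""
    i = values.index(1.0)
    return values[i + 1] - values[i]


e4m3 = minifloat_values(4, 3)
e5m2 = minifloat_values(5, 2)
print("E4M3: max", e4m3[-1], "step above 1.0:", step_above_one(e4m3))
print("E5M2: max", e5m2[-1], "step above 1.0:", step_above_one(e5m2))
```

Under these assumptions E4M3 resolves values near 1.0 twice as finely as E5M2, while E5M2 reaches a much larger maximum, which is consistent with using E4M3 for activations and weights and E5M2 for gradients.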
Usage
Apply when researching FP8 training dynamics. Replace standard nn.Linear layers to measure the impact of FP8 quantization on model accuracy and convergence.
Theoretical Basis
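The FP8 linear layer computes, in the forward pass, $Y = Q_{\mathrm{E4M3}}(X)\, Q_{\mathrm{E4M3}}(W)^{\top} + b$, where $Q_f(\cdot)$ denotes quantization to FP8 format $f$ followed by dequantization back to the working precision. In the backward pass, the incoming gradient is quantized with $Q_{\mathrm{E5M2}}$ before it enters the gradient matrix multiplications.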
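The quantize-then-multiply forward pass can be sketched in pure Python. This is a simplified illustration, not the bitsandbytes implementation: `fake_quant` and `linear_fp8_sim` are hypothetical names, and quantization is modeled only as rounding to 3 mantissa bits (as in E4M3), ignoring the limited exponent range, codebook lookup, and block-wise scaling.

```python
import math


def fake_quant(x, man_bits=3):
    """Round x to man_bits explicit mantissa bits (quantize, then dequantize).

    Assumption: exponent range is unbounded; only mantissa precision is modeled.
    """
    if x == 0:
        return 0.0
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (man_bits + 1)     # implicit leading bit + man_bits
    return math.ldexp(round(m * scale) / scale, e)


def linear_fp8_sim(x, weight, bias):
    """y = Q(x) @ Q(weight)^T + bias for list-of-lists inputs.

    x is (batch, in_features); weight is (out_features, in_features).
    """
    qx = [[fake_quant(v) for v in row] for row in x]
    qw = [[fake_quant(v) for v in row] for row in weight]
    return [
        [sum(a * w for a, w in zip(x_row, w_row)) + b
         for w_row, b in zip(qw, bias)]
        for x_row in qx
    ]


x = [[1.0, 0.3]]
weight = [[0.5, 1.1]]
bias = [0.0]
print(linear_fp8_sim(x, weight, bias))
```

Both operands are rounded before the multiply-accumulate, so the output differs slightly from the full-precision product; that difference is the quantity an FP8 training study would track.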
Block sizes are selected automatically from the layer's feature dimensions: each dimension is compared against the descending array [4096, 2048, 1024, 512, 256, 128, 64] and mapped to the power-of-two block size in that list that matches it most closely.
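One plausible reading of this selection rule is sketched below. The exact comparison is an assumption: here `select_block_size` (a hypothetical helper) returns the smallest listed block size that is at least as large as the feature dimension, and falls back to 4096 for larger dimensions.

```python
# Candidate block sizes, in the descending order given in the text.
BLOCK_SIZES = [4096, 2048, 1024, 512, 256, 128, 64]


def select_block_size(features):
    """Pick a block size for a feature dimension.

    Assumption: choose the smallest listed size >= features; dimensions
    larger than every listed size use the largest one (4096).
    """
    candidates = [b for b in BLOCK_SIZES if b >= features]
    return min(candidates) if candidates else BLOCK_SIZES[0]


for features in (48, 64, 768, 4096, 8192):
    print(features, "->", select_block_size(features))
```

Under this reading a 768-wide layer is assigned a block size of 1024, and very small layers are assigned the minimum of 64.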