Principle:Bitsandbytes foundation Bitsandbytes FP8 Linear Layer

Knowledge Sources	FP8 Formats for Deep Learning
Domains	Research, Quantization, Neural_Network_Modules
Last Updated	2026-02-07 13:31 GMT

Overview

A drop-in linear layer module that simulates FP8 precision training by quantizing activations and weights to 8-bit floating point before matrix multiplication.

Description

Traditional neural network linear layers operate in FP16 or FP32. FP8 linear layers reduce memory bandwidth and potentially compute cost by quantizing both operands to 8-bit floating point before the matrix multiplication. Two FP8 formats are used: E4M3 (4 exponent bits, 3 mantissa bits; higher precision, used for the forward pass) and E5M2 (5 exponent bits, 2 mantissa bits; wider dynamic range, used for the backward pass). The layer handles codebook initialization, block size selection, and autograd integration transparently. This is a research technique for studying the feasibility of FP8 training on hardware without native FP8 support.
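The precision versus range trade-off between the two formats can be made concrete with a small calculation. The sketch below follows the format definitions in the cited FP8 Formats for Deep Learning paper, not the codebooks this layer builds internally; the helper name `fp8_range` and its `ieee_like` flag are hypothetical and exist only for illustration.

```python
def fp8_range(exp_bits, man_bits, ieee_like):
    """Return (largest finite value, smallest positive subnormal).

    ieee_like=True  : the top exponent code is reserved for inf/NaN (E5M2).
    ieee_like=False : only the all-ones mantissa at the top exponent
                      is NaN, so the top exponent is usable (E4M3).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2.0 ** (-man_bits)
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        # One mantissa pattern is lost to NaN at the top exponent.
        max_mantissa = 2 - 2.0 ** (1 - man_bits)
    max_value = max_mantissa * 2.0 ** max_exp
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    return max_value, min_subnormal


e4m3 = fp8_range(exp_bits=4, man_bits=3, ieee_like=False)
e5m2 = fp8_range(exp_bits=5, man_bits=2, ieee_like=True)
print("E4M3 max / min subnormal:", e4m3)
print("E5M2 max / min subnormal:", e5m2)
```

E4M3 spends its extra mantissa bit on finer steps within a narrow range, while E5M2 covers a much wider range with coarser steps, which is why the latter suits gradients.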

Usage

Apply when researching FP8 training dynamics. Replace standard nn.Linear layers to measure the impact of FP8 quantization on model accuracy and convergence.

Theoretical Basis

In the forward pass, the FP8 linear layer computes:

$Y = \mathrm{dequant}(\mathrm{quant}_{\mathrm{E4M3}}(X)) \cdot \mathrm{dequant}(\mathrm{quant}_{\mathrm{E4M3}}(W^{T}))$
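The quantize-then-dequantize product can be imitated in plain Python to see where rounding error enters. This is a minimal sketch under simplifying assumptions: `fake_quant` and `fp8_linear` are hypothetical names, values are rounded directly onto an E4M3-style grid with saturation at 448, and the blockwise scaling and codebook lookup that the real layer performs are left out.

```python
import math


def fake_quant(x, man_bits=3, min_exp=-6, max_value=448.0):
    """Quantize and dequantize x on a simplified E4M3-style grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), max_value)  # saturate instead of overflowing
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the leading bit
    # sits at exponent e - 1; clamp it so tiny values use the subnormal step.
    exp = max(math.frexp(mag)[1] - 1, min_exp)
    step = 2.0 ** (exp - man_bits)
    return sign * min(round(mag / step) * step, max_value)


def fp8_linear(x_rows, w_rows):
    """Compute Q(X) @ Q(W)^T for X of shape (n, in) and W of shape (out, in)."""
    xq = [[fake_quant(v) for v in row] for row in x_rows]
    wq = [[fake_quant(v) for v in row] for row in w_rows]
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in wq] for xr in xq]


x = [[1.06, -0.3]]
w = [[1.0, 2.0]]
exact = sum(a * b for a, b in zip(x[0], w[0]))
print("full precision:", exact)
print("simulated FP8 :", fp8_linear(x, w)[0][0])
```

Comparing the two printed values shows the error introduced purely by rounding the operands; the accumulation itself still runs in full precision, as in a simulated FP8 layer.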

Block sizes are selected automatically from the feature dimensions: the layer scans the descending array [4096, 2048, 1024, 512, 256, 128, 64] and picks the entry at the nearest power-of-two boundary.
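One plausible reading of this selection rule is sketched below. It assumes the chosen block size is the smallest candidate that is at least as large as the feature dimension, capped at 4096 for larger dimensions; that interpretation and the helper name `select_block_size` are assumptions made for illustration and should be checked against the implementation page.

```python
BLOCK_SIZES = [4096, 2048, 1024, 512, 256, 128, 64]


def select_block_size(features, sizes=BLOCK_SIZES):
    """Scan the descending candidates and keep the last one that still
    covers `features`; fall back to the largest candidate otherwise."""
    chosen = sizes[0]
    for size in sizes:
        if size >= features:
            chosen = size
        else:
            break  # remaining candidates are even smaller
    return chosen


for features in (32, 64, 768, 4096, 10000):
    print(features, "->", select_block_size(features))
```

Under this reading a 768-wide layer would use a block size of 1024, and any layer wider than 4096 features would share the 4096 block size.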

Related Pages

Implementation:Bitsandbytes_foundation_Bitsandbytes_LinearFP8
