Principle:Bitsandbytes foundation Bitsandbytes FP8 Linear Layer

From Leeroopedia
Revision as of 17:10, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Bitsandbytes_foundation_Bitsandbytes_FP8_Linear_Layer.md)


Knowledge Sources
Domains: Research, Quantization, Neural_Network_Modules
Last Updated: 2026-02-07 13:31 GMT

Overview

A drop-in linear layer module that simulates FP8 precision training by quantizing activations and weights to 8-bit floating point before matrix multiplication.

Description

Neural network linear layers conventionally operate in FP16 or FP32. FP8 linear layers aim to reduce memory bandwidth and potentially computation cost by quantizing both operands, the activations and the weights, to 8-bit floating point before the matrix multiplication. Two FP8 formats are used: E4M3 (4 exponent bits and 3 mantissa bits; higher precision, used for the forward pass) and E5M2 (5 exponent bits and 2 mantissa bits; wider dynamic range, used for the backward pass). The layer handles codebook initialization, block size selection, and integration with the autograd system transparently. Because the FP8 arithmetic is simulated, this is a research technique for studying FP8 training feasibility on hardware without native FP8 support.
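To make the precision versus dynamic range trade-off between the two formats concrete, the following minimal sketch enumerates the positive values of a generic sign/exponent/mantissa 8-bit format. It assumes an IEEE-style exponent bias and subnormals, and it does not reserve any bit patterns for infinity or NaN, so the largest values it reports differ from those of hardware FP8 specifications. The function name fp8_positive_values is hypothetical and is not part of any library.

```python
def fp8_positive_values(exp_bits, man_bits):
    """Return the sorted positive values of a simple FP8-like format.

    Assumptions (illustrative only): bias = 2**(exp_bits - 1) - 1,
    subnormals when the exponent field is 0, no codes reserved for inf/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    values = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                v = frac * 2.0 ** (1 - bias)          # subnormal range
            else:
                v = (1.0 + frac) * 2.0 ** (e - bias)  # normal range
            if v > 0:
                values.add(v)
    return sorted(values)


e4m3 = fp8_positive_values(4, 3)
e5m2 = fp8_positive_values(5, 2)

for name, vals in (("E4M3", e4m3), ("E5M2", e5m2)):
    step_after_one = vals[vals.index(1.0) + 1] - 1.0
    print(name, "min:", vals[0], "max:", vals[-1], "step after 1.0:", step_after_one)
```

Under these assumptions E4M3 has the finer spacing around 1.0, while E5M2 spans a much larger ratio between its largest and smallest positive values, which is consistent with using E4M3 for the forward pass and E5M2 for the backward pass.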

Usage

Apply when researching FP8 training dynamics. Replace standard nn.Linear layers to measure the impact of FP8 quantization on model accuracy and convergence.
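One way to carry out the replacement is a recursive swap over the module tree, sketched below in PyTorch. The helper swap_linear_layers and the StandInFP8Linear class are hypothetical names introduced for illustration; the stand-in is an ordinary nn.Linear subclass so the sketch runs without any FP8 implementation, and in practice the factory argument would construct the FP8 linear layer instead. The sketch assumes the replacement module exposes the same weight and bias parameters as nn.Linear.

```python
import torch
from torch import nn


def swap_linear_layers(model, make_fp8_linear):
    """Recursively replace every plain nn.Linear in `model`.

    `make_fp8_linear(in_features, out_features, bias)` is a caller-supplied
    factory (hypothetical) returning the replacement module.
    Returns the number of layers replaced.
    """
    replaced = 0
    for name, child in list(model.named_children()):
        if type(child) is nn.Linear:
            new = make_fp8_linear(
                child.in_features, child.out_features, child.bias is not None
            )
            # Carry over the existing parameters so only the arithmetic changes.
            new.load_state_dict(child.state_dict())
            new.to(child.weight.device, child.weight.dtype)
            setattr(model, name, new)
            replaced += 1
        else:
            replaced += swap_linear_layers(child, make_fp8_linear)
    return replaced


class StandInFP8Linear(nn.Linear):
    """Placeholder for an FP8 linear layer (assumption, not a real FP8 module)."""


model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Sequential(nn.Linear(16, 4)))
num_replaced = swap_linear_layers(model, StandInFP8Linear)
```

Copying the state dict keeps the comparison fair: the swapped model starts from the same weights as the baseline, so any difference in accuracy or convergence can be attributed to the quantization.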

Theoretical Basis

The FP8 linear layer computes:

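$$Y = \operatorname{dequant}\big(\operatorname{quant}_{\mathrm{E4M3}}(X)\big)\,\operatorname{dequant}\big(\operatorname{quant}_{\mathrm{E4M3}}(W^{T})\big)$$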
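The forward computation above can be sketched in plain Python as a quantize-then-dequantize step applied to each operand, followed by an ordinary matrix product. This is an illustrative sketch under stated assumptions, not the library's implementation: the codebook is a simple E4M3-like value set normalized to [-1, 1], scaling uses a single absolute maximum per matrix rather than per block, and the helper names (fp8_codebook, fake_quantize, fp8_linear_forward) are hypothetical.

```python
import bisect


def fp8_codebook(exp_bits, man_bits):
    """Sorted signed values of a simple FP8-like format, normalized to [-1, 1]."""
    bias = 2 ** (exp_bits - 1) - 1
    pos = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            pos.add(frac * 2.0 ** (1 - bias) if e == 0 else (1 + frac) * 2.0 ** (e - bias))
    top = max(pos)
    return sorted({s * v / top for v in pos for s in (-1.0, 1.0)})


def fake_quantize(matrix, code):
    """dequant(quant(.)): snap each entry to the nearest scaled codebook value."""
    absmax = max(abs(v) for row in matrix for v in row) or 1.0

    def snap(v):
        x = v / absmax                        # map into the codebook range [-1, 1]
        i = bisect.bisect_left(code, x)
        candidates = code[max(i - 1, 0): i + 1]
        return min(candidates, key=lambda c: abs(c - x)) * absmax

    return [[snap(v) for v in row] for row in matrix]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fp8_linear_forward(x, w, code):
    """Y = dequant(quant(X)) @ dequant(quant(W^T)), with W of shape (out, in)."""
    w_t = [list(col) for col in zip(*w)]
    return matmul(fake_quantize(x, code), fake_quantize(w_t, code))


code = fp8_codebook(4, 3)                      # E4M3-like forward codebook
x = [[0.5, -1.25, 2.0], [0.1, 0.2, 0.3]]       # 2 samples, 3 input features
w = [[1.0, 0.0, -1.0], [0.3, 0.7, -0.2]]       # 2 output features, 3 input features
y_fp8 = fp8_linear_forward(x, w, code)
y_ref = matmul(x, [list(col) for col in zip(*w)])
```

Comparing y_fp8 against y_ref gives a direct view of the error introduced by the simulated quantization for a given input.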

The quantization block size is selected automatically from the layer's feature dimensions: the layer scans the descending array [4096, 2048, 1024, 512, 256, 128, 64] and picks the entry at the nearest power-of-two boundary for the given dimension.
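The selection rule can be sketched as a scan over the descending array. The sketch below assumes one plausible reading of the rule, which the text above does not spell out: pick the smallest listed size that is at least as large as the feature count, capped at 4096. The function name select_block_size is hypothetical.

```python
BLOCK_SIZES = [4096, 2048, 1024, 512, 256, 128, 64]


def select_block_size(features):
    """Scan the descending list and return the first size whose next smaller
    neighbor would be too small for `features` (assumed selection rule)."""
    # Pair each size with the next smaller one; 0 acts as a sentinel after 64.
    for size, next_smaller in zip(BLOCK_SIZES, BLOCK_SIZES[1:] + [0]):
        if features > next_smaller:
            return size
    raise ValueError("features must be a positive integer")


for features in (64, 100, 768, 1024, 5000):
    print(features, "->", select_block_size(features))
```

Under this reading, a dimension of 768 maps to a block size of 1024, dimensions at or below 64 map to 64, and anything above 2048 maps to 4096.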
