Principle:Bitsandbytes foundation Bitsandbytes SwitchBack Quantized Linear

Knowledge Sources	Bitsandbytes
Domains	Quantization, INT8, Training
Last Updated	2026-02-07 13:31 GMT

Overview

SwitchBack is an INT8 quantized linear layer technique that quantizes the forward pass, using either global or vector-wise granularity for the weights, and switches back to standard precision for the weight gradient computation.

Description

The SwitchBack approach performs quantized INT8 matrix multiplication in the forward pass for the linear transformation Y = X @ W^T, and again for the input gradient in the backward pass, but "switches back" to standard-precision computation for the weight gradient dW = dY^T @ X, where dY is the gradient with respect to the output Y. This hybrid strategy is motivated by the observation that weight gradients are more sensitive to quantization noise than activations. Two quantization strategies are supported for the weights: global (a single scaling factor for the whole weight tensor, with activations still quantized row-wise) and vector-wise (per-row scaling factors for both activations and weights). A memory-efficient variant saves the quantized activations instead of the full-precision ones during the forward pass and dequantizes them in the backward pass, trading extra backward compute for memory savings.
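The memory-efficient variant described above can be illustrated with a minimal pure-Python sketch. The helper names (quantize_rowwise, dequantize_rowwise, matmul_at_b) and the toy values are assumptions made for illustration and are not the library's kernels; the sketch only shows that the weight gradient can be rebuilt from saved INT8 activations plus their per-row scales.

```python
def quantize_rowwise(m):
    """Return (int8-range rows, per-row scales) using absmax scaling."""
    q, scales = [], []
    for row in m:
        absmax = max(abs(v) for v in row) or 1.0  # guard against all-zero rows
        scales.append(absmax / 127.0)
        q.append([round(v * 127.0 / absmax) for v in row])
    return q, scales


def dequantize_rowwise(q, scales):
    """Map quantized rows back to floats with their per-row scales."""
    return [[v * s for v in row] for row, s in zip(q, scales)]


def matmul_at_b(a, b):
    """Compute a^T @ b for 2-D lists a (n x p) and b (n x q)."""
    return [[sum(ra[i] * rb[j] for ra, rb in zip(a, b))
             for j in range(len(b[0]))] for i in range(len(a[0]))]


X = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]   # activations, batch 2 x in 3
dY = [[0.1, -0.2], [0.05, 0.3]]             # output gradient, batch 2 x out 2

# Forward: keep only the quantized activations and their scales for backward.
X_int8, scale_X = quantize_rowwise(X)
saved_for_backward = (X_int8, scale_X)

# Backward: restore an approximation of X, then run the weight-gradient
# matmul in floating point (the "switch back" step itself is unchanged).
X_restored = dequantize_rowwise(*saved_for_backward)
dW_mem_efficient = matmul_at_b(dY, X_restored)
dW_reference = matmul_at_b(dY, X)  # what saving full-precision X would give

max_diff = max(abs(a - b)
               for ra, rb in zip(dW_mem_efficient, dW_reference)
               for a, b in zip(ra, rb))
```

In this sketch the saved state is one small integer per activation plus one scale per row. The price is the extra dequantize step in the backward pass and a small approximation error in dW, reported here as max_diff.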

Usage

Apply this principle when training models where memory reduction from INT8 forward passes is desired but weight gradient quality must be preserved. It is a middle ground between full-precision training and fully-quantized training.

Theoretical Basis

Forward pass (quantized):

X_int8, scale_X = quantize_rowwise(X)
W_int8, scale_W = quantize_global(W)  # or quantize_rowwise
Y = int8_matmul_dequantize(X_int8, W_int8.T, scale_X, scale_W)
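The forward pass above can be made concrete with a small pure-Python sketch. The function names mirror the pseudocode, but their bodies and signatures are assumptions for illustration (for example, this int8_matmul_dequantize takes the weight in its stored (out, in) layout and applies the transpose internally); the toy inputs are made up.

```python
def quantize_rowwise(m):
    """Quantize each row with its own absmax scale (vector-wise)."""
    q, scales = [], []
    for row in m:
        absmax = max(abs(v) for v in row) or 1.0  # guard against all-zero rows
        scales.append(absmax / 127.0)
        q.append([round(v * 127.0 / absmax) for v in row])
    return q, scales


def quantize_global(m):
    """Quantize the whole matrix with one shared absmax scale."""
    absmax = max(abs(v) for row in m for v in row) or 1.0
    return [[round(v * 127.0 / absmax) for v in row] for row in m], absmax / 127.0


def int8_matmul_dequantize(x_q, w_q, scale_x, scale_w):
    """Integer x_q @ w_q^T, then rescale; scale_w is a float or a per-row list."""
    out = []
    for xr, sx in zip(x_q, scale_x):
        row = []
        for j, wr in enumerate(w_q):
            acc = sum(a * b for a, b in zip(xr, wr))  # integer accumulation
            sw = scale_w[j] if isinstance(scale_w, list) else scale_w
            row.append(acc * sx * sw)
        out.append(row)
    return out


def max_err(y, ref):
    return max(abs(a - b) for ry, rr in zip(y, ref) for a, b in zip(ry, rr))


X = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]
W = [[0.2, -0.3, 0.1], [4.0, 0.5, -2.0]]  # rows with very different magnitudes

X_int8, scale_X = quantize_rowwise(X)
W_g, scale_Wg = quantize_global(W)    # global: one scale for all of W
W_r, scale_Wr = quantize_rowwise(W)   # vector-wise: one scale per output row

Y_global = int8_matmul_dequantize(X_int8, W_g, scale_X, scale_Wg)
Y_vector = int8_matmul_dequantize(X_int8, W_r, scale_X, scale_Wr)
Y_exact = [[sum(a * b for a, b in zip(xr, wr)) for wr in W] for xr in X]

err_global = max_err(Y_global, Y_exact)
err_vector = max_err(Y_vector, Y_exact)
```

On this toy input, where one weight row is much larger than the other, the vector-wise scales give a smaller maximum error than the single global scale. That is the accuracy-versus-simplicity trade-off between the two granularities, shown for one hand-picked case only.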

Backward pass (mixed):

# Gradient w.r.t. input: quantized
dY_int8, scale_dY = quantize_rowwise(dY)
dX = int8_matmul_dequantize(dY_int8, W_int8, ...)

# Gradient w.r.t. weight: STANDARD precision ("switch back")
dW = matmul(dY.T, X)  # full-precision matmul
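The mixed backward pass above can be sketched in pure Python as follows. The function switchback_backward and its helpers are hypothetical names introduced for this illustration, the weight is quantized with a single global scale, and plain Python floats stand in for the standard-precision path.

```python
def quantize_rowwise(m):
    """Per-row absmax quantization to the int8 range."""
    q, scales = [], []
    for row in m:
        absmax = max(abs(v) for v in row) or 1.0  # guard against all-zero rows
        scales.append(absmax / 127.0)
        q.append([round(v * 127.0 / absmax) for v in row])
    return q, scales


def quantize_global(m):
    """Whole-tensor absmax quantization to the int8 range."""
    absmax = max(abs(v) for row in m for v in row) or 1.0
    return [[round(v * 127.0 / absmax) for v in row] for row in m], absmax / 127.0


def matmul(a, b):
    """Plain a @ b for 2-D lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def switchback_backward(dY, X, W):
    # Input gradient dX = dY @ W: int8 operands, integer accumulation, rescale.
    dY_q, scale_dY = quantize_rowwise(dY)
    W_q, scale_W = quantize_global(W)
    acc = matmul(dY_q, W_q)  # entries are Python ints
    dX = [[v * s * scale_W for v in row] for row, s in zip(acc, scale_dY)]
    # Weight gradient dW = dY^T @ X: no quantization ("switch back").
    dW = matmul(transpose(dY), X)
    return dX, dW


X = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]   # batch 2 x in 3
W = [[0.2, -0.3, 0.1], [4.0, 0.5, -2.0]]    # out 2 x in 3
dY = [[0.1, -0.2], [0.05, 0.3]]             # batch 2 x out 2

dX, dW = switchback_backward(dY, X, W)
dX_exact = matmul(dY, W)
dX_err = max(abs(a - b) for ra, rb in zip(dX, dX_exact) for a, b in zip(ra, rb))
```

Here dX carries a small quantization error (dX_err), while dW is computed from the unquantized dY and X and therefore matches the floating-point reference exactly.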

Related Pages

Implementation:Bitsandbytes_foundation_Bitsandbytes_SwitchBackLinear
