Principle:Bitsandbytes foundation Bitsandbytes SwitchBack Quantized Linear
| Knowledge Sources | |
|---|---|
| Domains | Quantization, INT8, Training |
| Last Updated | 2026-02-07 13:31 GMT |
Overview
An INT8 quantized linear layer technique that uses different quantization granularities (global vs vector-wise) in the forward pass and switches back to standard precision for weight gradient computation.
Description
The SwitchBack approach performs quantized INT8 matrix multiplication in the forward pass for the linear transformation Y = X @ W^T, but "switches back" to standard-precision computation for the weight gradient dW = dY^T @ X in the backward pass, where dY is the gradient of the loss with respect to Y. The input gradient dX can still be computed in INT8. This hybrid strategy is motivated by the observation that weight gradients are more sensitive to quantization noise than activations. Two quantization strategies are supported for the forward pass: global (row-wise scaling factors for activations and a single scaling factor for the whole weight tensor) and vector-wise (row-wise scaling factors for both activations and weights). A memory-efficient variant saves the quantized activations instead of the full-precision ones during the forward pass and dequantizes them in the backward pass, trading extra backward compute for memory savings.
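To make the two granularities concrete, here is a minimal NumPy sketch. It assumes symmetric absmax scaling to the range [-127, 127]; the helper names `quantize_global` and `quantize_rowwise` mirror the pseudocode on this page but are illustrative re-implementations, not the library's kernels. One row is given a much larger range than the others to show why per-row scales can help.

```python
import numpy as np

def quantize_global(t):
    """Quantize a whole tensor to INT8 using one scale (tensor absmax / 127)."""
    scale = max(float(np.abs(t).max()), 1e-12) / 127.0  # guard against all-zero input
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_rowwise(t):
    """Quantize each row to INT8 using its own scale (row absmax / 127)."""
    scale = np.abs(t).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
x[0] *= 50.0  # one row with a much larger dynamic range than the rest

q_g, s_g = quantize_global(x)
q_r, s_r = quantize_rowwise(x)

# Mean absolute reconstruction error after dequantizing each version.
err_global = float(np.abs(q_g.astype(np.float32) * s_g - x).mean())
err_rowwise = float(np.abs(q_r.astype(np.float32) * s_r - x).mean())
print(f"global: {err_global:.4f}  row-wise: {err_rowwise:.4f}")
```

With a single global scale, the large row dictates the step size for every row, so the small rows lose resolution. Row-wise scales avoid this at the cost of storing one scale per row.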
Usage
Apply this principle when training models where memory reduction from INT8 forward passes is desired but weight gradient quality must be preserved. It is a middle ground between full-precision training and fully-quantized training.
Theoretical Basis
Forward pass (quantized):
X_int8, scale_X = quantize_rowwise(X)
W_int8, scale_W = quantize_global(W) # or quantize_rowwise
Y = int8_matmul_dequantize(X_int8, W_int8.T, scale_X, scale_W)
Backward pass (mixed):
# Gradient w.r.t. input: quantized
dY_int8, scale_dY = quantize_rowwise(dY)
dX = int8_matmul_dequantize(dY_int8, W_int8, ...)
# Gradient w.r.t. weight: STANDARD precision ("switch back")
dW = matmul(dY.T, X) # full-precision matmul
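The pseudocode above can be sketched as runnable NumPy under a few assumptions: symmetric absmax scaling, INT32 accumulation emulated by casting, and global weight quantization. The function names `switchback_forward` and `switchback_backward` and the `mem_efficient` flag are hypothetical names chosen for illustration. This is a CPU illustration of the data flow, not the library's fused GPU implementation.

```python
import numpy as np

def quantize_rowwise(t):
    """Per-row absmax INT8 quantization; returns (int8 tensor, column of scales)."""
    scale = np.abs(t).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

def quantize_global(t):
    """Whole-tensor absmax INT8 quantization; returns (int8 tensor, scalar scale)."""
    scale = max(float(np.abs(t).max()), 1e-12) / 127.0
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

def int8_matmul_dequantize(a_int8, b_int8, scale_a, scale_b):
    """Integer matmul with INT32 accumulation, then rescale back to float32."""
    acc = a_int8.astype(np.int32) @ b_int8.astype(np.int32)
    return (acc * scale_a * scale_b).astype(np.float32)

def switchback_forward(x, w, mem_efficient=False):
    """x: (batch, in), w: (out, in). Returns y and the state saved for backward."""
    x_int8, scale_x = quantize_rowwise(x)
    w_int8, scale_w = quantize_global(w)
    y = int8_matmul_dequantize(x_int8, w_int8.T, scale_x, scale_w)
    # Memory-efficient variant: keep the INT8 activations instead of float x.
    saved_x = (x_int8, scale_x) if mem_efficient else x
    return y, (saved_x, w_int8, scale_w)

def switchback_backward(dy, saved):
    """dy: (batch, out). Returns (dx, dw)."""
    saved_x, w_int8, scale_w = saved
    if isinstance(saved_x, tuple):  # dequantize activations saved as INT8
        x_int8, scale_x = saved_x
        x = x_int8.astype(np.float32) * scale_x
    else:
        x = saved_x
    # Gradient w.r.t. input: quantized matmul.
    dy_int8, scale_dy = quantize_rowwise(dy)
    dx = int8_matmul_dequantize(dy_int8, w_int8, scale_dy, scale_w)
    # Gradient w.r.t. weight: the "switch back" to a standard-precision matmul.
    dw = dy.T @ x
    return dx, dw

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32)).astype(np.float32)
w = rng.standard_normal((8, 32)).astype(np.float32)
dy = rng.standard_normal((16, 8)).astype(np.float32)

y, saved = switchback_forward(x, w)
dx, dw = switchback_backward(dy, saved)

y_me, saved_me = switchback_forward(x, w, mem_efficient=True)
dx_me, dw_me = switchback_backward(dy, saved_me)

def rel_err(approx, exact):
    """Relative Frobenius-norm error between an approximation and a reference."""
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))

print("y  rel. error:", rel_err(y, x @ w.T))
print("dx rel. error:", rel_err(dx, dy @ w))
print("dw rel. error (mem-efficient):", rel_err(dw_me, dy.T @ x))
```

In the default path `dw` matches the full-precision result exactly, because no quantized tensor enters that matmul. In the memory-efficient path `dw` picks up a small error from dequantizing the saved activations, which is the trade-off described in the Description section.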