Principle: Bitsandbytes FSDP 4-bit Quantization
Metadata
| Field | Value |
|---|---|
| Sources | Paper: QLoRA, Blog: FSDP QLoRA, Repo: bitsandbytes |
| Domains | Quantization, Distributed_Training |
| Last updated | 2026-02-07 14:00 GMT |
Overview
A specialized 4-bit quantization approach that stores quantized weights in a float dtype (e.g., bfloat16) to enable compatibility with FSDP parameter sharding.
Description
Standard 4-bit quantization packs weights into uint8 storage, but FSDP requires all parameters to share a uniform dtype for sharding and all-gather operations. The solution is to set quant_storage=torch.bfloat16: the packed 4-bit values are stored in bfloat16 tensors instead of uint8. FSDP can then treat quantized weights as ordinary bfloat16 parameters for sharding, while the underlying data remains 4-bit quantized.
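Only the grouping of packed bytes into tensor elements changes; the total byte count does not. A quick arithmetic sketch for a hypothetical 4096×4096 layer (the helper name is illustrative, not a bitsandbytes API):

```python
def packed_elements(n_weights: int, storage_bytes: int) -> int:
    """Number of storage-dtype elements needed to hold n_weights 4-bit values."""
    packed_bytes = n_weights // 2        # two 4-bit codes per byte
    return packed_bytes // storage_bytes

n = 4096 * 4096                          # weights in one hypothetical layer
print(packed_elements(n, 1))  # uint8 storage (1 byte/elem): 8_388_608 elements
print(packed_elements(n, 2))  # bfloat16 storage (2 bytes/elem): 4_194_304 elements
```

Either way the layer occupies the same 8 MiB of packed data; bfloat16 storage simply halves the element count because each container holds two bytes.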
When loading the model, the torch_dtype argument must match quant_storage; otherwise quantized and non-quantized parameters end up with mixed dtypes and FSDP cannot build uniform flat shards.
A critical helper function fix_4bit_weight_quant_state_from_module recovers quantization state (QuantState) that may be lost during FSDP shard/unshard operations. This function is called at the start of every Linear4bit.forward() to ensure the weight tensor always has its quantization metadata available.
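The recovery pattern can be illustrated with a minimal sketch. The class and attribute names below are simplified stand-ins, not the actual bitsandbytes implementation:

```python
class QuantState:
    """Stand-in for bitsandbytes' QuantState: dequantization metadata
    such as absmax scales, blocksize, and quant type."""
    def __init__(self, absmax):
        self.absmax = absmax

class PackedWeight:
    """Stand-in for Params4bit: packed 4-bit data plus its quant state."""
    def __init__(self, data, quant_state=None):
        self.data = data
        self.quant_state = quant_state

class Linear4bitSketch:
    def __init__(self, weight):
        self.weight = weight
        # Keep a module-level backup of the quant state so it can be
        # restored after FSDP rebuilds the weight from flat shards.
        self._saved_quant_state = weight.quant_state

    def _fix_quant_state(self):
        # Mirrors the role of fix_4bit_weight_quant_state_from_module:
        # if the shard/unshard round trip dropped the metadata, restore it.
        if self.weight.quant_state is None:
            self.weight.quant_state = self._saved_quant_state

    def forward(self, x):
        self._fix_quant_state()  # first step of every forward pass
        # ... dequantize self.weight using its quant_state, then matmul ...
        return x
```

The essential design choice is redundancy: the metadata lives both on the weight tensor (where dequantization needs it) and on the module (which FSDP does not rewrite), so the forward pass can always repair the former from the latter.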
Usage
Required for distributed fine-tuning of large models (e.g., 70B parameters) across multiple GPUs using FSDP. Enables training models that would not fit on a single GPU even with quantization.
Typical configuration:
```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,  # key for FSDP compatibility
)
```
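Given the bnb_config above, loading then follows this pattern (the model id is a placeholder; this is a sketch of the pattern, not a verified recipe):

```python
import torch
from transformers import AutoModelForCausalLM

# torch_dtype must match bnb_4bit_quant_storage so that quantized and
# non-quantized parameters share one dtype FSDP can flatten and shard.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-70b-model",       # placeholder model id
    quantization_config=bnb_config,  # the config defined above
    torch_dtype=torch.bfloat16,      # matches bnb_4bit_quant_storage
)
```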
Theoretical Basis
FSDP shards model parameters across data-parallel ranks. Each rank holds 1/N of each parameter. For this to work, all parameters must be in a uniform dtype that supports sharding (gather/scatter operations).
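A rough per-rank storage estimate makes the payoff concrete (illustrative numbers only, ignoring quantization metadata such as absmax scales and double-quant constants):

```python
def per_rank_weight_bytes(n_params: int, bits_per_param: float, n_ranks: int) -> float:
    """Approximate sharded weight storage per FSDP rank, ignoring
    quantization metadata overhead."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / n_ranks

# Hypothetical setup: a 70B-parameter model in 4-bit, sharded over 8 GPUs.
gib = per_rank_weight_bytes(70_000_000_000, 4, 8) / 2**30
print(f"{gib:.1f} GiB of quantized weights per rank")  # ≈ 4.1 GiB
```

The same model in bfloat16 without sharding would need roughly 130 GiB of weights on every GPU, which is why both the 4-bit compression and the 1/N sharding are needed together.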
By storing 4-bit data in bfloat16 containers, we satisfy FSDP's dtype requirement while maintaining 4-bit compression. The quant_storage dtype is set on both the Linear4bit module (for state recovery) and the Params4bit weight (for actual storage).
The key insight is that the container dtype (bfloat16) is separate from the data representation (4-bit quantized values). FSDP operates on the container dtype for sharding and communication, while the quantized data inside those containers is preserved transparently.
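This separation can be demonstrated without any framework: pack 4-bit codes into bytes, regroup the bytes into 2-byte "containers", and observe that the codes survive unchanged. This is a toy sketch, not the actual bitsandbytes packing layout:

```python
def pack_nibbles(vals):
    """Pack a list of 4-bit codes (0..15) into bytes, two codes per byte."""
    assert len(vals) % 4 == 0  # so the result also splits into 2-byte containers
    return bytes((hi << 4) | lo for hi, lo in zip(vals[0::2], vals[1::2]))

def as_bf16_containers(packed: bytes):
    """Regroup packed bytes into 2-byte chunks, standing in for bfloat16
    elements: this is all FSDP sees when it shards and all-gathers."""
    return [packed[i:i + 2] for i in range(0, len(packed), 2)]

def unpack_nibbles(packed: bytes):
    """Recover the original 4-bit codes from the packed bytes."""
    out = []
    for b in packed:
        out.extend(((b >> 4) & 0xF, b & 0xF))
    return out

codes = [3, 12, 0, 15, 7, 7, 1, 9]       # eight 4-bit quantized values
packed = pack_nibbles(codes)             # four bytes of packed data
containers = as_bf16_containers(packed)  # two 2-byte "bfloat16" elements
assert unpack_nibbles(packed) == codes   # round trip: the codes are untouched
```

Regrouping the bytes changed only the element boundaries, never the bits, which is exactly why FSDP can communicate these tensors as bfloat16 while dequantization later still sees valid 4-bit codes.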