Workflow: bitsandbytes FSDP QLoRA Distributed Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Distributed_Training, Quantization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
End-to-end process for distributed fine-tuning of large language models (up to 70B parameters) using FSDP (Fully Sharded Data Parallel) combined with 4-bit quantization and LoRA adapters on multi-GPU setups.
Description
This workflow combines three techniques for memory-efficient distributed training: FSDP for sharding model parameters, gradients, and optimizer states across GPUs; 4-bit NF4 quantization from bitsandbytes to compress the base model weights; and LoRA (Low-Rank Adaptation) to add small trainable adapter matrices while keeping the quantized base frozen. The critical bitsandbytes innovation enabling this combination is the quant_storage parameter, which allows quantized weights to be stored in FSDP-compatible float dtypes (bfloat16/float16/float32) rather than the default uint8, enabling FSDP to shard quantized layers correctly. This workflow integrates with the Hugging Face ecosystem (Transformers, PEFT, TRL, Accelerate).
Usage
Execute this workflow when you need to fine-tune a very large language model (30B-70B+ parameters) on custom data and have multiple GPUs available, but each GPU has limited memory (e.g., 2x 24GB consumer GPUs). This is the standard approach for training models that would not fit on a single GPU even with 4-bit quantization alone.
Execution Steps
Step 1: Configure 4-bit Quantization with FSDP-Compatible Storage
Create a BitsAndBytesConfig with 4-bit quantization enabled, NF4 quantization type, bfloat16 compute dtype, and critically, set bnb_4bit_quant_storage to a float dtype (e.g., torch.bfloat16). This quant_storage parameter is what makes FSDP sharding possible: FSDP can only wrap layers with matching float dtypes, so storing quantized weights as bfloat16 (instead of the default uint8) allows Linear4bit layers to be wrapped and sharded identically to standard Linear layers.
Key considerations:
- The quant_storage dtype MUST match the torch_dtype used for model loading
- If storage types do not match, each Linear4bit layer is wrapped individually (less efficient)
- Double quantization (bnb_4bit_use_double_quant=True) is recommended for additional memory savings
- bitsandbytes uses StoreChar internally to handle read/write regardless of the underlying storage type
Step 2: Load Base Model with Quantization Config
Load the pretrained model from the Hugging Face Hub, passing the quantization configuration and matching torch_dtype. The model loader replaces Linear layers with Linear4bit layers. The Params4bit objects are initialized with the specified quant_storage dtype, enabling FSDP compatibility.
Key considerations:
- Set torch_dtype to match bnb_4bit_quant_storage for correct FSDP wrapping
- Quantization occurs lazily when layers are moved to GPU (same as the inference workflow)
- The model can be loaded on CPU first and then distributed via FSDP
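A minimal loading sketch, continuing from the Step 1 config (the model ID is an illustrative choice; any causal LM on the Hub works the same way):

```python
import torch
from transformers import AutoModelForCausalLM

# torch_dtype MUST match bnb_4bit_quant_storage from Step 1 so that FSDP
# auto-wrapping sees uniform dtypes across quantized and unquantized layers.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",     # hypothetical base model
    quantization_config=bnb_config,  # BitsAndBytesConfig from Step 1
    torch_dtype=torch.bfloat16,      # matches bnb_4bit_quant_storage
)
```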
Step 3: Configure LoRA Adapters
Set up the LoRA configuration specifying the rank (r), alpha scaling factor, dropout, and target modules. For QLoRA fine-tuning, target_modules="all-linear" is recommended to inject adapters into all linear layers. Only the small LoRA adapter weights (typically less than 1% of total parameters) will be trained; the quantized base model weights remain frozen.
Key considerations:
- LoRA rank (r) controls the adapter capacity; common values are 16-64
- lora_alpha controls the scaling of the adapter contribution
- target_modules="all-linear" applies adapters to every Linear4bit layer in the model
- Only adapter parameters have requires_grad=True; quantized base weights are frozen
Step 4: Configure FSDP and Launch Distributed Training
Set up the FSDP configuration (via Accelerate config or torchrun) specifying sharding strategy, auto-wrapping policy, and mixed-precision settings. Launch the distributed training job using Accelerate or torchrun. FSDP shards the model parameters (including quantized weights and LoRA adapters), gradients, and optimizer states across the available GPUs.
Key considerations:
- Use the PEFT-provided fsdp_config_qlora.yaml as a starting configuration
- FSDP auto-wrapping groups layers with matching dtypes, which is why quant_storage alignment is critical
- The Accelerate library handles the FSDP setup and distributed launch
- Ensure all GPUs have sufficient memory for their shard plus activation memory
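An abridged Accelerate config sketch in the spirit of PEFT's fsdp_config_qlora.yaml (field names follow Accelerate's FSDP config schema; values are illustrative for a 2-GPU setup):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
mixed_precision: bf16
num_processes: 2
```

The job is then launched with `accelerate launch --config_file <config>.yaml train.py`, where `train.py` is your training script.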
Step 5: Execute Training Loop
Run the training loop using SFTTrainer (from TRL) or a custom training loop. The trainer handles forward passes through the quantized model with LoRA adapters, loss computation, backward passes (gradients flow only through LoRA parameters), and optimizer steps. FSDP manages the all-gather and reduce-scatter communication for parameter sharding.
Key considerations:
- Forward pass: quantized weights are dequantized, LoRA adapters are applied, matmul is performed
- Backward pass: gradients are computed only for LoRA adapter parameters
- FSDP all-gathers parameter shards before forward/backward and re-shards after
- Use a paged optimizer (e.g., PagedAdamW8bit) for additional memory savings on optimizer states
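A training-loop sketch using TRL's SFTTrainer (argument names vary somewhat across TRL versions; `model`, `lora_config`, and `dataset` are assumed from the earlier steps and your own data pipeline):

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,              # quantized base model from Step 2
    train_dataset=dataset,    # your dataset (assumed)
    peft_config=lora_config,  # LoRA config from Step 3
    args=SFTConfig(
        output_dir="qlora-fsdp-out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        bf16=True,
        optim="paged_adamw_8bit",  # paged optimizer for optimizer-state savings
    ),
)
trainer.train()  # FSDP handles all-gather/reduce-scatter under the hood
```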
Step 6: Save and Merge Adapters
After training, save the LoRA adapter weights. These can be saved separately from the base model and loaded later for inference, or merged into the base model for deployment. The adapter weights are small (typically tens of megabytes) compared to the full model.
Key considerations:
- Save only the LoRA adapter weights, not the full quantized model
- Adapters can be merged into a dequantized model for deployment without a bitsandbytes dependency
- PEFT provides save_pretrained() for saving adapters and PeftModel.from_pretrained() for loading them
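A save-and-merge sketch continuing from the trainer above (the model ID and output paths are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save only the small LoRA adapter weights (typically tens of MB).
trainer.model.save_pretrained("qlora-adapter")

# For deployment: reload the base model in bf16 (no quantization),
# attach the adapter, and merge it into the base weights.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "qlora-adapter").merge_and_unload()
merged.save_pretrained("merged-model")  # runs without bitsandbytes installed
```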