Workflow: bitsandbytes FSDP QLoRA Distributed Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Distributed_Training, Quantization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
End-to-end process for distributed fine-tuning of large language models (up to 70B parameters) using FSDP (Fully Sharded Data Parallel) combined with 4-bit quantization and LoRA adapters on multi-GPU setups.
Description
This workflow combines three techniques for memory-efficient distributed training: FSDP for sharding model parameters, gradients, and optimizer states across GPUs; 4-bit NF4 quantization from bitsandbytes to compress the base model weights; and LoRA (Low-Rank Adaptation) to add small trainable adapter matrices while keeping the quantized base frozen. The critical bitsandbytes innovation enabling this combination is the quant_storage parameter, which allows quantized weights to be stored in FSDP-compatible float dtypes (bfloat16/float16/float32) rather than the default uint8, enabling FSDP to shard quantized layers correctly. This workflow integrates with the Hugging Face ecosystem (Transformers, PEFT, TRL, Accelerate).
Usage
Execute this workflow when you need to fine-tune a very large language model (30B-70B+ parameters) on custom data and have multiple GPUs available, but each GPU has limited memory (e.g., 2x 24GB consumer GPUs). This is the standard approach for training models that would not fit on a single GPU even with 4-bit quantization alone.
Execution Steps
Step 1: Configure 4-bit Quantization with FSDP-Compatible Storage
Create a BitsAndBytesConfig with 4-bit quantization enabled, NF4 quantization type, bfloat16 compute dtype, and critically, set bnb_4bit_quant_storage to a float dtype (e.g., torch.bfloat16). This quant_storage parameter is what makes FSDP sharding possible: FSDP can only wrap layers with matching float dtypes, so storing quantized weights as bfloat16 (instead of the default uint8) allows Linear4bit layers to be wrapped and sharded identically to standard Linear layers.
Key considerations:
- The quant_storage dtype MUST match the torch_dtype used for model loading
- If storage types do not match, each Linear4bit layer is wrapped individually (less efficient)
- Double quantization (bnb_4bit_use_double_quant=True) is recommended for additional memory savings
- bitsandbytes uses StoreChar internally to handle read/write regardless of the underlying storage type
Step 2: Load Base Model with Quantization Config
Load the pretrained model from the Hugging Face Hub, passing the quantization configuration and matching torch_dtype. The model loader replaces Linear layers with Linear4bit layers. The Params4bit objects are initialized with the specified quant_storage dtype, enabling FSDP compatibility.
Key considerations:
- Set torch_dtype to match bnb_4bit_quant_storage for correct FSDP wrapping
- Quantization occurs lazily when layers are moved to GPU (same as the inference workflow)
- The model can be loaded on CPU first and then distributed via FSDP
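A minimal loading sketch, continuing from the Step 1 config (the model ID is an illustrative choice; any causal LM on the Hub works the same way):

```python
import torch
from transformers import AutoModelForCausalLM

# torch_dtype MUST match bnb_4bit_quant_storage from Step 1 so that FSDP
# auto-wrapping sees uniform dtypes across quantized and unquantized layers.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",     # hypothetical base model
    quantization_config=bnb_config,  # BitsAndBytesConfig from Step 1
    torch_dtype=torch.bfloat16,      # matches bnb_4bit_quant_storage
)
```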
Step 3: Configure LoRA Adapters
Set up the LoRA configuration specifying the rank (r), alpha scaling factor, dropout, and target modules. For QLoRA fine-tuning, target_modules="all-linear" is recommended to inject adapters into all linear layers. Only the small LoRA adapter weights (typically less than 1% of total parameters) will be trained; the quantized base model weights remain frozen.
Key considerations:
- LoRA rank (r) controls the adapter capacity; common values are 16-64
- lora_alpha controls the scaling of the adapter contribution
- target_modules="all-linear" applies adapters to every Linear4bit layer in the model
- Only adapter parameters have requires_grad=True; quantized base weights are frozen
Step 4: Configure FSDP and Launch Distributed Training
Set up the FSDP configuration (via Accelerate config or torchrun) specifying sharding strategy, auto-wrapping policy, and mixed-precision settings. Launch the distributed training job using Accelerate or torchrun. FSDP shards the model parameters (including quantized weights and LoRA adapters), gradients, and optimizer states across the available GPUs.
Key considerations:
- Use the PEFT-provided fsdp_config_qlora.yaml as a starting configuration
- FSDP auto-wrapping groups layers with matching dtypes, which is why quant_storage alignment is critical
- The Accelerate library handles the FSDP setup and distributed launch
- Ensure all GPUs have sufficient memory for their shard plus activation memory
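An abridged Accelerate config sketch in the spirit of PEFT's fsdp_config_qlora.yaml (field names follow Accelerate's FSDP config schema; values are illustrative for a 2-GPU setup):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
mixed_precision: bf16
num_processes: 2
```

The job is then launched with `accelerate launch --config_file <config>.yaml train.py`, where `train.py` is your training script.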
Step 5: Execute Training Loop
Run the training loop using SFTTrainer (from TRL) or a custom training loop. The trainer handles forward passes through the quantized model with LoRA adapters, loss computation, backward passes (gradients flow only through LoRA parameters), and optimizer steps. FSDP manages the all-gather and reduce-scatter communication for parameter sharding.
Key considerations:
- Forward pass: quantized weights are dequantized, LoRA adapters are applied, matmul is performed
- Backward pass: gradients are computed only for LoRA adapter parameters
- FSDP all-gathers parameter shards before forward/backward and re-shards after
- Use a paged optimizer (e.g., PagedAdamW8bit) for additional memory savings on optimizer states
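A training-loop sketch using TRL's SFTTrainer (argument names vary somewhat across TRL versions; `model`, `lora_config`, and `dataset` are assumed from the earlier steps and your own data pipeline):

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,              # quantized base model from Step 2
    train_dataset=dataset,    # your dataset (assumed)
    peft_config=lora_config,  # LoRA config from Step 3
    args=SFTConfig(
        output_dir="qlora-fsdp-out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        bf16=True,
        optim="paged_adamw_8bit",  # paged optimizer for optimizer-state savings
    ),
)
trainer.train()  # FSDP handles all-gather/reduce-scatter under the hood
```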
Step 6: Save and Merge Adapters
After training, save the LoRA adapter weights. These can be saved separately from the base model and loaded later for inference, or merged into the base model for deployment. The adapter weights are small (typically tens of megabytes) compared to the full model.
Key considerations:
- Save only the LoRA adapter weights, not the full quantized model
- Adapters can be merged into a dequantized model for deployment without a bitsandbytes dependency
- PEFT provides save_pretrained() for saving adapters and PeftModel.from_pretrained() for loading them
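A save-and-merge sketch continuing from the trainer above (the model ID and output paths are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save only the small LoRA adapter weights (typically tens of MB).
trainer.model.save_pretrained("qlora-adapter")

# For deployment: reload the base model in bf16 (no quantization),
# attach the adapter, and merge it into the base weights.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "qlora-adapter").merge_and_unload()
merged.save_pretrained("merged-model")  # runs without bitsandbytes installed
```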