Axolotl FSDP Configuration Guide
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Optimization |
| Last Updated | 2026-02-06 22:33 GMT |
Overview
Configuration rules and compatibility constraints for FSDP (Fully Sharded Data Parallel) training in Axolotl, including FSDP1 deprecation and FSDP2 migration guidance.
Description
FSDP shards model parameters, gradients, and optimizer states across GPUs to enable training of models larger than single-GPU memory. Axolotl supports both FSDP1 (deprecated) and FSDP2, with extensive validation rules around optimizer compatibility, checkpoint formats, and parameter offloading. The configuration is complex: Axolotl sets over 15 environment variables based on the YAML config to orchestrate FSDP through Accelerate.
Usage
Apply these rules when configuring `fsdp_config` in the training YAML, particularly when choosing between FSDP1 and FSDP2, selecting optimizers, configuring CPU offloading, or troubleshooting FSDP-related errors.
The Insight (Rule of Thumb)
- Rule 1 - Use FSDP2: FSDP1 is deprecated in Axolotl. Always use `fsdp_version: 2` for better performance and compatibility.
- Rule 2 - Optimizer Compatibility: With FSDP2, do NOT use `adamw_8bit` or `adamw_bnb_8bit` optimizers (CUDA ops errors). Use `adamw_torch_8bit` instead.
- Rule 3 - QLoRA Checkpoint Type: For QLoRA with FSDP, use `SHARDED_STATE_DICT` for checkpointing to avoid memory spikes during saving.
- Rule 4 - CPU Offloading: Enable `fsdp_offload_params: true` for models that exceed aggregate GPU memory. Trades training speed for memory capacity.
- Rule 5 - Activation Checkpointing: Use `fsdp_activation_checkpointing: true` to reduce activation memory. Compatible with FSDP parameter sharding.
- Rule 6 - Weight Consolidation: Sharded FSDP checkpoints require post-training consolidation via `axolotl merge-sharded-fsdp-weights` before the model can be used for inference.
- Trade-off: FSDP2 provides better memory efficiency and performance than FSDP1, but requires more careful optimizer selection. CPU offloading significantly slows training but allows fitting much larger models.
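Rules 1 and 2 above can be sketched as a small pre-flight check. This is illustrative only: `check_fsdp_rules` and its messages are hypothetical and not part of Axolotl's API, though the dict keys follow the YAML schema described in this guide.

```python
# Hypothetical pre-flight check mirroring Rules 1 and 2 above.
# Not part of Axolotl; the keys match its YAML config schema.

INCOMPATIBLE_FSDP2_OPTIMIZERS = {"adamw_8bit", "adamw_bnb_8bit"}

def check_fsdp_rules(cfg: dict) -> list[str]:
    """Return a list of problems found in an Axolotl-style config dict."""
    problems = []
    # Rule 1: FSDP1 is deprecated; require fsdp_version: 2.
    if cfg.get("fsdp_config") and str(cfg.get("fsdp_version")) != "2":
        problems.append("FSDP1 is deprecated; set fsdp_version: 2")
    # Rule 2: bitsandbytes 8-bit optimizers break under FSDP2.
    if str(cfg.get("fsdp_version")) == "2" and cfg.get("optimizer") in INCOMPATIBLE_FSDP2_OPTIMIZERS:
        problems.append(
            f"FSDP2 is not compatible with {cfg['optimizer']}; use adamw_torch_8bit"
        )
    return problems
```

For example, a config with `fsdp_version: 2` and `optimizer: adamw_bnb_8bit` would fail the second check, while switching to `adamw_torch_8bit` passes both.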
Reasoning
FSDP2 uses PyTorch's native per-parameter-sharding API (`torch.distributed.fsdp.fully_shard`), which integrates better with `torch.compile`, gathers parameters more efficiently, and handles checkpoints more robustly than the legacy FSDP1 wrapper. The `adamw_bnb_8bit` incompatibility with FSDP2 stems from bitsandbytes CUDA operations that do not work with the distributed tensor operations in FSDP2's parameter management, whereas `adamw_torch_8bit` uses PyTorch-native 8-bit operations that handle distributed tensors correctly.
Code Evidence
FSDP1 deprecation notice from `src/axolotl/utils/schemas/validation.py:859-868`:
```python
@model_validator(mode="before")
@classmethod
def check_fsdp_version(cls, data):
    fsdp_config = data.get("fsdp_config", {})
    if fsdp_config and str(data.get("fsdp_version")) != "2":
        LOG.info(
            "FSDP1 will be deprecated in an upcoming release of Axolotl."
            "We recommend that you use FSDP version 2 for better performance and compatibility."
        )
```
FSDP2 optimizer incompatibility from `src/axolotl/utils/schemas/validation.py:964-968`:
```python
if self.optimizer in ["adamw_8bit", "adamw_bnb_8bit"]:
    # CUDA ops errors with bnb 8bit optimizer + FSDP2
    raise ValueError(
        f"FSDP2 not compatible with {self.optimizer.value}, use `adamw_torch_8bit` instead"
    )
```
FSDP environment variable configuration from `src/axolotl/utils/trainer.py:601-630`:
```python
if cfg.fsdp_config:
    os.environ["ACCELERATE_USE_FSDP"] = "true"
    if cfg.fsdp_config.fsdp_version == 2:
        os.environ["FSDP_VERSION"] = "2"
    if cfg.fsdp_config.fsdp_activation_checkpointing:
        os.environ["FSDP_ACTIVATION_CHECKPOINTING"] = "true"
    if cfg.fsdp_config.fsdp_offload_params:
        os.environ["FSDP_OFFLOAD_PARAMS"] = "true"
    os.environ["FSDP_SYNC_MODULE_STATES"] = "true"
    os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "true"
    os.environ["FSDP_USE_ORIG_PARAMS"] = "true"
```
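The same env-var wiring can be exercised in isolation. The sketch below is a standalone re-creation (not Axolotl's actual function) that takes a plain dict in place of Axolotl's config object and returns the variables it exports, which makes the branching easy to inspect:

```python
import os

def export_fsdp_env(fsdp_config: dict) -> dict:
    """Illustrative stand-in for Axolotl's env-var setup (the real code
    sets these and more). Returns the variables it exported."""
    env = {}
    if fsdp_config:
        env["ACCELERATE_USE_FSDP"] = "true"
        if fsdp_config.get("fsdp_version") == 2:
            env["FSDP_VERSION"] = "2"
        if fsdp_config.get("fsdp_activation_checkpointing"):
            env["FSDP_ACTIVATION_CHECKPOINTING"] = "true"
        if fsdp_config.get("fsdp_offload_params"):
            env["FSDP_OFFLOAD_PARAMS"] = "true"
        # Always set when FSDP is enabled.
        env["FSDP_SYNC_MODULE_STATES"] = "true"
        env["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "true"
        env["FSDP_USE_ORIG_PARAMS"] = "true"
    os.environ.update(env)
    return env
```

Calling it with `{"fsdp_version": 2, "fsdp_offload_params": True}` exports `FSDP_VERSION=2` and `FSDP_OFFLOAD_PARAMS=true` but leaves `FSDP_ACTIVATION_CHECKPOINTING` unset, matching the conditional structure of the excerpt above.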