Principle: Axolotl FSDP Weight Consolidation
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Model_Export |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A post-training operation that consolidates sharded FSDP checkpoint files into a single unified model file suitable for inference and deployment.
Description
When training with FSDP (Fully Sharded Data Parallel), model parameters are distributed across multiple GPUs. Checkpoints saved during training contain sharded state dicts where each file holds only a fraction of the model weights. Before the model can be used for inference or shared, these shards must be consolidated into a single complete model file.
FSDP Weight Consolidation reads the distributed checkpoint directory, reconstructs the full state dict by gathering all shards, and saves the result as a standard HuggingFace model. This enables deployment on single-GPU machines or conversion to optimized inference formats.
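The gather step can be illustrated with a toy sketch (plain Python dicts stand in for tensor shards; the parameter names and values are made up for illustration):

```python
# Toy model of FSDP shard consolidation: each "rank" saved a state-dict
# fragment holding a disjoint set of parameter names; merging the
# fragments recovers the full state dict. Lists of floats stand in
# for tensors here.
shard_0 = {"embed.weight": [0.1, 0.2], "layer0.weight": [0.3]}
shard_1 = {"layer1.weight": [0.4], "lm_head.weight": [0.5, 0.6]}

full_state_dict = {}
for shard in (shard_0, shard_1):
    full_state_dict.update(shard)  # keys are disjoint, so nothing is overwritten

# Every parameter is now present in a single dict
assert set(full_state_dict) == {
    "embed.weight", "layer0.weight", "layer1.weight", "lm_head.weight"
}
```

Because each shard owns a disjoint slice of the parameters, a simple union of the per-rank dicts is enough; no averaging or conflict resolution is involved.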
Usage
Use FSDP weight consolidation when:
- Training completed with FSDP and checkpoints are in sharded format
- The model must be deployed on a single GPU or CPU
- The checkpoint will be converted to an optimized inference format (GGUF, ONNX, TensorRT)
- The model will be shared on the HuggingFace Hub
Theoretical Basis
FSDP Sharding:
# During FSDP training, each rank saves its shard:
# rank_0/: params[0:N/K], rank_1/: params[N/K:2N/K], ...
# where N = total params, K = num GPUs
# Consolidation reconstructs the full model (sketch; assumes shards
# saved with plain torch.save rather than the torch.distributed
# checkpoint format):
full_state_dict = {}
for shard_path in all_shards:
    # each shard holds a disjoint subset of parameter tensors
    full_state_dict.update(torch.load(shard_path, map_location="cpu"))
model.load_state_dict(full_state_dict)
model.save_pretrained(output_path)
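The contiguous slicing scheme described in the comments above can be simulated end to end in pure Python (no torch needed; `K` and the parameter values are made up for illustration):

```python
# Simulate FSDP-style contiguous sharding of a flat parameter vector
# across K ranks, then consolidate and verify the round trip.
params = list(range(12))   # N = 12 toy "parameters"
K = 4                      # number of GPUs / ranks
N = len(params)

# Rank i holds params[i*N//K : (i+1)*N//K]
shards = [params[i * N // K:(i + 1) * N // K] for i in range(K)]

# Consolidation: concatenate the shards back in rank order
consolidated = [p for shard in shards for p in shard]
assert consolidated == params  # full parameter vector recovered
```

Real FSDP shards whole flattened parameter groups rather than a single list, but the invariant is the same: the shards are disjoint, ordered, and together cover every parameter exactly once.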