Principle: Axolotl FSDP Weight Consolidation
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Model_Export |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A post-training operation that consolidates sharded FSDP checkpoint files into a single unified model file suitable for inference and deployment.
Description
When training with FSDP (Fully Sharded Data Parallel), model parameters are distributed across multiple GPUs. Checkpoints saved during training contain sharded state dicts where each file holds only a fraction of the model weights. Before the model can be used for inference or shared, these shards must be consolidated into a single complete model file.
FSDP Weight Consolidation reads the distributed checkpoint directory, reconstructs the full state dict by gathering all shards, and saves the result as a standard HuggingFace model. This enables deployment on single-GPU machines or conversion to optimized inference formats.
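The gather step can be illustrated with a toy sketch (plain Python dicts stand in for tensor shards; the parameter names and values are made up for illustration):

```python
# Toy model of FSDP shard consolidation: each "rank" saved a state-dict
# fragment holding a disjoint set of parameter names; merging the
# fragments recovers the full state dict. Lists of floats stand in
# for tensors here.
shard_0 = {"embed.weight": [0.1, 0.2], "layer0.weight": [0.3]}
shard_1 = {"layer1.weight": [0.4], "lm_head.weight": [0.5, 0.6]}

full_state_dict = {}
for shard in (shard_0, shard_1):
    full_state_dict.update(shard)  # keys are disjoint, so nothing is overwritten

# Every parameter is now present in a single dict
assert set(full_state_dict) == {
    "embed.weight", "layer0.weight", "layer1.weight", "lm_head.weight"
}
```

Because each shard owns a disjoint slice of the parameters, a simple union of the per-rank dicts is enough; no averaging or conflict resolution is involved.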
Usage
Use FSDP weight consolidation when:
- Training completed with FSDP and checkpoints are in sharded format
- The model must be deployed on a single GPU or CPU
- The checkpoint will be converted to an optimized inference format (GGUF, ONNX, TensorRT)
- The model will be shared on the HuggingFace Hub
Theoretical Basis
FSDP Sharding:
# During FSDP training, each rank saves its shard:
# rank_0/: params[0:N/K], rank_1/: params[N/K:2N/K], ...
# where N = total params, K = num GPUs
# Consolidation reconstructs the full model (sketch; assumes shards
# saved with plain torch.save rather than the torch.distributed
# checkpoint format):
full_state_dict = {}
for shard_path in all_shards:
    # each shard holds a disjoint subset of parameter tensors
    full_state_dict.update(torch.load(shard_path, map_location="cpu"))
model.load_state_dict(full_state_dict)
model.save_pretrained(output_path)
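The contiguous slicing scheme described in the comments above can be simulated end to end in pure Python (no torch needed; `K` and the parameter values are made up for illustration):

```python
# Simulate FSDP-style contiguous sharding of a flat parameter vector
# across K ranks, then consolidate and verify the round trip.
params = list(range(12))   # N = 12 toy "parameters"
K = 4                      # number of GPUs / ranks
N = len(params)

# Rank i holds params[i*N//K : (i+1)*N//K]
shards = [params[i * N // K:(i + 1) * N // K] for i in range(K)]

# Consolidation: concatenate the shards back in rank order
consolidated = [p for shard in shards for p in shard]
assert consolidated == params  # full parameter vector recovered
```

Real FSDP shards whole flattened parameter groups rather than a single list, but the invariant is the same: the shards are disjoint, ordered, and together cover every parameter exactly once.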