
Principle:Axolotl ai cloud Axolotl FSDP Weight Consolidation

From Leeroopedia


Knowledge Sources
Domains Distributed_Training, Model_Export
Last Updated 2026-02-06 23:00 GMT

Overview

A post-training operation that consolidates sharded FSDP checkpoint files into a single unified model file suitable for inference and deployment.

Description

When training with FSDP (Fully Sharded Data Parallel), model parameters are distributed across multiple GPUs. Checkpoints saved during training contain sharded state dicts where each file holds only a fraction of the model weights. Before the model can be used for inference or shared, these shards must be consolidated into a single complete model file.

FSDP Weight Consolidation reads the distributed checkpoint directory, reconstructs the full state dict by gathering all shards, and saves the result as a standard HuggingFace model. This enables deployment on single-GPU machines or conversion to optimized inference formats.
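Under simplified assumptions, the gather step can be sketched in plain Python. Lists stand in for tensors, and `shard_state_dict` / `consolidate` are hypothetical helpers: the contiguous-slice layout below is an illustration, not the actual torch.distributed.checkpoint format.

```python
# Minimal sketch of FSDP-style consolidation, assuming each rank saved a
# contiguous slice of every parameter's flattened values (a simplified,
# hypothetical shard layout; real FSDP checkpoints carry metadata that
# describes shard placement).

def shard_state_dict(full, num_ranks):
    """Split each parameter's flat value list across ranks."""
    shards = [{} for _ in range(num_ranks)]
    for name, values in full.items():
        step = (len(values) + num_ranks - 1) // num_ranks  # ceil division
        for rank in range(num_ranks):
            shards[rank][name] = values[rank * step:(rank + 1) * step]
    return shards

def consolidate(shards):
    """Rebuild the full state dict by concatenating each key's slices in rank order."""
    full = {}
    for shard in shards:
        for name, values in shard.items():
            full.setdefault(name, []).extend(values)
    return full

# Round-trip: shard a toy "model" across 4 ranks, then consolidate it back.
model = {"embed.weight": list(range(10)), "lm_head.bias": [0.1, 0.2, 0.3, 0.4]}
shards = shard_state_dict(model, num_ranks=4)
restored = consolidate(shards)
assert restored == model
```

The round-trip property (shard, then consolidate, recovers the original) is the invariant any real consolidation tool must satisfy.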

Usage

Use FSDP weight consolidation when:

  • Training completed with FSDP and checkpoints are in sharded format
  • The model must be deployed on a single GPU or CPU
  • Converting to an optimized inference format (GGUF, ONNX, TensorRT)
  • Sharing the model on the HuggingFace Hub
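In practice, shards are rarely gathered by hand: both Accelerate and Axolotl ship consolidation helpers. A command sketch follows; the subcommand names are assumptions based on recent releases, so check each tool's `--help` for your version.

```shell
# Merge FSDP shards with Accelerate's helper (assumed subcommand name):
accelerate merge-weights ./outputs/checkpoint-500/pytorch_model_fsdp_0 ./merged

# Or via Axolotl's CLI against the training config (assumed command name):
axolotl merge-sharded-fsdp-weights config.yml
```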

Theoretical Basis

FSDP Sharding:

# During FSDP training, each of K ranks saves one shard of the N total
# parameters:
#   rank_0/: params[0 : N/K], rank_1/: params[N/K : 2N/K], ...

# Consolidation reconstructs the full state dict (simplified pseudocode;
# it assumes each shard file holds disjoint keys, whereas real FSDP
# shards slice individual tensors and need a per-tensor gather):
full_state_dict = {}
for shard_path in all_shard_paths:
    full_state_dict.update(load(shard_path))

# Load the merged weights and save a standard HuggingFace checkpoint:
model.load_state_dict(full_state_dict)
model.save_pretrained(output_path)
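The dict-update loop above works only if each rank saves disjoint keys. In the common FSDP layout every rank instead holds a slice of the same tensors, so the slices must be concatenated rather than overwritten. A toy illustration, with plain lists standing in for tensors:

```python
# Why naive dict.update() is not enough when ranks hold slices of the
# SAME parameters: update() keeps only the last rank's slice and
# silently drops the rest.
shard_0 = {"layer.weight": [1, 2, 3]}   # rank 0's slice (toy values)
shard_1 = {"layer.weight": [4, 5, 6]}   # rank 1's slice

naive = {}
for shard in (shard_0, shard_1):
    naive.update(shard)                  # overwrites, does not gather

gathered = {}
for shard in (shard_0, shard_1):
    for name, values in shard.items():
        gathered.setdefault(name, []).extend(values)  # concatenate in rank order

assert naive["layer.weight"] == [4, 5, 6]              # truncated: only rank 1 survives
assert gathered["layer.weight"] == [1, 2, 3, 4, 5, 6]  # full parameter restored
```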

Related Pages

Implemented By

Uses Heuristic
