
Workflow: axolotl-ai-cloud/axolotl Full Fine-Tuning (Distributed)

From Leeroopedia



Knowledge Sources
Domains: LLMs, Fine_Tuning, Distributed_Training, FSDP, DeepSpeed
Last Updated: 2026-02-06 22:00 GMT

Overview

End-to-end process for full-parameter fine-tuning of large language models across multiple GPUs using FSDP (Fully Sharded Data Parallel) or DeepSpeed, orchestrated through Axolotl's YAML configuration and CLI.

Description

This workflow covers full fine-tuning (FFT), in which all model parameters are updated, as opposed to adapter-based methods. Because full fine-tuning requires significantly more memory, distributed training strategies are essential for models too large to fit on a single GPU. Axolotl supports FSDP1, FSDP2, DeepSpeed ZeRO (stages 1-3), Tensor Parallelism, Context Parallelism, and hybrid configurations. The workflow covers distributed environment configuration, data loading with distributed samplers, sharded model training, and proper model weight consolidation after training.

Usage

Execute this workflow when you need to update all parameters of a base model (not just adapters) to achieve maximum fine-tuning quality, and have access to multiple GPUs or multi-node infrastructure. Typical scenarios include training a foundation model on large domain-specific corpora, adapting a model where LoRA quality is insufficient, or when the final deployment target does not support adapter serving.

Execution Steps

Step 1: Configuration with Distributed Settings

Create a YAML configuration file that specifies the base model, the dataset, and, critically, the distributed training strategy. Axolotl supports multiple parallelism backends configured via YAML: FSDP (fsdp/fsdp_config sections), DeepSpeed (deepspeed config path), or ND parallelism combining FSDP+TP+CP.

Key considerations:

  • Do not set adapter for full fine-tuning
  • For FSDP: configure fsdp and fsdp_config sections
  • For DeepSpeed: provide path to a DeepSpeed JSON config (zero1/zero2/zero3)
  • For ND parallelism: set tensor_parallel_degree and context_parallel_size
  • Choose appropriate fsdp_config.state_dict_type (FULL or SHARDED)
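
A minimal sketch of such a config, assuming the fsdp/fsdp_config key names from Axolotl's documented YAML schema; the base model, dataset, and layer class to wrap are illustrative placeholders:

```yaml
# Illustrative full fine-tune config -- verify key names against your
# Axolotl version. Note: no `adapter:` key; its absence selects FFT.
base_model: meta-llama/Llama-3.1-8B

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca

sequence_len: 4096
bf16: true
gradient_checkpointing: true

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```

FULL_STATE_DICT keeps saving simple for smaller models; for very large models, SHARDED_STATE_DICT avoids gathering all weights to one rank at save time, at the cost of a post-training merge step.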

Step 2: Distributed Environment Setup

Axolotl automatically configures the distributed training environment based on the YAML config. This includes setting FSDP environment variables, DeepSpeed configuration injection, device mesh construction for ND parallelism, and process group initialization via Accelerate or Torchrun launchers.

Key considerations:

  • Use accelerate launcher (default) or torchrun for multi-GPU
  • NCCL settings are configured automatically for GPU communication
  • For multi-node: configure via SLURM or Ray
  • CPU offloading can be enabled for ZeRO-3 to handle very large models
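
For the DeepSpeed path, the YAML's deepspeed: key points at a JSON file. A ZeRO-3 sketch with CPU offloading, using standard DeepSpeed option names (the "auto" values defer to values injected by the trainer):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The stage3_gather_16bit_weights_on_model_save flag is what lets the trainer emit a consolidated checkpoint in Step 6 rather than per-rank shards.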

Step 3: Dataset Loading with Distributed Sampling

Load and preprocess the training dataset with distributed-aware data loading. When running across multiple GPUs, the dataset is sharded across processes so each GPU trains on a different subset. Sample packing and multipack batch sampling are distributed-aware, ensuring consistent batch sizes across all workers.

Key considerations:

  • Dataset loading happens on all processes in parallel
  • Multipack batch sampler accounts for distributed rank
  • Sequence length and batch size affect total memory per GPU
  • Use gradient_accumulation_steps to simulate larger effective batch sizes
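
A sketch of the data-side settings, using key names from Axolotl's YAML schema (values are illustrative):

```yaml
sequence_len: 4096
sample_packing: true          # pack short examples into full-length sequences
pad_to_sequence_len: true

micro_batch_size: 2           # per-GPU batch per step
gradient_accumulation_steps: 8
# Effective global batch = micro_batch_size
#   x gradient_accumulation_steps x world_size
# e.g. 2 x 8 x 8 GPUs = 128 sequences per optimizer step
```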

Step 4: Full Model Loading

Load the complete model without quantization or adapter injection. For FSDP, the model is loaded on meta device first, then sharded across GPUs during wrapping. For DeepSpeed ZeRO-3, model parameters are partitioned across all devices. The model loader applies attention mechanism patches and any configured optimizations (Liger kernels, torch compile).

Key considerations:

  • Full precision (bf16/fp16) model loading requires more memory per GPU
  • FSDP wraps and shards the model after loading
  • DeepSpeed ZeRO-3 partitions parameters during initialization
  • Gradient checkpointing is strongly recommended to reduce activation memory
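
The per-GPU memory pressure referenced above can be approximated with simple arithmetic. A back-of-the-envelope sketch (activations and buffers excluded; assumes bf16 weights and gradients plus fp32 Adam master weights and moments, evenly sharded):

```python
def fsdp_state_memory_gb(n_params: float, world_size: int) -> float:
    """Rough per-GPU memory (GB) for model + optimizer states under full
    sharding (FSDP full_shard or ZeRO-3), excluding activation memory.

    Per-parameter cost: 2 B bf16 weight + 2 B bf16 gradient
    + 12 B fp32 Adam state (master weight + two moments) = 16 B,
    divided evenly across ranks."""
    bytes_per_param = 2 + 2 + 12
    return n_params * bytes_per_param / world_size / 1e9

# An 8B-parameter model sharded over 8 GPUs:
print(round(fsdp_state_memory_gb(8e9, 8), 1))  # 16.0
```

This is why gradient checkpointing matters: the 16 GB/GPU above is only the sharded state, and unsharded activation memory comes on top of it.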

Step 5: Distributed Training Execution

Execute the training loop with distributed gradient synchronization. The Accelerate-wrapped trainer handles gradient all-reduce (DDP), parameter sharding (FSDP), or ZeRO optimizer state partitioning (DeepSpeed) transparently. Training proceeds with mixed-precision compute, gradient accumulation across micro-batches, and periodic checkpoint saving.

Key considerations:

  • FSDP auto-wraps transformer layers for optimal sharding
  • DeepSpeed ZeRO stages trade memory for communication overhead
  • Tensor parallelism splits individual layers across GPUs
  • Monitor GPU memory utilization and communication overhead
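
When combining parallelism axes, the degrees must multiply to the world size. A hypothetical 16-GPU layout, using the key names given in Step 1 (verify the exact schema against the Axolotl docs for your version):

```yaml
# world_size = data-parallel shards x TP degree x CP degree
tensor_parallel_degree: 2     # split each layer across 2 GPUs
context_parallel_size: 2      # split the sequence dimension across 2 GPUs
# The remaining factor (16 / (2 x 2) = 4) becomes the FSDP
# data-parallel dimension of the device mesh.
```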

Step 6: Model Weight Consolidation and Saving

After training, consolidate the distributed model weights into a single checkpoint. For FSDP with sharded state dict, the weights must be merged before deployment. For DeepSpeed ZeRO-3, the trainer handles weight gathering. The consolidated model is saved in HuggingFace format for direct inference or hub upload.

Key considerations:

  • FSDP FULL_STATE_DICT gathers all weights to rank 0 for saving
  • FSDP SHARDED_STATE_DICT saves distributed checkpoints (requires post-merge)
  • Use axolotl merge-sharded-fsdp-weights to consolidate sharded checkpoints
  • DeepSpeed handles weight consolidation automatically in most cases
  • Verify the final config.json has correct architecture names (FSDP prefix removal)
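
A sketch of the sharded-checkpoint trade-off, with the consolidation step (per the CLI name above) shown as a comment; paths are illustrative:

```yaml
fsdp_config:
  # SHARDED_STATE_DICT: each rank saves its own shard (fast, scalable),
  # but the shards must be merged before standard HF inference:
  fsdp_state_dict_type: SHARDED_STATE_DICT

# Post-training, per this page's CLI reference:
#   axolotl merge-sharded-fsdp-weights config.yml
```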

Execution Diagram

GitHub URL

Workflow Repository