# Workflow: axolotl-ai-cloud/axolotl Full Fine-Tuning (Distributed)
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Distributed_Training, FSDP, DeepSpeed |
| Last Updated | 2026-02-06 22:00 GMT |
## Overview
End-to-end process for full-parameter fine-tuning of large language models across multiple GPUs using FSDP (Fully Sharded Data Parallel) or DeepSpeed, orchestrated through Axolotl's YAML configuration and CLI.
## Description
This workflow covers full fine-tuning (FFT) where all model parameters are updated, as opposed to adapter-based methods. Because full fine-tuning requires significantly more memory, distributed training strategies are essential for models larger than what fits on a single GPU. Axolotl supports FSDP1, FSDP2, DeepSpeed ZeRO (stages 1-3), Tensor Parallelism, Context Parallelism, and hybrid configurations. The workflow covers distributed environment configuration, data loading with distributed samplers, sharded model training, and proper model weight consolidation after training.
## Usage
Execute this workflow when you need to update all parameters of a base model (not just adapters) to achieve maximum fine-tuning quality, and have access to multiple GPUs or multi-node infrastructure. Typical scenarios include training a foundation model on large domain-specific corpora, adapting a model where LoRA quality is insufficient, or when the final deployment target does not support adapter serving.
## Execution Steps
### Step 1: Configuration with Distributed Settings
Create a YAML configuration file that specifies the base model, dataset, and critically, the distributed training strategy. Axolotl supports multiple parallelism backends configured via YAML: FSDP (fsdp/fsdp_config sections), DeepSpeed (deepspeed config path), or ND parallelism combining FSDP+TP+CP.
Key considerations:
- Do not set `adapter` for full fine-tuning
- For FSDP: configure the `fsdp` and `fsdp_config` sections
- For DeepSpeed: provide the path to a DeepSpeed JSON config (zero1/zero2/zero3)
- For ND parallelism: set `tensor_parallel_degree` and `context_parallel_size`
- Choose an appropriate `fsdp_config.state_dict_type` (FULL or SHARDED)
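A minimal full-fine-tune config with FSDP might look like the sketch below. The model name, dataset path, and decoder-layer class are illustrative placeholders, and exact keys can vary between Axolotl versions:

```yaml
# Sketch of an FSDP full-fine-tune config (placeholders, not a tested recipe).
base_model: meta-llama/Llama-3.1-8B   # placeholder model
# Note: no `adapter:` key -- omitting it selects full-parameter fine-tuning.
datasets:
  - path: ./data/train.jsonl          # placeholder dataset
    type: alpaca
output_dir: ./outputs/fft-run

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer  # match your architecture
```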
### Step 2: Distributed Environment Setup
Axolotl automatically configures the distributed training environment based on the YAML config. This includes setting FSDP environment variables, DeepSpeed configuration injection, device mesh construction for ND parallelism, and process group initialization via Accelerate or Torchrun launchers.
Key considerations:
- Use the `accelerate` launcher (default) or `torchrun` for multi-GPU runs
- NCCL settings are configured automatically for GPU communication
- For multi-node training: configure via SLURM or Ray
- CPU offloading can be enabled for ZeRO-3 to handle very large models
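As an alternative to the FSDP keys, the same config can point at a DeepSpeed JSON. The sketch below references one of the stock ZeRO-3 configs shipped in Axolotl's `deepspeed_configs/` directory; launch commands are shown as comments and assume a standard single-node setup:

```yaml
# DeepSpeed variant: replace the fsdp/fsdp_config sections with a JSON path.
deepspeed: deepspeed_configs/zero3_bf16.json

# Launch examples (comments only; run from the shell):
#   axolotl train config.yaml
# or explicitly via a launcher:
#   accelerate launch -m axolotl.cli.train config.yaml
#   torchrun --nproc_per_node=8 -m axolotl.cli.train config.yaml
```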
### Step 3: Dataset Loading with Distributed Sampling
Load and preprocess the training dataset with distributed-aware data loading. When running across multiple GPUs, the dataset is sharded across processes so each GPU trains on a different subset. Sample packing and multipack batch sampling are distributed-aware, ensuring consistent batch sizes across all workers.
Key considerations:
- Dataset loading happens on all processes in parallel
- Multipack batch sampler accounts for distributed rank
- Sequence length and batch size affect total memory per GPU
- Use `gradient_accumulation_steps` to simulate larger effective batch sizes
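The data-loading and batching knobs discussed above might be combined as follows; the specific values are illustrative, not tuned recommendations:

```yaml
# Sketch of dataset/batching settings (illustrative values).
sequence_len: 4096
sample_packing: true          # multipack sampler, distributed-rank aware
pad_to_sequence_len: true
micro_batch_size: 2           # per-GPU micro-batch
gradient_accumulation_steps: 8
# Effective global batch = micro_batch_size * gradient_accumulation_steps * world_size
```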
### Step 4: Full Model Loading
Load the complete model without quantization or adapter injection. For FSDP, the model is loaded on meta device first, then sharded across GPUs during wrapping. For DeepSpeed ZeRO-3, model parameters are partitioned across all devices. The model loader applies attention mechanism patches and any configured optimizations (Liger kernels, torch compile).
Key considerations:
- Full precision (bf16/fp16) model loading requires more memory per GPU
- FSDP wraps and shards the model after loading
- DeepSpeed ZeRO-3 partitions parameters during initialization
- Gradient checkpointing is strongly recommended to reduce activation memory
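The memory-related loading options above map roughly to these config keys. The Liger plugin path and flags are taken from Axolotl's plugin system but should be checked against your installed version:

```yaml
# Sketch of precision and memory settings for full model loading.
bf16: true                    # mixed-precision compute
tf32: true
flash_attention: true
gradient_checkpointing: true  # strongly recommended for FFT

# Optional kernel optimizations (assumes the Liger integration is available):
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rms_norm: true
liger_glu_activation: true
```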
### Step 5: Distributed Training Execution
Execute the training loop with distributed gradient synchronization. The Accelerate-wrapped trainer handles gradient all-reduce (DDP), parameter sharding (FSDP), or ZeRO optimizer state partitioning (DeepSpeed) transparently. Training proceeds with mixed-precision compute, gradient accumulation across micro-batches, and periodic checkpoint saving.
Key considerations:
- FSDP auto-wraps transformer layers for optimal sharding
- DeepSpeed ZeRO stages trade memory for communication overhead
- Tensor parallelism splits individual layers across GPUs
- Monitor GPU memory utilization and communication overhead
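A typical set of training-loop hyperparameters for a full fine-tune might look like this; values are placeholders to adapt to your model and data:

```yaml
# Sketch of training-loop settings (illustrative values).
num_epochs: 2
learning_rate: 2e-5           # FFT usually uses lower LRs than LoRA
lr_scheduler: cosine
warmup_ratio: 0.03
optimizer: adamw_torch_fused
max_grad_norm: 1.0
logging_steps: 10
saves_per_epoch: 1            # periodic checkpointing
```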
### Step 6: Model Weight Consolidation and Saving
After training, consolidate the distributed model weights into a single checkpoint. For FSDP with sharded state dict, the weights must be merged before deployment. For DeepSpeed ZeRO-3, the trainer handles weight gathering. The consolidated model is saved in HuggingFace format for direct inference or hub upload.
Key considerations:
- FSDP FULL_STATE_DICT gathers all weights to rank 0 for saving
- FSDP SHARDED_STATE_DICT saves distributed checkpoints (requires post-merge)
- Use `axolotl merge-sharded-fsdp-weights` to consolidate sharded checkpoints
- DeepSpeed handles weight consolidation automatically in most cases
- Verify the final config.json has correct architecture names (FSDP prefix removal)
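The two FSDP saving modes from this step correspond to a single config key; the post-training merge command is shown as a comment, with the config path as a placeholder:

```yaml
# Option A: gather full weights to rank 0 at save time (simpler, more memory).
fsdp_config:
  fsdp_state_dict_type: FULL_STATE_DICT

# Option B: save distributed shards, then merge after training:
#   fsdp_state_dict_type: SHARDED_STATE_DICT
# followed by (placeholder config path):
#   axolotl merge-sharded-fsdp-weights config.yaml
```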