Workflow:Hiyouga LLaMA Factory Full Parameter SFT
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, SFT, Distributed_Training |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
End-to-end process for full-parameter supervised fine-tuning of large language models using distributed training with DeepSpeed ZeRO or FSDP.
Description
This workflow covers full-parameter fine-tuning, in which all model weights are updated during training. Unlike LoRA, which adds small adapter matrices, full fine-tuning modifies every parameter in the model; this can yield higher quality but requires significantly more GPU memory and compute. To make this feasible for large models, the workflow relies on distributed training strategies. DeepSpeed ZeRO is supported at stages 0, 2, and 3: stage 0 is plain data parallelism with no partitioning, stage 2 partitions optimizer states and gradients across GPUs, and stage 3 additionally partitions the parameters. FSDP provides PyTorch-native model sharding. Both multi-GPU and multi-node configurations are covered.
Usage
Execute this workflow when maximum model quality is required and sufficient GPU resources are available (typically multiple GPUs whose aggregate memory is well above the model size, since gradients and optimizer states must fit alongside the weights). Full fine-tuning is preferred when the task domain differs significantly from the pre-training data, when the full model will be deployed (no adapter overhead), or when LoRA's capacity is insufficient for the task complexity.
Execution Steps
Step 1: Configuration
Define the full fine-tuning job in a YAML configuration that specifies finetuning_type: full, the DeepSpeed or FSDP configuration, multi-GPU settings, and standard training hyperparameters. Models that do not fit in a single GPU's memory typically require a DeepSpeed ZeRO-3 configuration.
Key considerations:
- Set finetuning_type: full to enable full parameter training
- Include a DeepSpeed config (e.g., deepspeed: examples/deepspeed/ds_z3_config.json)
- Use FORCE_TORCHRUN=1 environment variable to enable distributed launching
- Learning rate should be lower than LoRA (typically 1e-5 to 5e-5)
- For FSDP, reference an accelerate config instead of DeepSpeed
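As a rough illustration of the considerations above, the following Python sketch assembles a configuration and renders it as flat YAML text. The model, dataset, template, and output names are placeholders rather than values from this workflow, and the batch-size and epoch numbers are arbitrary; only finetuning_type, the DeepSpeed path, and the learning-rate range come from the text above.

```python
# Minimal sketch: build a full fine-tuning config and render it as flat YAML.
# All values marked "placeholder" are hypothetical and must be replaced.

config = {
    "model_name_or_path": "your-org/your-base-model",  # placeholder model ID
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "full",  # update every parameter, no adapters
    "deepspeed": "examples/deepspeed/ds_z3_config.json",
    "dataset": "your_dataset",  # placeholder dataset name
    "template": "your_template",  # placeholder chat template name
    "output_dir": "saves/full-sft-example",  # placeholder output path
    "per_device_train_batch_size": 1,  # arbitrary example value
    "gradient_accumulation_steps": 2,  # arbitrary example value
    "learning_rate": 1.0e-5,  # lower end of the 1e-5 to 5e-5 range
    "num_train_epochs": 3.0,  # arbitrary example value
    "bf16": True,
}


def render(value):
    """Render one scalar so that a YAML 1.1 parser reads it back with the same type."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, float):
        text = repr(value)
        # repr(1e-5) is "1e-05"; YAML 1.1 parsers such as PyYAML treat that
        # as a string, so insert the decimal point they expect.
        if "e" in text and "." not in text:
            mantissa, exponent = text.split("e")
            text = f"{mantissa}.0e{exponent}"
        return text
    return str(value)


def to_yaml(cfg):
    """Serialize a flat dict of scalars as 'key: value' lines."""
    return "".join(f"{key}: {render(value)}\n" for key, value in cfg.items())


yaml_text = to_yaml(config)
print(yaml_text)
# The file would then be launched along the lines of:
#   FORCE_TORCHRUN=1 llamafactory-cli train <path to the YAML file>
```

For FSDP, the deepspeed entry would be dropped and the job launched through an accelerate config instead.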
Step 2: Distributed Environment Setup
The launcher detects distributed training requirements and configures the multi-process environment. For torchrun-based launching, it sets up process groups, assigns ranks, and initializes the communication backend. DeepSpeed or FSDP initialization is deferred to the trainer.
What happens:
- The launcher detects FORCE_TORCHRUN or multi-GPU settings and launches via torchrun
- Process group initialization establishes NCCL communication between GPUs
- Each process receives its rank, world size, and local rank assignments
- For multi-node training, the master address and port are configured
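The rank and address assignments above reach each worker as environment variables when launching through torchrun. The sketch below reads them into a small record; the DistEnv class and read_dist_env helper are illustrative names, not part of LLaMA Factory, and the fallback values assume a single-process run.

```python
import os
from dataclasses import dataclass


@dataclass
class DistEnv:
    """Per-process view of the distributed job (illustrative helper)."""
    rank: int          # global index of this process across all nodes
    local_rank: int    # index of this process on its own node (selects the GPU)
    world_size: int    # total number of processes in the job
    master_addr: str   # host that coordinates process group setup
    master_port: int   # port on the master host


def read_dist_env(env=None):
    """Read the variables torchrun exports; fall back to single-process defaults."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", 29500)),
    )


# Example: the second GPU on the second node of a 2-node, 4-GPU-per-node job.
example = read_dist_env({
    "RANK": "5",
    "LOCAL_RANK": "1",
    "WORLD_SIZE": "8",
    "MASTER_ADDR": "10.0.0.1",
    "MASTER_PORT": "29501",
})
print(example)
```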
Step 3: Data Loading and Preprocessing
Load and preprocess the training dataset identically to the LoRA SFT workflow. The data pipeline produces tokenized sequences with label masking for the SFT stage. Dataset sharding across distributed workers is handled automatically by the trainer's distributed sampler.
Key considerations:
- Data preprocessing is identical to LoRA SFT (same templates, processors, and collators)
- The distributed sampler ensures each GPU processes a unique subset of the data
- Preprocessing can be parallelized across worker processes with the preprocessing_num_workers parameter
- Dataset caching avoids redundant preprocessing across training runs
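Two ideas from this step can be sketched in a few lines: label masking, where prompt tokens are excluded from the loss, and per-rank sharding of example indices. This is a simplified illustration, not the library's implementation; build_labels and shard_indices are hypothetical helpers, and a real distributed sampler also shuffles and pads the index list so every rank gets the same number of examples.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt so only the response is trained on."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def shard_indices(num_examples, rank, world_size):
    """Give each rank a strided, non-overlapping slice of the example indices."""
    return list(range(num_examples))[rank::world_size]


input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]

shards = [shard_indices(10, rank, 4) for rank in range(4)]
print(shards)     # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```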
Step 4: Model Loading with Distributed Strategy
Load the full model and initialize the distributed training strategy. For DeepSpeed ZeRO-3, model parameters are partitioned across GPUs during loading. For FSDP, the model is sharded after loading. All parameters are set as trainable.
What happens:
- The model is loaded via AutoModelForCausalLM with the configured precision (bf16/fp16)
- For ZeRO-3: parameters are partitioned across GPUs, with each GPU holding only a shard
- For FSDP: the model is wrapped with FullyShardedDataParallel after loading
- All parameters are marked as trainable (no frozen layers)
- Gradient checkpointing is configured to reduce memory usage
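To see why partitioning matters, a back-of-the-envelope estimate of per-GPU model-state memory can help. The sketch below assumes mixed-precision training with Adam and the commonly used accounting of 2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for fp32 optimizer states (master weights, momentum, variance). It ignores activations, buffers, and fragmentation, so real usage will be higher; the function name is illustrative.

```python
def per_gpu_model_state_gb(num_params, num_gpus, zero_stage):
    """Estimate model-state memory per GPU in GB (10^9 bytes) for a given ZeRO stage."""
    params = 2 * num_params   # 16-bit weights
    grads = 2 * num_params    # 16-bit gradients
    optim = 12 * num_params   # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:
        optim /= num_gpus     # stage 1 and above: optimizer states are partitioned
    if zero_stage >= 2:
        grads /= num_gpus     # stage 2 and above: gradients are partitioned too
    if zero_stage >= 3:
        params /= num_gpus    # stage 3: parameters are partitioned as well
    return (params + grads + optim) / 1e9


# Example: a 7-billion-parameter model on 8 GPUs.
for stage in (0, 2, 3):
    print(stage, per_gpu_model_state_gb(7e9, 8, stage))
```

Under these assumptions the estimate drops from 112 GB per GPU with no partitioning to 26.25 GB at stage 2 and 14 GB at stage 3, which is why ZeRO-3 is the usual choice once the model no longer fits on one device.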
Step 5: Training
Execute the supervised fine-tuning loop with distributed training. Each GPU processes its data shard, computes local gradients, and synchronizes them with the other GPUs through collective operations (all-reduce under plain data parallelism, reduce-scatter when gradients are partitioned). DeepSpeed handles optimizer state partitioning and gradient reduction across the distributed setup.
What happens:
- Training proceeds with synchronized gradient updates across all GPUs
- DeepSpeed ZeRO manages optimizer state partitioning and gradient reduction
- Mixed precision training (bf16) reduces memory and increases throughput
- Gradient accumulation allows effective batch sizes larger than per-GPU memory permits
- CPU offloading can be enabled for optimizer states or parameters when GPU memory is tight
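The offloading and batch-size points above can be sketched as follows. The first helper builds a ZeRO-3 DeepSpeed configuration as a Python dict with optional CPU offloading; the "auto" values assume the Hugging Face Trainer integration, which fills them in from the training arguments. This is an illustrative configuration, not a copy of the ds_z3_config.json shipped with the project, and both function names are hypothetical.

```python
import json


def build_zero3_config(offload_optimizer=False, offload_params=False):
    """Return a ZeRO-3 DeepSpeed config dict, optionally offloading state to CPU."""
    zero = {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Gather the sharded 16-bit weights so a full model can be saved.
        "stage3_gather_16bit_weights_on_model_save": True,
    }
    if offload_optimizer:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    if offload_params:
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "bf16": {"enabled": "auto"},
        "zero_optimization": zero,
    }


def effective_batch_size(per_device_batch, accumulation_steps, world_size):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * accumulation_steps * world_size


ds_config = build_zero3_config(offload_optimizer=True)
print(json.dumps(ds_config, indent=2))
print(effective_batch_size(1, 2, 8))  # 16
```

Offloading trades GPU memory for host-to-device transfer time, so it is usually enabled only when the non-offloaded configuration runs out of memory.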
Step 6: Save Full Model
Save the complete fine-tuned model weights. For distributed training, the model must be gathered from all shards before saving. The saved model is a complete standalone model that can be loaded directly without any adapter.
Key considerations:
- DeepSpeed ZeRO-3 requires gathering all parameter shards to rank 0 for saving
- The output is a complete model (same size as the original, typically multi-GB)
- Checkpointing saves distributed state for resumable training
- The saved model can be used directly for inference without any adapter merging step
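A quick sanity check after saving is to confirm that the output directory holds a standalone model rather than an adapter. The sketch below relies on common Hugging Face file-naming conventions (config.json, *.safetensors or pytorch_model*.bin for weights, adapter_config.json for PEFT adapters); looks_like_full_model is a hypothetical helper, and the temporary directory only simulates a save so the example runs without a real checkpoint.

```python
import tempfile
from pathlib import Path


def looks_like_full_model(output_dir):
    """Heuristic: a model config and weight files are present, and no adapter config."""
    root = Path(output_dir)
    has_config = (root / "config.json").is_file()
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("pytorch_model*.bin"))
    has_adapter = (root / "adapter_config.json").is_file()
    return has_config and has_weights and not has_adapter


# Simulate an output directory with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text("{}")
    (root / "model.safetensors").write_bytes(b"")
    full_model_ok = looks_like_full_model(root)

    # Adding an adapter config makes the directory look like a LoRA output instead.
    (root / "adapter_config.json").write_text("{}")
    adapter_dir_ok = looks_like_full_model(root)

print(full_model_ok, adapter_dir_ok)
```

The check only inspects file names; it does not verify that the weights are complete or loadable.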