Workflow:Hiyouga LLaMA Factory Full Parameter SFT

Knowledge Sources	LLaMA-Factory, LLaMA-Factory Docs, DeepSpeed Documentation
Domains	LLMs, Fine_Tuning, SFT, Distributed_Training
Last Updated	2026-02-06 19:00 GMT

Overview

End-to-end process for full-parameter supervised fine-tuning of large language models using distributed training with DeepSpeed ZeRO or FSDP.

Description

This workflow covers full-parameter fine-tuning, in which all model weights are updated during training. Unlike LoRA, which adds small adapter matrices, full fine-tuning modifies every parameter in the model; this can yield higher quality but requires significantly more GPU memory and compute. To make this feasible for large models, the workflow relies on distributed training strategies. DeepSpeed ZeRO is used at stage 0, 2, or 3: stage 0 partitions nothing (plain data parallelism), stage 2 partitions optimizer states and gradients across GPUs, and stage 3 additionally partitions the parameters themselves. FSDP provides PyTorch-native model sharding. Both multi-GPU and multi-node configurations are covered.
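
To make the contrast with LoRA concrete, the sketch below counts trainable parameters for a single weight matrix under both approaches. The 4096 x 4096 shape and the rank of 8 are illustrative assumptions, not values taken from LLaMA-Factory.

```python
def full_trainable(d_out: int, d_in: int) -> int:
    """Full fine-tuning updates every entry of the d_out x d_in weight matrix."""
    return d_out * d_in


def lora_trainable(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical projection matrix and LoRA rank, chosen only for illustration.
full = full_trainable(4096, 4096)
lora = lora_trainable(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

Under these assumptions the full update touches 256 times as many parameters for this one matrix, which is where both the extra capacity and the extra memory cost come from.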

Usage

Execute this workflow when maximum model quality is required and sufficient GPU resources are available (typically multiple GPUs whose aggregate memory comfortably exceeds the size of the weights, since gradients, optimizer states, and activations must also fit). Full fine-tuning is preferred when the task domain differs significantly from the pre-training data, when the full model will be deployed as-is (no adapter overhead), or when LoRA's capacity is insufficient for the task complexity.
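
A rough way to judge whether the available GPUs are sufficient is the standard mixed-precision Adam accounting used in the ZeRO literature: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for fp32 optimizer states (master weights, momentum, variance). The sketch below applies that accounting per ZeRO stage. It ignores activations, buffers, and fragmentation, and the 7B-parameter, 8-GPU setup is an assumed example.

```python
def model_state_gb(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and Adam states."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params   # 16-bit weights
    grads = 2.0 * num_params    # 16-bit gradients
    optim = 12.0 * num_params   # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:
        optim /= num_gpus       # stage 1 and above: optimizer states are partitioned
    if zero_stage >= 2:
        grads /= num_gpus       # stage 2 and above: gradients are partitioned too
    if zero_stage >= 3:
        params /= num_gpus      # stage 3: parameters are partitioned as well
    return (params + grads + optim) / 1e9


# Assumed example: a 7B-parameter model on 8 GPUs.
for stage in (0, 2, 3):
    print(f"ZeRO-{stage}: {model_state_gb(7e9, 8, stage):.2f} GB per GPU")
```

For this assumed setup the estimate is about 112 GB per GPU at stage 0, 26 GB at stage 2, and 14 GB at stage 3, before any activation memory is counted.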

Execution Steps

Step 1: Configuration

Define the full fine-tuning job in a YAML configuration that specifies finetuning_type: full, the DeepSpeed or FSDP configuration, multi-GPU settings, and standard training hyperparameters. A DeepSpeed ZeRO-3 configuration is typically required when the model and its training state do not fit in a single GPU's memory.

Key considerations:

Set finetuning_type: full to enable full parameter training
Include a DeepSpeed config (e.g., deepspeed: examples/deepspeed/ds_z3_config.json)
Set the FORCE_TORCHRUN=1 environment variable to enable distributed launching
Use a lower learning rate than for LoRA (typically 1e-5 to 5e-5)
For FSDP, reference an accelerate config instead of DeepSpeed
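
The considerations above can be sketched as a small script that writes a training config and builds the launch command. The keys follow the style of the example configs shipped with LLaMA-Factory, but the model name, dataset, template, output path, and numeric values are placeholders to adapt, not recommendations.

```python
from pathlib import Path

# Placeholder values throughout: swap in your own model, dataset, and paths.
CONFIG_YAML = """\
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: alpaca_en_demo
template: llama3
cutoff_len: 2048
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/full/sft

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
"""


def write_config(path) -> Path:
    """Write the sketch config to disk and return its path."""
    path = Path(path)
    path.write_text(CONFIG_YAML, encoding="utf-8")
    return path


def launch_command(config_path) -> str:
    """Shell command that launches the job through torchrun."""
    return f"FORCE_TORCHRUN=1 llamafactory-cli train {config_path}"


print(launch_command("full_sft.yaml"))
```

The learning rate is written as 1.0e-5 rather than 1e-5 because YAML 1.1 parsers such as PyYAML read an exponent without a decimal point as a string.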

Step 2: Distributed Environment Setup

The launcher detects distributed training requirements and configures the multi-process environment. For torchrun-based launching, it sets up process groups, assigns ranks, and initializes the communication backend. DeepSpeed or FSDP initialization is deferred to the trainer.

What happens:

The launcher detects FORCE_TORCHRUN or multi-GPU settings and launches via torchrun
Process group initialization establishes NCCL communication between GPUs
Each process receives its rank, world size, and local rank assignments
For multi-node training, the master address and port are configured
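
As a sketch of what each worker process sees, torchrun exports the rank assignments and rendezvous address as environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). The helper below only reads them; the DistEnv class and read_dist_env function are hypothetical names for illustration, and the fallbacks describe a single-process run.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    rank: int          # global index of this process
    local_rank: int    # index of this process on its own node (selects the GPU)
    world_size: int    # total number of processes across all nodes
    master_addr: str   # host used for rendezvous
    master_port: int   # port used for rendezvous


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read the variables torchrun sets, falling back to a single-process setup."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


print(read_dist_env())
```

In a real worker these values are consumed when the process group is initialized with the NCCL backend; that call is omitted here so the sketch runs without GPUs.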

Step 3: Data Loading and Preprocessing

Load and preprocess the training dataset identically to the LoRA SFT workflow. The data pipeline produces tokenized sequences with label masking for the SFT stage. Dataset sharding across distributed workers is handled automatically by the trainer's distributed sampler.
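
Label masking for SFT can be sketched as follows: prompt tokens receive an ignore value so that only response tokens contribute to the loss. The value -100 is the default ignore_index of PyTorch's cross-entropy loss; the token IDs and the build_sft_example helper are made up for illustration and do not reproduce LLaMA-Factory's actual processor.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response, masking the prompt out of the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# Hypothetical token IDs: three prompt tokens followed by two response tokens.
example = build_sft_example([101, 102, 103], [201, 202])
print(example["labels"])
```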

Key considerations:

Data preprocessing is identical to LoRA SFT (same templates, processors, and collators)
The distributed sampler ensures each GPU processes a unique subset of the data
Preprocessing can be parallelized across worker processes with the preprocessing_num_workers parameter
Dataset caching avoids redundant preprocessing across training runs
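
The way a distributed sampler splits a dataset can be sketched with strided indexing, in the spirit of PyTorch's DistributedSampler without shuffling. This is an illustration of the idea, not the trainer's actual code. One detail it makes visible: when the dataset size is not divisible by the number of GPUs, a few samples are repeated so that every rank receives the same number of items.

```python
import math


def shard_indices(num_samples: int, rank: int, world_size: int) -> list:
    """Return the dataset indices assigned to one rank (no shuffling)."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    # Wrap around to the start of the dataset to pad up to an even split.
    padded = [i % num_samples for i in range(total)]
    # Each rank takes every world_size-th index, starting at its own rank.
    return padded[rank:total:world_size]


# Assumed example: 10 samples spread over 4 GPUs.
for r in range(4):
    print(r, shard_indices(10, r, 4))
```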

Step 4: Model Loading with Distributed Strategy

Load the full model and initialize the distributed training strategy. For DeepSpeed ZeRO-3, model parameters are partitioned across GPUs during loading. For FSDP, the model is sharded after loading. All parameters are set as trainable.

What happens:

The model is loaded via AutoModelForCausalLM with the configured precision (bf16/fp16)
For ZeRO-3: parameters are partitioned across GPUs, with each GPU holding only a shard
For FSDP: the model is wrapped with FullyShardedDataParallel after loading
All parameters are marked as trainable (no frozen layers)
Gradient checkpointing is configured to reduce memory usage
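
To illustrate what "each GPU holding only a shard" means, the sketch below splits each parameter's flattened elements into equal, padded chunks, one per rank. It is a conceptual model only: the function names are hypothetical, the parameter names and shapes are assumed, and DeepSpeed's real partitioning involves alignment and bookkeeping not shown here.

```python
import math


def shard_size(numel: int, world_size: int) -> int:
    """Elements held per rank when a flattened parameter is split evenly (padded)."""
    return math.ceil(numel / world_size)


def shard_slice(numel: int, rank: int, world_size: int) -> tuple:
    """Half-open [start, end) range of real (unpadded) elements owned by a rank."""
    size = shard_size(numel, world_size)
    start = min(rank * size, numel)
    end = min(start + size, numel)
    return start, end


# Assumed shapes for two parameters of a hypothetical model, sharded over 8 GPUs.
param_numels = {
    "embed_tokens.weight": 32000 * 4096,
    "q_proj.weight": 4096 * 4096,
}
for name, numel in param_numels.items():
    print(name, shard_size(numel, 8), shard_slice(numel, 7, 8))
```

During the forward and backward passes the full parameter is temporarily reassembled from these shards when a layer needs it, then released again.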

Step 5: Training

Execute the supervised fine-tuning loop with distributed training. Each GPU processes its data shard, computes local gradients, and synchronizes them with the other GPUs through collective operations (all-reduce in plain data parallelism, reduce-scatter when ZeRO partitions gradients). DeepSpeed handles optimizer state partitioning and gradient accumulation across the distributed setup.
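
The effective batch size per optimizer step is the product of the per-GPU batch size, the number of gradient accumulation steps, and the number of GPUs. The sketch below spells out that arithmetic; the function names and example values are illustrative assumptions.

```python
def effective_batch_size(per_device: int, grad_accum: int, world_size: int) -> int:
    """Number of sequences contributing to a single optimizer step."""
    return per_device * grad_accum * world_size


def accumulation_steps_for(target: int, per_device: int, world_size: int) -> int:
    """Gradient accumulation steps needed to reach a target effective batch size."""
    per_pass = per_device * world_size
    if target % per_pass != 0:
        raise ValueError("target must be a multiple of per_device * world_size")
    return target // per_pass


# Assumed example: batch size 1 per GPU, 2 accumulation steps, 8 GPUs.
print(effective_batch_size(per_device=1, grad_accum=2, world_size=8))
print(accumulation_steps_for(target=64, per_device=2, world_size=8))
```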

What happens:

Training proceeds with synchronized gradient updates across all GPUs
DeepSpeed ZeRO manages optimizer state partitioning and gradient reduction
Mixed-precision training (bf16) reduces memory use and increases throughput
Gradient accumulation allows effective batch sizes larger than per-GPU memory permits
CPU offloading can be enabled for optimizer states or parameters when GPU memory is tight
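
A ZeRO-3 configuration with optional CPU offloading might look like the sketch below, built as a Python dict and serialized to JSON. The keys are standard DeepSpeed options, and the "auto" values rely on the Hugging Face Trainer integration filling them in from the training arguments. The zero3_config helper is a hypothetical name, and the result is a minimal starting point rather than a copy of the config file shipped with LLaMA-Factory.

```python
import json


def zero3_config(offload_optimizer: bool = False, offload_param: bool = False) -> dict:
    """Build a minimal DeepSpeed ZeRO-3 config, optionally offloading to CPU."""
    zero = {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Gather the sharded 16-bit weights when the model is saved.
        "stage3_gather_16bit_weights_on_model_save": True,
    }
    if offload_optimizer:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    if offload_param:
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "bf16": {"enabled": "auto"},
        "zero_optimization": zero,
    }


print(json.dumps(zero3_config(offload_optimizer=True), indent=2))
```

Offloading trades GPU memory for host-to-device transfer time, so it is usually worth enabling only when the job would otherwise run out of memory.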

Step 6: Save Full Model

Save the complete fine-tuned model weights. With sharded training (ZeRO-3 or FSDP), the full weights must first be gathered from all shards before saving. The saved model is a complete standalone model that can be loaded directly without any adapter.

Key considerations:

DeepSpeed ZeRO-3 requires gathering all parameter shards to rank 0 for saving
The output is a complete model (same size as the original, typically multi-GB)
Checkpointing saves distributed state for resumable training
The saved model can be used directly for inference without any adapter merging step
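
The size of the saved weights can be estimated from the parameter count and the storage precision. The sketch below does that arithmetic; the 8B parameter count and the 5 GB per-file limit are assumed values, and the file count is only a lower bound because real checkpoints split at tensor boundaries.

```python
import math

BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def checkpoint_size_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate size of the saved weights in GB (weights only, no optimizer state)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_shard_files(num_params: float, dtype: str, max_shard_gb: float) -> int:
    """Lower bound on the number of weight files for a given per-file size limit."""
    return math.ceil(checkpoint_size_gb(num_params, dtype) / max_shard_gb)


# Assumed example: an 8B-parameter model saved in bf16 with a 5 GB file limit.
print(checkpoint_size_gb(8e9, "bf16"))
print(min_shard_files(8e9, "bf16", max_shard_gb=5.0))
```

Resumable training checkpoints are considerably larger than this estimate because they also store the partitioned optimizer state.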
