Workflow:Hiyouga LLaMA Factory LoRA SFT Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, LoRA, SFT |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
End-to-end process for parameter-efficient supervised fine-tuning of large language models using Low-Rank Adaptation (LoRA) via the LLaMA-Factory CLI.
Description
This workflow covers the most common LLaMA-Factory use case: adapting a pre-trained LLM to follow instructions or perform domain-specific tasks using LoRA adapters. LoRA freezes the original model weights and injects small trainable low-rank matrices into attention and feedforward layers. This cuts the number of trainable parameters by orders of magnitude and, because gradients and optimizer states are kept only for the adapters, substantially lowers memory requirements compared to full fine-tuning. The workflow spans YAML configuration, data preparation, model loading with adapter injection, supervised training, and saving the resulting adapter weights. It also covers the QLoRA variant, which combines LoRA with 4-bit quantization of the frozen base weights for even greater memory efficiency.
Usage
Execute this workflow when you have an instruction-tuning or chat dataset and need to adapt a base LLM for a specific task or domain, particularly when GPU memory is limited. This is the recommended starting point for most fine-tuning tasks in LLaMA-Factory, as it supports 100+ model architectures and requires only a YAML configuration file to run.
Execution Steps
Step 1: Configuration
Define the training job by writing a YAML configuration file specifying the model, dataset, training hyperparameters, and LoRA settings. The configuration maps to four argument groups: model arguments (model name, quantization), data arguments (dataset name, template, cutoff length), training arguments (epochs, batch size, learning rate, output directory), and finetuning arguments (LoRA rank, target modules, dropout).
Key considerations:
- Set the correct chat template for your model family (auto-detected from model name in most cases)
- Choose a LoRA rank (typically 8-64) that balances adapter capacity against memory and compute cost
- Set `lora_target: all` to apply LoRA to all linear layers, or list specific module names
- For QLoRA, add `quantization_bit: 4` and `quantization_method: bitsandbytes`
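A minimal sketch of such a configuration, built in Python and rendered as flat YAML. The key names follow LLaMA-Factory's published LoRA SFT examples; the model path, dataset name, template, and output directory are placeholder values chosen for illustration. The `render_yaml` helper is hypothetical and handles only flat scalar values.

```python
# Illustrative LoRA SFT config; all values are placeholders, not recommendations.
config = {
    # model arguments
    "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
    # finetuning arguments
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "lora",
    "lora_rank": 8,
    "lora_target": "all",
    # data arguments
    "dataset": "alpaca_en_demo",
    "template": "llama3",
    "cutoff_len": 2048,
    # training arguments
    "output_dir": "saves/llama3-8b/lora/sft",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1.0e-4,
    "num_train_epochs": 3.0,
    "bf16": True,
}

# QLoRA variant: the same config plus the two quantization keys.
qlora_config = {**config, "quantization_bit": 4, "quantization_method": "bitsandbytes"}


def render_yaml(cfg):
    """Render a flat dict of scalars as YAML (hypothetical helper, no nesting)."""
    lines = []
    for key, value in cfg.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # YAML booleans are lowercase
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


yaml_text = render_yaml(qlora_config)
print(yaml_text)
```

Once written to a file, the configuration is passed to the CLI as `llamafactory-cli train <path-to-config>.yaml`.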
Step 2: Argument Parsing and Validation
The CLI entry point parses the YAML configuration into structured dataclass objects, validates parameter compatibility, and resolves model-specific defaults. This includes checking that the training stage matches the dataset format, resolving the model's chat template from the constants registry, and configuring distributed training settings if applicable.
What happens:
- YAML is parsed by the HfArgumentParser into ModelArguments, DataArguments, Seq2SeqTrainingArguments, and FinetuningArguments
- Post-processing validates compatibility (e.g., LoRA + quantization combinations)
- The model's template name is resolved from the constants registry in extras/constants.py
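The parsing step can be pictured roughly as follows. This is a simplified sketch, not LLaMA-Factory's parser: the three dataclasses are hypothetical stand-ins with only a handful of fields, and the validation rule (quantization requires LoRA) is included as one example of a compatibility check.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Simplified, hypothetical stand-ins for the real argument dataclasses.
@dataclass
class ModelArgs:
    model_name_or_path: str
    quantization_bit: Optional[int] = None


@dataclass
class DataArgs:
    dataset: str
    template: Optional[str] = None
    cutoff_len: int = 2048


@dataclass
class FinetuningArgs:
    stage: str = "sft"
    finetuning_type: str = "lora"
    lora_rank: int = 8
    lora_target: str = "all"


def parse_config(flat, groups=(ModelArgs, DataArgs, FinetuningArgs)):
    """Route each flat key to the dataclass that declares it; reject unknown keys."""
    remaining = dict(flat)
    parsed = []
    for cls in groups:
        names = {f.name for f in fields(cls)}
        kwargs = {k: remaining.pop(k) for k in list(remaining) if k in names}
        parsed.append(cls(**kwargs))
    if remaining:
        raise ValueError(f"Unknown config keys: {sorted(remaining)}")
    return tuple(parsed)


def validate(model_args, finetuning_args):
    """Example compatibility check, simplified for illustration."""
    if model_args.quantization_bit is not None and finetuning_args.finetuning_type != "lora":
        raise ValueError("Quantized training requires LoRA adapters in this sketch.")


model_args, data_args, ft_args = parse_config({
    "model_name_or_path": "org/model",  # placeholder
    "dataset": "my_dataset",            # placeholder
    "quantization_bit": 4,
})
validate(model_args, ft_args)
```

Rejecting unknown keys at parse time surfaces typos in the YAML before any model weights are loaded.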
Step 3: Data Loading and Preprocessing
Load the training dataset(s) from local files, HuggingFace Hub, or ModelScope, then convert them to the unified internal format and tokenize them using the model's chat template. The data pipeline supports Alpaca, ShareGPT, and OpenAI formats and applies the SFT-stage processor, which builds input_ids and labels with masking so that only assistant tokens contribute to the loss.
Key considerations:
- Multiple datasets can be mixed with configurable ratios via the `dataset_dir` and `dataset` parameters
- The SFT processor masks non-assistant tokens with IGNORE_INDEX (-100) so loss is only computed on model outputs
- Sequence packing can be enabled to improve GPU utilization by combining short sequences
- Multimodal data (images, video, audio) is handled automatically through the mm_plugin system
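The label masking described above can be sketched in a few lines. The function name `build_sft_example` and the token ids are made up for illustration; in the real pipeline the ids come from the tokenizer and chat template, and truncation is handled more carefully than the simple right-truncation shown here.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def build_sft_example(turns, cutoff_len=None):
    """Concatenate (prompt_ids, response_ids) turns into input_ids and labels.

    Prompt tokens are labelled IGNORE_INDEX so only response tokens
    contribute to the loss.
    """
    input_ids, labels = [], []
    for prompt_ids, response_ids in turns:
        input_ids += prompt_ids + response_ids
        labels += [IGNORE_INDEX] * len(prompt_ids) + response_ids
    if cutoff_len is not None:
        # Simple right-truncation, used here only for illustration.
        input_ids, labels = input_ids[:cutoff_len], labels[:cutoff_len]
    return {"input_ids": input_ids, "labels": labels}


# Two conversation turns with made-up token ids.
example = build_sft_example([([1, 10, 11, 12], [20, 21, 2]), ([13, 14], [22, 2])])
print(example["input_ids"])
print(example["labels"])
```

Both lists have the same length, and every position that belongs to a user or system prompt carries -100 in `labels`.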
Step 4: Model Loading
Load the pre-trained model with the appropriate configuration, applying quantization if specified. The loader resolves the model architecture, configures attention implementation (flash attention, SDPA), sets up RoPE scaling for long contexts, and applies any model-specific patches.
What happens:
- Model configuration is loaded and patched (attention type, RoPE, vocab size)
- For QLoRA: BitsAndBytesConfig is created with 4-bit NF4 quantization and double quantization
- The model is loaded via AutoModelForCausalLM with the configured dtype and device map
- Model-specific patches are applied (Unsloth optimization, Liger kernels, LongLoRA)
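To see why 4-bit loading matters, a back-of-the-envelope estimate of weight memory helps. The sketch below assumes an 8-billion-parameter model and counts only the stored weights; it ignores quantization constants, activations, the KV cache, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 8_000_000_000  # assumed 8B-parameter model

bf16_gib = weight_memory_gib(num_params, 16)  # 16-bit weights
nf4_gib = weight_memory_gib(num_params, 4)    # 4-bit quantized weights

print(f"bf16 weights:  {bf16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the bf16 footprint, which is often the difference between fitting on a single consumer GPU and not.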
Step 5: LoRA Adapter Injection
Initialize and inject LoRA adapter matrices into the model's target layers. This creates two small trainable matrices for each target module, A (r × d_in) and B (d_out × r), with rank r much smaller than the layer dimensions. The effective weight becomes W' = W + BA (scaled by lora_alpha / r in practice), where W is frozen and only A and B are updated during training. The adapter typically adds less than 1% additional parameters.
Key considerations:
- Target modules can be specified explicitly or set to "all" for all linear layers
- Advanced variants are supported: LoRA+ (different learning rates for A/B), DoRA (weight-decomposed), PiSSA (SVD-initialized), rsLoRA (rank-stabilized)
- For QLoRA, adapters are attached to the quantized model layers
- Gradient checkpointing is configured to trade compute for memory
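The low-rank update can be illustrated with plain Python lists. This sketch follows the standard LoRA formulation, in which B starts at zero and the update is scaled by alpha / r; the dimensions, seed, and helper names are arbitrary choices for illustration and are not taken from LLaMA-Factory's code.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, where W has shape (d_out, d_in)."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def lora_param_fraction(d_out, d_in, r):
    """Ratio of adapter parameters to the frozen weight's parameters."""
    return r * (d_in + d_out) / (d_in * d_out)


random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero init

# With B = 0 the adapter is a no-op, so training starts from the base model.
W_eff = lora_effective_weight(W, A, B, alpha, r)

# Adapter size relative to a 4096 x 4096 projection at rank 8.
fraction = lora_param_fraction(4096, 4096, 8)
print(f"{fraction:.4%}")
```

For a 4096 × 4096 layer at rank 8, the adapter holds 65,536 parameters against 16,777,216 frozen ones, which is consistent with the "less than 1%" figure.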
Step 6: Training
Execute the supervised fine-tuning loop using the CustomSeq2SeqTrainer, which extends HuggingFace's Seq2SeqTrainer with custom loss functions, optimizers, and callbacks. Training proceeds for the configured number of epochs with gradient accumulation, learning rate scheduling, and periodic evaluation.
What happens:
- The trainer initializes with the model, tokenizer, data collator, and datasets
- Custom optimizers can be used (GaLore, BAdam, APOLLO, Adam-mini, Muon)
- Training typically runs with bf16 or fp16 mixed precision, as enabled in the configuration
- Callbacks handle logging to TensorBoard/W&B/SwanLab, progress reporting, and loss curve plotting
- Evaluation can compute accuracy, ROUGE, and BLEU metrics if a validation set is provided
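Two pieces of arithmetic behind the training loop are worth making concrete: the effective batch size under gradient accumulation, and the learning rate at each step. The sketch below assumes a cosine schedule with linear warmup, which is one common choice rather than the only option; the function names and numbers are illustrative.

```python
import math


def effective_batch_size(per_device, grad_accum_steps, num_devices=1):
    """Number of sequences contributing to each optimizer step."""
    return per_device * grad_accum_steps * num_devices


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Batch size 1 per GPU, 8 accumulation steps, 4 GPUs (assumed setup).
batch = effective_batch_size(per_device=1, grad_accum_steps=8, num_devices=4)

# Learning rate at every step of a 100-step run with 10 warmup steps.
schedule = [cosine_lr(s, total_steps=100, base_lr=1e-4, warmup_steps=10) for s in range(101)]
print(batch, schedule[10], schedule[100])
```

Gradient accumulation lets a small per-device batch behave like a larger one at the cost of more forward and backward passes per optimizer step.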
Step 7: Save and Checkpoint
Save the trained LoRA adapter weights to the output directory. Because the base model is not written out, the output stays compact. The tokenizer and training state are also saved so that training can be resumed.
Key considerations:
- Only adapter weights are saved (typically tens to a few hundred MB, depending on rank and target modules, vs. multiple GB for the full model)
- Training can be resumed from checkpoints
- The saved adapter can be loaded for inference or merged with the base model in a separate export step
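A rough estimate of the adapter's size follows from the shapes of the targeted linear layers. The sketch below assumes a hypothetical 32-layer decoder with hidden width 4096, key/value width 1024, and MLP width 14336, with LoRA applied to all seven linear projections per layer; the module names in the comments are conventional labels, and the actual file size also depends on the saved dtype.

```python
def lora_params(shapes, rank):
    """Adapter parameters for modules given as (d_in, d_out) pairs: r * (d_in + d_out) each."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)


# Assumed per-layer linear shapes for a hypothetical 32-layer decoder.
hidden, kv, mlp, num_layers = 4096, 1024, 14336, 32
per_layer = [
    (hidden, hidden),  # q_proj
    (hidden, kv),      # k_proj
    (hidden, kv),      # v_proj
    (hidden, hidden),  # o_proj
    (hidden, mlp),     # gate_proj
    (hidden, mlp),     # up_proj
    (mlp, hidden),     # down_proj
]

rank = 8
total = num_layers * lora_params(per_layer, rank)
size_mb_fp32 = total * 4 / 1e6  # 4 bytes per parameter in fp32

print(f"{total:,} adapter parameters, about {size_mb_fp32:.0f} MB in fp32")
```

Adapter size grows linearly with rank, so the same model at rank 64 would be eight times larger under these assumptions.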