Workflow:Hiyouga LLaMA Factory LoRA SFT Finetuning

Knowledge Sources	LLaMA-Factory LLaMA-Factory Docs Examples README
Domains	LLMs, Fine_Tuning, LoRA, SFT
Last Updated	2026-02-06 19:00 GMT

Overview

End-to-end process for parameter-efficient supervised fine-tuning of large language models using Low-Rank Adaptation (LoRA) via the LLaMA-Factory CLI.

Description

This workflow covers the most common LLaMA-Factory use case: adapting a pre-trained LLM to follow instructions or perform domain-specific tasks using LoRA adapters. LoRA freezes the original model weights and injects small trainable low-rank matrices into attention and feedforward layers, reducing memory requirements by orders of magnitude compared to full fine-tuning. The workflow spans from YAML configuration through data preparation, model loading with adapter injection, supervised training, and saving the resulting adapter weights. It also covers the QLoRA variant, which combines LoRA with 4-bit quantization for even greater memory efficiency.

Usage

Execute this workflow when you have an instruction-tuning or chat dataset and need to adapt a base LLM for a specific task or domain, particularly when GPU memory is limited. This is the recommended starting point for most fine-tuning tasks in LLaMA-Factory, as it supports 100+ model architectures and requires only a YAML configuration file to run.

Execution Steps

Step 1: Configuration

Define the training job by writing a YAML configuration file specifying the model, dataset, training hyperparameters, and LoRA settings. The configuration maps to four argument groups: model arguments (model name, quantization), data arguments (dataset name, template, cutoff length), training arguments (epochs, batch size, learning rate, output directory), and finetuning arguments (LoRA rank, target modules, dropout).

Key considerations:

Set the correct chat template for your model family (auto-detected from model name in most cases)
Choose LoRA rank (typically 8-64) balancing capacity vs. efficiency
Set lora_target: all to apply LoRA to all linear layers, or specify specific module names
For QLoRA, add quantization_bit: 4 and quantization_method: bitsandbytes

Step 2: Argument Parsing and Validation

The CLI entry point parses the YAML configuration into structured dataclass objects, validates parameter compatibility, and resolves model-specific defaults. This includes checking that the training stage matches the dataset format, resolving the model's chat template from the constants registry, and configuring distributed training settings if applicable.

What happens:

YAML is parsed by the HfArgumentParser into ModelArguments, DataArguments, Seq2SeqTrainingArguments, and FinetuningArguments
Post-processing validates compatibility (e.g., LoRA + quantization combinations)
The model's template name is resolved from the constants registry in extras/constants.py

Step 3: Data Loading and Preprocessing

Load the training dataset(s) from local files, HuggingFace Hub, or ModelScope, then convert them to the unified internal format and tokenize using the model's chat template. The data pipeline supports Alpaca, ShareGPT, and OpenAI formats, and applies the appropriate processor for the SFT stage which creates input_ids with proper label masking (only assistant tokens are trained on).

Key considerations:

Multiple datasets can be mixed with configurable ratios via dataset_dir and dataset parameters
The SFT processor masks non-assistant tokens with IGNORE_INDEX (-100) so loss is only computed on model outputs
Sequence packing can be enabled to improve GPU utilization by combining short sequences
Multimodal data (images, video, audio) is handled automatically through the mm_plugin system

Step 4: Model Loading

Load the pre-trained model with the appropriate configuration, applying quantization if specified. The loader resolves the model architecture, configures attention implementation (flash attention, SDPA), sets up RoPE scaling for long contexts, and applies any model-specific patches.

What happens:

Model configuration is loaded and patched (attention type, RoPE, vocab size)
For QLoRA: BitsAndBytesConfig is created with 4-bit NF4 quantization and double quantization
The model is loaded via AutoModelForCausalLM with the configured dtype and device map
Model-specific patches are applied (Unsloth optimization, Liger kernels, LongLoRA)

Step 5: LoRA Adapter Injection

Initialize and inject LoRA adapter matrices into the model's target layers. This creates two small trainable matrices (A and B) for each target module such that the effective weight becomes W' = W + BA, where W is frozen and only A, B are updated during training. The adapter typically adds less than 1% additional parameters.

Key considerations:

Target modules can be specified explicitly or set to "all" for all linear layers
Advanced variants are supported: LoRA+ (different learning rates for A/B), DoRA (weight-decomposed), PiSSA (SVD-initialized), rsLoRA (rank-stabilized)
For QLoRA, adapters are attached to the quantized model layers
Gradient checkpointing is configured to trade compute for memory

Step 6: Training

Execute the supervised fine-tuning loop using the CustomSeq2SeqTrainer, which extends HuggingFace's Seq2SeqTrainer with custom loss functions, optimizers, and callbacks. Training proceeds for the configured number of epochs with gradient accumulation, learning rate scheduling, and periodic evaluation.

What happens:

The trainer initializes with the model, tokenizer, data collator, and datasets
Custom optimizers can be used (GaLore, BAdam, APOLLO, Adam-mini, Muon)
Training runs with bf16/fp16 mixed precision by default
Callbacks handle logging to TensorBoard/W&B/SwanLab, progress reporting, and loss curve plotting
Evaluation can compute accuracy, ROUGE, and BLEU metrics if a validation set is provided

Step 7: Save and Checkpoint

Save the trained LoRA adapter weights (not the full model) to the output directory. For LoRA training, only the small adapter files are saved, making the output very compact. The tokenizer and training state are also saved for resumption.

Key considerations:

Only adapter weights are saved (typically a few hundred MB vs. multi-GB for the full model)
Training can be resumed from checkpoints
The saved adapter can be loaded for inference or merged with the base model in a separate export step

Execution Diagram

GitHub URL

Workflow Repository