Workflow:Hiyouga LLaMA Factory Model Export and Merging

Knowledge Sources	LLaMA-Factory, LLaMA-Factory Docs, PEFT Documentation
Domains	LLMs, Model_Export, Quantization, Deployment
Last Updated	2026-02-06 19:00 GMT

Overview

End-to-end process for merging LoRA adapter weights into a base model and exporting the result as a standalone model in HuggingFace format, optionally quantized, which can then be converted to GGUF for llama.cpp or Ollama.

Description

This workflow handles the post-training step of preparing a fine-tuned model for deployment. When using LoRA training, the adapter weights are stored separately from the base model. This workflow merges those adapter weights into the base model to produce a standalone model, then optionally applies post-training quantization (GPTQ, AWQ, AutoRound) for reduced model size and faster inference. The export function also supports generating Ollama modelfiles for local deployment and pushing models directly to HuggingFace Hub or ModelScope.

Usage

Execute this workflow after LoRA-based training is complete and you need a standalone deployment-ready model. Common scenarios include: merging LoRA adapters for frameworks that do not support adapter loading, applying GPTQ quantization for reduced memory deployment, preparing a merged checkpoint for conversion to GGUF for llama.cpp or Ollama, or publishing the merged model to HuggingFace Hub.

Execution Steps

Step 1: Configuration

Define the export job with a YAML configuration specifying the base model, adapter checkpoint path, export directory, and optional quantization settings. The configuration is passed to the llamafactory-cli export command rather than the training command.

Key considerations:

Set model_name_or_path to the original base model
Set adapter_name_or_path to the trained LoRA adapter checkpoint
Set export_dir for the output merged model path
For GPTQ quantization, set export_quantization_bit and export_quantization_dataset
Set template to match the model's chat format
Set export_legacy_format: false to save weights in safetensors format rather than legacy PyTorch .bin files
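
The settings above can be collected in one small config file. The sketch below builds such a file from Python and writes it as YAML. The model ID and all paths are placeholders, finetuning_type is an assumption not listed above, and the hand-rolled YAML writer only handles flat scalar values.

```python
from pathlib import Path

# Placeholder values: substitute your own model ID and paths.
export_config = {
    "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
    "adapter_name_or_path": "saves/llama3-8b/lora/sft",
    "template": "llama3",
    "finetuning_type": "lora",  # assumption: the adapter was trained with LoRA
    "export_dir": "output/llama3_lora_sft",
    "export_legacy_format": False,  # False -> safetensors
}


def to_flat_yaml(config: dict) -> str:
    """Render a flat dict of scalars as YAML (no nesting, no quoting)."""
    lines = []
    for key, value in config.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


def write_export_config(path: str, config: dict) -> Path:
    """Write the config to disk and return the file path."""
    target = Path(path)
    target.write_text(to_flat_yaml(config), encoding="utf-8")
    return target


if __name__ == "__main__":
    cfg_path = write_export_config("merge_lora.yaml", export_config)
    # Then run: llamafactory-cli export merge_lora.yaml
    print(cfg_path.read_text(encoding="utf-8"))
```

For a GPTQ export, export_quantization_bit and export_quantization_dataset would be added to the same dictionary.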

Step 2: Argument Parsing

Parse the export configuration and validate model-adapter compatibility. The parser verifies that the adapter was trained on the specified base model and resolves the export format settings.

What happens:

Arguments are parsed into ModelArguments and FinetuningArguments
The adapter metadata is checked for compatibility with the base model
Export format and quantization settings are validated
The device map is configured for the export process
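
One way to picture the compatibility check: PEFT records the base model identifier in the adapter's adapter_config.json under base_model_name_or_path. The sketch below compares that field with the configured base model. It is an illustration only, not LLaMA-Factory's actual parser logic, and the function name is hypothetical.

```python
import json
from pathlib import Path


def adapter_matches_base(adapter_dir: str, base_model: str) -> bool:
    """Compare the base model recorded by PEFT with the configured one."""
    config_file = Path(adapter_dir) / "adapter_config.json"
    with config_file.open(encoding="utf-8") as f:
        adapter_config = json.load(f)
    recorded = adapter_config.get("base_model_name_or_path")
    # Compare only the final path component, so a hub ID ("org/model")
    # and a local copy ("/cache/model") can still match.
    return recorded is not None and Path(recorded).name == Path(base_model).name
```

Comparing by name is a heuristic: it can flag a mismatch early, but it cannot prove that two checkpoints with the same name hold identical weights.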

Step 3: Model and Adapter Loading

Load the base model and the trained LoRA adapter weights. The model is loaded at full precision (or the training precision) to ensure accurate weight merging. Multiple adapters can be loaded if the training used adapter stacking.

What happens:

The base model is loaded at the configured precision
LoRA adapter weights are loaded from the checkpoint directory
For multiple adapters, they are loaded in sequence
The tokenizer and processor are loaded with any custom tokens added during training
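
The loading sequence can be sketched with the public Transformers and PEFT APIs. This is not LLaMA-Factory's internal loader; the function name is hypothetical, and bfloat16 is an assumed precision that should match how the adapter was trained.

```python
def load_base_with_adapters(base_model: str, adapter_paths: list[str]):
    """Load a base model, fold in all but the last adapter, attach the last one."""
    # Local imports keep this module importable without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,  # assumption: training precision was bf16
    )

    # Earlier adapters are merged one at a time, in the order given.
    for path in adapter_paths[:-1]:
        model = PeftModel.from_pretrained(model, path)
        model = model.merge_and_unload()

    # The final adapter stays attached until the merge step.
    if adapter_paths:
        model = PeftModel.from_pretrained(model, adapter_paths[-1])
    return model, tokenizer
```

If training added custom tokens, the tokenizer would typically be loaded from the directory where the updated tokenizer was saved rather than from the base model.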

Step 4: Weight Merging

Merge the LoRA adapter matrices into the base model weights. For each adapted layer, the effective weight W' = W + (alpha / r) BA is computed and replaces the original weight W, where r is the LoRA rank and alpha is the LoRA scaling hyperparameter. After merging, the model no longer requires the adapter files and behaves as a standard pre-trained model.

What happens:

For each LoRA-adapted layer, the low-rank matrices B and A are multiplied, scaled, and added to the original weight
The PEFT wrapper is removed, leaving a standard model
For PiSSA adapters, the residual model is used as the base for merging
The merged model's architecture is identical to the original base model
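
The merge arithmetic for a single layer can be worked through on a toy example. The sketch below uses plain Python lists instead of tensors and applies the standard LoRA scaling lora_alpha / rank; the matrix values are made up for illustration.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_b, lora_a, lora_alpha, rank):
    """Return W + (lora_alpha / rank) * (B @ A)."""
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy layer: W is 2x3, B is 2x1, A is 1x3, so the rank is 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0, -0.5]]

merged = merge_lora(W, B, A, lora_alpha=2, rank=1)
print(merged)  # [[2.0, 0.0, -1.0], [2.0, 1.0, -2.0]]
```

The merged matrix has the same shape as W, which is why the merged model keeps the base model's architecture.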

Step 5: Post-Training Quantization (Optional)

Apply post-training quantization to reduce the merged model's size and memory footprint. GPTQ quantization uses a calibration dataset to determine optimal quantization parameters for each layer.

Key considerations:

GPTQ: Requires a calibration dataset, produces models compatible with AutoGPTQ and Transformers
AWQ: Activation-aware quantization with similar calibration requirements
AutoRound: Automatic quantization with built-in calibration
Quantization bit width is typically 4 or 8 bits
The quantized model can be significantly smaller (e.g., a 7B model shrinks from ~14GB at 16-bit precision to ~4GB at 4-bit)
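
The size figures follow from simple arithmetic: parameter count times bits per weight, divided by eight bits per byte. The sketch below gives that lower bound. Real quantized checkpoints are somewhat larger because they also store per-group scales and zero points, and some layers may stay unquantized.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_size_gb(7e9, bits):.1f} GB")
```

For a 7B model this gives about 14 GB at 16-bit and 3.5 GB at 4-bit before quantization metadata is added.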

Step 6: Export and Save

Save the final model in the requested format to the export directory. The export function also handles format conversion and optional model hub uploading.

What happens:

The merged (and optionally quantized) model is saved in safetensors or PyTorch format
The tokenizer and configuration files are saved alongside the model
An Ollama modelfile is generated if export_ollama_param is set
If configured, the model is pushed to HuggingFace Hub or ModelScope
The exported model is ready for direct loading by any HuggingFace-compatible framework
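
A quick sanity check on the export directory can confirm that the expected pieces were written. The sketch below looks for the standard HuggingFace file names; the Modelfile name for the Ollama output is an assumption, and the function name is hypothetical.

```python
from pathlib import Path


def summarize_export_dir(export_dir: str) -> dict:
    """Report which standard artifacts exist in an export directory."""
    root = Path(export_dir)
    return {
        "config": (root / "config.json").is_file(),
        "tokenizer_config": (root / "tokenizer_config.json").is_file(),
        "safetensors": sorted(p.name for p in root.glob("*.safetensors")),
        "pytorch_bin": sorted(p.name for p in root.glob("*.bin")),
        # Assumption: the Ollama file is written as "Modelfile".
        "modelfile": (root / "Modelfile").is_file(),
    }
```

An export with safetensors enabled should list at least one entry under safetensors and none under pytorch_bin.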

GitHub URL

Workflow Repository