Workflow:Hiyouga LLaMA Factory Model Export and Merging
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Export, Quantization, Deployment |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
End-to-end process for merging LoRA adapter weights into a base model and exporting the resulting standalone model in HuggingFace, GGUF, or quantized form.
Description
This workflow handles the post-training step of preparing a fine-tuned model for deployment. When using LoRA training, the adapter weights are stored separately from the base model. This workflow merges those adapter weights into the base model to produce a standalone model, then optionally applies post-training quantization (GPTQ, AWQ, AutoRound) for reduced model size and faster inference. The export function also supports generating Ollama modelfiles for local deployment and pushing models directly to HuggingFace Hub or ModelScope.
Usage
Execute this workflow after LoRA-based training is complete and you need a standalone deployment-ready model. Common scenarios include: merging LoRA adapters for frameworks that do not support adapter loading, applying GPTQ quantization for reduced memory deployment, exporting to GGUF format for llama.cpp or Ollama, or publishing the merged model to HuggingFace Hub.
Execution Steps
Step 1: Configuration
Define the export job with a YAML configuration specifying the base model, adapter checkpoint path, export directory, and optional quantization settings. The configuration is passed to the llamafactory-cli export command rather than to llamafactory-cli train.
Key considerations:
- Set model_name_or_path to the original base model
- Set adapter_name_or_path to the trained LoRA adapter checkpoint
- Set export_dir for the output merged model path
- For GPTQ quantization, set export_quantization_bit and export_quantization_dataset
- Set template to match the model's chat format
- export_legacy_format: false uses safetensors format
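A minimal sketch of such a configuration, generated from Python so that it can be produced per run. The keys are the ones listed above plus finetuning_type, which is an assumption here. The model ID, adapter path, template name, and output directory are hypothetical placeholders, and the small YAML renderer only handles flat string, integer, and boolean values.

```python
from pathlib import Path

# Hypothetical values: replace the model, adapter path, template, and output dir.
export_config = {
    "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
    "adapter_name_or_path": "saves/llama3-8b/lora/sft",
    "template": "llama3",
    "finetuning_type": "lora",  # assumption: not listed in the considerations above
    "export_dir": "output/llama3_lora_sft",
    "export_legacy_format": False,  # False -> safetensors output
}


def render_flat_yaml(config: dict) -> str:
    """Render a flat dict as YAML; booleans become lowercase true/false."""
    lines = []
    for key, value in config.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


def write_config(config: dict, path: str) -> Path:
    """Write the rendered YAML to disk and return the file path."""
    target = Path(path)
    target.write_text(render_flat_yaml(config), encoding="utf-8")
    return target


yaml_text = render_flat_yaml(export_config)
print(yaml_text)
```

The written file would then be passed to the export command, for example llamafactory-cli export merge_lora.yaml, where the file name is hypothetical.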
Step 2: Argument Parsing
Parse the export configuration and validate the model-adapter compatibility. The parser verifies that the adapter was trained on the specified base model and resolves the export format settings.
What happens:
- Arguments are parsed into ModelArguments and FinetuningArguments
- The adapter metadata is checked for compatibility with the base model
- Export format and quantization settings are validated
- The device map is configured for the export process
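The kind of validation described above can be illustrated with a small stand-alone check. This is a hypothetical sketch, not LLaMA Factory's actual parser: the function name, the set of required keys, and the error messages are all assumptions chosen for illustration.

```python
# Hypothetical pre-flight check; LLaMA Factory's real parser is more thorough.
REQUIRED_KEYS = ("model_name_or_path", "adapter_name_or_path", "export_dir", "template")


def validate_export_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_KEYS:
        if not config.get(key):
            problems.append(f"missing required key: {key}")

    bit = config.get("export_quantization_bit")
    if bit is not None:
        if not isinstance(bit, int) or bit <= 0:
            problems.append("export_quantization_bit must be a positive integer")
        # GPTQ needs calibration data, so the dataset must accompany the bit width.
        if not config.get("export_quantization_dataset"):
            problems.append(
                "export_quantization_bit is set but export_quantization_dataset is missing"
            )
    return problems


base = {
    "model_name_or_path": "base-model",      # hypothetical
    "adapter_name_or_path": "saves/adapter",  # hypothetical
    "export_dir": "output/merged",            # hypothetical
    "template": "llama3",                     # hypothetical
}
print(validate_export_config(base))
print(validate_export_config({**base, "export_quantization_bit": 4}))
```

Running a check like this before launching the export catches a missing calibration dataset early, rather than after the base model has already been loaded.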
Step 3: Model and Adapter Loading
Load the base model and the trained LoRA adapter weights. The model is loaded at full precision (or the training precision) to ensure accurate weight merging. Multiple adapters can be loaded if the training used adapter stacking.
What happens:
- The base model is loaded at the configured precision
- LoRA adapter weights are loaded from the checkpoint directory
- For multiple adapters, they are loaded in sequence
- The tokenizer and processor are loaded with any custom tokens added during training
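The loading sequence can be sketched with the transformers and peft libraries directly. This is an illustration under stated assumptions rather than LLaMA Factory's internal loader: the helper names are hypothetical, multiple adapters are assumed to be given as a comma-separated string, the model is loaded at the library's default precision, and the tokenizer is taken from the base model (so no tokens are assumed to have been added during training).

```python
def split_adapters(adapter_name_or_path: str) -> list:
    """Split a comma-separated adapter string into an ordered list of paths."""
    return [part.strip() for part in adapter_name_or_path.split(",") if part.strip()]


def load_base_and_adapters(model_name_or_path: str, adapter_name_or_path: str):
    """Load the base model, then apply and merge each adapter in sequence."""
    # Imported lazily so the heavy dependencies are only needed when loading.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the base tokenizer is sufficient (no custom tokens were added).
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

    for adapter_path in split_adapters(adapter_name_or_path):
        # Wrap the current model with the adapter, then fold it into the weights
        # so the next adapter is applied on top of the merged result.
        model = PeftModel.from_pretrained(model, adapter_path)
        model = model.merge_and_unload()
    return model, tokenizer


print(split_adapters("saves/stage1, saves/stage2"))
```

Merging each adapter before loading the next keeps only one PEFT wrapper in memory at a time and preserves the order in which the adapters were stacked.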
Step 4: Weight Merging
Merge the LoRA adapter matrices into the base model weights. For each adapted layer, the effective weight W' = W + (alpha / r) BA is computed and replaces the original weight W, where r is the LoRA rank and alpha is the LoRA scaling hyperparameter (lora_alpha). After merging, the model no longer requires the adapter files and behaves as a standard pre-trained model.
What happens:
- For each LoRA-adapted layer, the low-rank matrices B and A are multiplied, scaled by alpha / r, and added to the original weight
- The PEFT wrapper is removed, leaving a standard model
- For PiSSA adapters, the residual model is used as the base for merging
- The merged model's architecture is identical to the original base model
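The merge arithmetic can be checked on a toy layer. The sketch below uses plain Python lists and made-up numbers. PEFT-style LoRA scales the low-rank update by lora_alpha / r, so that factor is included here; setting lora_alpha equal to r recovers the unscaled W + BA form. The point is that the merged weight produces the same output as the base weight plus the adapter path.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add_scaled(x, y, scale=1.0):
    """Return x + scale * y, element-wise."""
    return [[a + scale * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


# Toy layer: 2 outputs, 3 inputs, LoRA rank r = 1 (all values are illustrative).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, shape 2 x 3
B = [[0.5], [-1.0]]                       # LoRA B, shape 2 x r
A = [[1.0, 2.0, 0.0]]                     # LoRA A, shape r x 3
lora_alpha, r = 2, 1
scaling = lora_alpha / r

# Merge: W' = W + (alpha / r) * B A
W_merged = add_scaled(W, matmul(B, A), scale=scaling)

# Compare the two forward paths on one input column vector.
x = [[1.0], [1.0], [1.0]]
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)
y_merged = matmul(W_merged, x)

print(W_merged)
print(y_adapter, y_merged)
```

Because the two paths agree, the adapter matrices can be discarded after the merge and the layer keeps its original shape.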
Step 5: Post-Training Quantization (Optional)
Apply post-training quantization to reduce the merged model's size and memory footprint. GPTQ quantization uses a calibration dataset to determine optimal quantization parameters for each layer.
Key considerations:
- GPTQ: Requires a calibration dataset, produces models compatible with AutoGPTQ and Transformers
- AWQ: Activation-aware quantization with similar calibration requirements
- AutoRound: Automatic quantization with built-in calibration
- Quantization bit width is typically 4 or 8 bits
- The quantized model can be significantly smaller (e.g., a 7B model shrinks from ~14GB in 16-bit precision to ~4GB at 4-bit)
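The size figures above follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The helper below is a hypothetical back-of-the-envelope estimate that counts raw weights only. Real quantized checkpoints tend to come out somewhat larger than this lower bound, since they also store quantization metadata and may keep some layers unquantized.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


params_7b = 7e9
size_16bit = weight_size_gb(params_7b, 16)
size_8bit = weight_size_gb(params_7b, 8)
size_4bit = weight_size_gb(params_7b, 4)

print(f"16-bit: {size_16bit:.1f} GB")
print(f" 8-bit: {size_8bit:.1f} GB")
print(f" 4-bit: {size_4bit:.1f} GB")
```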
Step 6: Export and Save
Save the final model in the requested format to the export directory. The export function also handles format conversion and optional model hub uploading.
What happens:
- The merged (and optionally quantized) model is saved in safetensors or PyTorch format
- The tokenizer and configuration files are saved alongside the model
- An Ollama modelfile is generated if export_ollama_param is set
- If configured, the model is pushed to HuggingFace Hub or ModelScope
- The exported model is ready for direct loading by any HuggingFace-compatible framework
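After the export finishes, a quick sanity check of the output directory can confirm that the pieces listed above are present. The helper below is a hypothetical sketch: it assumes the usual HuggingFace file names (config.json, tokenizer_config.json, and weights ending in .safetensors or .bin) and is demonstrated on a temporary directory with placeholder files rather than a real export.

```python
import tempfile
from pathlib import Path


def check_export_dir(export_dir) -> list:
    """Return the names of expected artifacts that are missing from export_dir."""
    root = Path(export_dir)
    missing = []
    for name in ("config.json", "tokenizer_config.json"):
        if not (root / name).is_file():
            missing.append(name)
    # Weights may be sharded, so match by extension instead of an exact name.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        missing.append("weights (*.safetensors or *.bin)")
    return missing


# Demonstration on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    missing_before = check_export_dir(tmp)
    for name in ("config.json", "tokenizer_config.json", "model.safetensors"):
        (Path(tmp) / name).write_text("{}", encoding="utf-8")
    missing_after = check_export_dir(tmp)

print(missing_before)
print(missing_after)
```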