

Workflow:Hiyouga LLaMA Factory Model Export and Merging

From Leeroopedia


Knowledge Sources
Domains: LLMs, Model_Export, Quantization, Deployment
Last Updated: 2026-02-06 19:00 GMT

Overview

End-to-end process for merging LoRA adapter weights into a base model and exporting the result in HuggingFace, GGUF, or quantized form.

Description

This workflow handles the post-training step of preparing a fine-tuned model for deployment. With LoRA training, the adapter weights are stored separately from the base model. This workflow merges those adapter weights into the base model to produce a standalone model, then optionally applies post-training quantization (GPTQ, AWQ, or AutoRound) to reduce model size and speed up inference. The export function can also generate Ollama modelfiles for local deployment and push models directly to HuggingFace Hub or ModelScope.

Usage

Execute this workflow after LoRA-based training is complete and you need a standalone deployment-ready model. Common scenarios include: merging LoRA adapters for frameworks that do not support adapter loading, applying GPTQ quantization for reduced memory deployment, exporting to GGUF format for llama.cpp or Ollama, or publishing the merged model to HuggingFace Hub.

Execution Steps

Step 1: Configuration

Define the export job with a YAML configuration specifying the base model, adapter checkpoint path, export directory, and optional quantization settings. The configuration is passed to the llamafactory-cli export command rather than to the training command.

Key considerations:

  • Set model_name_or_path to the original base model
  • Set adapter_name_or_path to the trained LoRA adapter checkpoint
  • Set export_dir for the output merged model path
  • For GPTQ quantization, set export_quantization_bit and export_quantization_dataset
  • Set template to match the model's chat format
  • Set export_legacy_format: false to save the weights in safetensors format rather than the legacy PyTorch format
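A minimal sketch of how such a configuration could be generated and handed to the CLI, assuming a flat key/value YAML file is sufficient for a merge-only job. The paths, the template value, and the helper names `render_export_config` and `export_command` are hypothetical placeholders; only the option names and the `llamafactory-cli export` command come from the text above.

```python
import tempfile
from pathlib import Path


def render_export_config(options: dict) -> str:
    """Render a flat dict as simple `key: value` YAML lines."""
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):
            # YAML booleans are written in lowercase.
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"


def export_command(config_path: str) -> list:
    """Build the CLI invocation; run it with subprocess.run(cmd, check=True)."""
    return ["llamafactory-cli", "export", config_path]


# Placeholder values: replace with your own model, adapter, and output paths.
merge_options = {
    "model_name_or_path": "path/to/base-model",
    "adapter_name_or_path": "path/to/lora-checkpoint",
    "template": "your_template_name",
    "export_dir": "path/to/merged-model",
    "export_legacy_format": False,
}

config_text = render_export_config(merge_options)
config_path = Path(tempfile.mkdtemp()) / "merge_lora.yaml"
config_path.write_text(config_text, encoding="utf-8")

print(config_text)
print(" ".join(export_command(str(config_path))))
```

The sketch only writes the file and prints the command; actually launching the export requires LLaMA Factory to be installed and the placeholder paths to point at real checkpoints.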

Step 2: Argument Parsing

Parse the export configuration and validate model-adapter compatibility. The parser verifies that the adapter was trained on the specified base model and resolves the export format settings.

What happens:

  • Arguments are parsed into ModelArguments and FinetuningArguments
  • The adapter metadata is checked for compatibility with the base model
  • Export format and quantization settings are validated
  • The device map is configured for the export process
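The kind of checks described above can be illustrated with a small standalone validator. This is a hypothetical sketch, not LLaMA Factory's actual parser: the function name `validate_export_config`, the choice of required keys, and the allowed bit widths are assumptions made for illustration.

```python
def validate_export_config(
    cfg: dict,
    required=("model_name_or_path", "export_dir", "template"),  # assumed set
    allowed_bits=(4, 8),  # assumed set, based on the typical widths
) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in required:
        if not cfg.get(key):
            problems.append(f"missing required option: {key}")

    bits = cfg.get("export_quantization_bit")
    if bits is not None:
        if bits not in allowed_bits:
            problems.append(f"unsupported export_quantization_bit: {bits}")
        if not cfg.get("export_quantization_dataset"):
            problems.append(
                "export_quantization_dataset is required when "
                "export_quantization_bit is set"
            )
    return problems


good = {
    "model_name_or_path": "path/to/base-model",
    "export_dir": "path/to/out",
    "template": "your_template_name",
}
bad = {"model_name_or_path": "path/to/base-model", "export_quantization_bit": 3}

print(validate_export_config(good))
for problem in validate_export_config(bad):
    print(problem)
```

Collecting all problems in a list, rather than raising on the first one, lets a user fix a configuration in a single pass.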

Step 3: Model and Adapter Loading

Load the base model and the trained LoRA adapter weights. The model is loaded at full precision (or the training precision) to ensure accurate weight merging. Multiple adapters can be loaded if the training used adapter stacking.

What happens:

  • The base model is loaded at the configured precision
  • LoRA adapter weights are loaded from the checkpoint directory
  • For multiple adapters, they are loaded in sequence
  • The tokenizer and processor are loaded with any custom tokens added during training
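Outside of LLaMA Factory, the same loading sequence can be approximated with the `transformers` and `peft` libraries. The sketch below assumes both packages are installed, that adapters are given as a comma-separated string, and that each adapter is folded into the weights (the merge described in Step 4) before the next one is attached. The function names are hypothetical and this is not necessarily how LLaMA Factory implements the step.

```python
def split_adapter_paths(value: str) -> list:
    """Split a comma-separated adapter string into a clean list of paths."""
    return [part.strip() for part in value.split(",") if part.strip()]


def load_and_merge(base_path: str, adapter_paths: list):
    """Load the base model, then attach and fold each adapter in order."""
    # Imported lazily so split_adapter_paths works without these packages.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_path)
    # torch_dtype="auto" keeps the dtype recorded in the checkpoint.
    model = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype="auto")

    for path in adapter_paths:
        model = PeftModel.from_pretrained(model, path)
        # merge_and_unload() folds the LoRA weights in and drops the wrapper.
        model = model.merge_and_unload()
    return model, tokenizer


print(split_adapter_paths("saves/stage1, saves/stage2"))
```

A call such as `load_and_merge("path/to/base-model", split_adapter_paths("saves/stage1, saves/stage2"))` would return a plain model and its tokenizer; the paths here are placeholders.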

Step 4: Weight Merging

Merge the LoRA adapter matrices into the base model weights. For each adapted layer, the effective weight W' = W + (α/r)·BA is computed and replaces the original weight W, where r is the LoRA rank and α is the LoRA scaling hyperparameter (lora_alpha). After merging, the model no longer requires the adapter files and behaves as a standard pre-trained model.

What happens:

  • For each LoRA-adapted layer, the product BA of the low-rank matrices is scaled by lora_alpha / r (r being the rank) and added to the original weight
  • The PEFT wrapper is removed, leaving a standard model
  • For PiSSA adapters, the residual model is used as the base for merging
  • The merged model's architecture is identical to the original base model
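The per-layer update can be sketched in pure Python on a toy layer. This assumes the common LoRA convention in which the product BA is scaled by lora_alpha / r before being added to W; the matrices and the helper names `matmul` and `merge_lora` are illustrative only.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_a, lora_b, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A for a single layer."""
    rank = len(lora_a)              # A has shape (r, in_features)
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # out_features=2, in_features=3
A = [[1.0, 2.0, 3.0]]                    # r=1, in_features=3
B = [[0.5], [-1.0]]                      # out_features=2, r=1

merged = merge_lora(W, A, B, lora_alpha=2.0)
print(merged)

# The merged layer gives the same output as base path plus adapter path.
x = [[1.0], [1.0], [1.0]]
adapter_out = matmul(B, matmul(A, x))
separate = [[b[0] + 2.0 * a[0]] for b, a in zip(matmul(W, x), adapter_out)]
print(matmul(merged, x), separate)
```

Because the merged matrix has the same shape as W, the layer needs no extra parameters or adapter files at inference time.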

Step 5: Post-Training Quantization (Optional)

Apply post-training quantization to reduce the merged model's size and memory footprint. GPTQ quantization uses a calibration dataset to determine optimal quantization parameters for each layer.

Key considerations:

  • GPTQ: Requires a calibration dataset, produces models compatible with AutoGPTQ and Transformers
  • AWQ: Activation-aware quantization with similar calibration requirements
  • AutoRound: Automatic quantization with built-in calibration
  • Quantization bit width is typically 4 or 8 bits
  • The quantized model can be significantly smaller (e.g., a 7B model shrinks from ~14GB in 16-bit precision to ~4GB at 4-bit)
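The size figures above follow from simple arithmetic on parameter count and bit width. The sketch below is a rough estimate that counts weights only; the function name is hypothetical, and real files come out somewhat larger because quantization scales, zero points, and any layers left unquantized add overhead.

```python
def weight_size_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_7b = 7_000_000_000
fp16_gb = weight_size_gb(params_7b, 16)
int4_gb = weight_size_gb(params_7b, 4)

print(f"16-bit: {fp16_gb:.1f} GB")
print(f"4-bit:  {int4_gb:.1f} GB (before quantization metadata)")
```

The 4-bit estimate of 3.5 GB, plus that overhead, is consistent with the roughly 4GB figure quoted for a 7B model.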

Step 6: Export and Save

Save the final model in the requested format to the export directory. The export function also handles format conversion and optional model hub uploading.

What happens:

  • The merged (and optionally quantized) model is saved in safetensors or PyTorch format
  • The tokenizer and configuration files are saved alongside the model
  • An Ollama modelfile is generated if export_ollama_param is set
  • If configured, the model is pushed to HuggingFace Hub or ModelScope
  • The exported model is ready for direct loading by any HuggingFace-compatible framework
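A quick sanity check on the export directory can catch an incomplete save before deployment. The sketch below assumes the usual HuggingFace layout (a `config.json`, a `tokenizer_config.json`, and at least one `.safetensors` or `.bin` weight file); the function name `check_export_dir` is hypothetical, and the demo uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def check_export_dir(export_dir) -> list:
    """Return the names of expected artifacts missing from export_dir."""
    root = Path(export_dir)
    missing = []
    for name in ("config.json", "tokenizer_config.json"):
        if not (root / name).is_file():
            missing.append(name)
    # Weights may be a single file or several shards, in either format.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        missing.append("*.safetensors or *.bin weight file")
    return missing


with tempfile.TemporaryDirectory() as tmp:
    empty_report = check_export_dir(tmp)
    for name in ("config.json", "tokenizer_config.json", "model.safetensors"):
        (Path(tmp) / name).touch()  # empty stand-ins for real artifacts
    full_report = check_export_dir(tmp)

print(empty_report)
print(full_report)
```

Once the check passes on a real export, the directory can typically be loaded with `AutoModelForCausalLM.from_pretrained(export_dir)` from the `transformers` library.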
