Workflow:Unslothai Unsloth Model Export And Deployment

Knowledge Sources	Unsloth Unsloth Docs Saving to GGUF Deployment Guide
Domains	LLMs, Model_Export, Deployment
Last Updated	2026-02-07 09:00 GMT

Overview

End-to-end process for exporting a fine-tuned Unsloth model to various deployment formats including merged SafeTensors, GGUF for llama.cpp, and HuggingFace Hub upload.

Description

This workflow covers the post-training pipeline for converting a fine-tuned model (with LoRA adapters) into deployment-ready formats. It handles the critical step of dequantizing 4-bit base weights, merging LoRA adapter deltas back into the full-precision model, and then converting to target formats. Supported export targets include merged 16-bit SafeTensors (for vLLM, SGLang, and HuggingFace inference), GGUF quantized formats (for llama.cpp, Ollama, and LM Studio), and direct Hub upload. The workflow also generates Ollama Modelfile templates with correct chat template mappings for 40+ model families.

Key capabilities:

Lossless LoRA merge preserving model quality (validated via perplexity tests)
GGUF export with 30+ quantization methods (q4_k_m, q8_0, f16, etc.)
Automatic Ollama Modelfile generation with correct chat templates
HuggingFace Hub upload with sharded SafeTensors support
SentencePiece tokenizer GGUF compatibility fixes

Usage

Execute this workflow after completing model training (SFT, GRPO, or other) when you need to deploy the model for inference. Choose the appropriate export format based on your deployment target: merged SafeTensors for cloud GPU inference (vLLM, SGLang), GGUF for local CPU/GPU inference (llama.cpp, Ollama, LM Studio), or Hub upload for sharing and collaborative use.

Execution Steps

Step 1: LoRA Adapter Merge

Merge the trained LoRA adapter weights back into the base model. This process iterates over all model layers, dequantizes the 4-bit base weights to full precision, computes the LoRA delta (B @ A scaled by alpha/rank), adds it to the base weights, and produces a full-precision merged model. The merge handles edge cases like shared embedding/LM head weights and layernorm parameters.

Key considerations:

Choose save_method: merged_16bit (recommended), merged_4bit, or lora (adapter only)
merged_16bit produces the highest quality output by merging at full precision
merged_4bit re-quantizes after merge for smaller model size
lora saves only the adapter weights (requires base model at inference time)

Step 2: SafeTensors Export

Save the merged model in HuggingFace SafeTensors format with the model configuration, tokenizer, and all necessary metadata. The output directory contains a complete model that can be loaded by any HuggingFace-compatible framework. For large models, weights are automatically sharded into multiple SafeTensors files with an index.

Output contents:

model.safetensors (or sharded model-00001-of-N.safetensors files)
config.json, tokenizer.json, tokenizer_config.json
generation_config.json, special_tokens_map.json

Step 3: GGUF Conversion

Convert the merged model to GGUF format for use with llama.cpp and compatible inference engines. This step invokes llama.cpp's convert_hf_to_gguf.py to translate the HuggingFace model into GGUF binary format, then applies quantization to reduce model size. Unsloth automatically installs and builds llama.cpp if not present, and fixes SentencePiece tokenizer compatibility issues.

Quantization options:

f16: Full 16-bit precision (largest, best quality)
q8_0: 8-bit quantization (good quality, moderate size)
q4_k_m: 4-bit K-means quantization (balanced quality/size)
q4_0, q5_0, q5_k_m, q6_k: various quality-size tradeoffs
Multiple quantizations can be applied from a single merge

Step 4: Ollama Template Generation

Generate an Ollama Modelfile with the correct chat template for the model family. Unsloth maintains a mapping of over 40 model families to their Ollama template syntax, ensuring that deployed models use the correct prompt formatting for inference. The Modelfile includes the model path, template definition, stop tokens, and recommended sampling parameters.

Key considerations:

Template is auto-selected based on the model architecture
Stop tokens are correctly configured for each model family
Sampling parameters (temperature, min_p) are set to recommended defaults

Step 5: HuggingFace Hub Upload

Upload the exported model to HuggingFace Hub for sharing and deployment. This supports both merged SafeTensors and GGUF uploads. The upload process creates a repository, handles large file uploads via LFS, and publishes model cards with training metadata.

Upload options:

push_to_hub_merged: upload merged SafeTensors model
push_to_hub_gguf: upload GGUF quantized model
Supports private and public repositories
Handles authentication via HuggingFace token

Execution Diagram

GitHub URL

Workflow Repository