Workflow:Unslothai Unsloth Model Export And Deployment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Export, Deployment |
| Last Updated | 2026-02-07 09:00 GMT |
Overview
End-to-end process for exporting a fine-tuned Unsloth model to various deployment formats including merged SafeTensors, GGUF for llama.cpp, and HuggingFace Hub upload.
Description
This workflow covers the post-training pipeline for converting a fine-tuned model (with LoRA adapters) into deployment-ready formats. It handles the critical step of dequantizing 4-bit base weights, merging LoRA adapter deltas back into the full-precision model, and then converting to target formats. Supported export targets include merged 16-bit SafeTensors (for vLLM, SGLang, and HuggingFace inference), GGUF quantized formats (for llama.cpp, Ollama, and LM Studio), and direct Hub upload. The workflow also generates Ollama Modelfile templates with correct chat template mappings for 40+ model families.
Key capabilities:
- Lossless LoRA merge preserving model quality (validated via perplexity tests)
- GGUF export with 30+ quantization methods (q4_k_m, q8_0, f16, etc.)
- Automatic Ollama Modelfile generation with correct chat templates
- HuggingFace Hub upload with sharded SafeTensors support
- SentencePiece tokenizer GGUF compatibility fixes
Usage
Execute this workflow after completing model training (SFT, GRPO, or other) when you need to deploy the model for inference. Choose the appropriate export format based on your deployment target: merged SafeTensors for cloud GPU inference (vLLM, SGLang), GGUF for local CPU/GPU inference (llama.cpp, Ollama, LM Studio), or Hub upload for sharing and collaborative use.
Execution Steps
Step 1: LoRA Adapter Merge
Merge the trained LoRA adapter weights back into the base model. This process iterates over all model layers, dequantizes the 4-bit base weights to full precision, computes the LoRA delta (B @ A scaled by alpha/rank), adds it to the base weights, and produces a full-precision merged model. The merge handles edge cases like shared embedding/LM head weights and layernorm parameters.
Key considerations:
- Choose save_method: merged_16bit (recommended), merged_4bit, or lora (adapter only)
- merged_16bit produces the highest quality output by merging at full precision
- merged_4bit re-quantizes after merge for smaller model size
- lora saves only the adapter weights (requires base model at inference time)
Step 2: SafeTensors Export
Save the merged model in HuggingFace SafeTensors format with the model configuration, tokenizer, and all necessary metadata. The output directory contains a complete model that can be loaded by any HuggingFace-compatible framework. For large models, weights are automatically sharded into multiple SafeTensors files with an index.
Output contents:
- model.safetensors (or sharded model-00001-of-N.safetensors files)
- config.json, tokenizer.json, tokenizer_config.json
- generation_config.json, special_tokens_map.json
Step 3: GGUF Conversion
Convert the merged model to GGUF format for use with llama.cpp and compatible inference engines. This step invokes llama.cpp's convert_hf_to_gguf.py to translate the HuggingFace model into GGUF binary format, then applies quantization to reduce model size. Unsloth automatically installs and builds llama.cpp if not present, and fixes SentencePiece tokenizer compatibility issues.
Quantization options:
- f16: Full 16-bit precision (largest, best quality)
- q8_0: 8-bit quantization (good quality, moderate size)
- q4_k_m: 4-bit K-means quantization (balanced quality/size)
- q4_0, q5_0, q5_k_m, q6_k: various quality-size tradeoffs
- Multiple quantizations can be applied from a single merge
Step 4: Ollama Template Generation
Generate an Ollama Modelfile with the correct chat template for the model family. Unsloth maintains a mapping of over 40 model families to their Ollama template syntax, ensuring that deployed models use the correct prompt formatting for inference. The Modelfile includes the model path, template definition, stop tokens, and recommended sampling parameters.
Key considerations:
- Template is auto-selected based on the model architecture
- Stop tokens are correctly configured for each model family
- Sampling parameters (temperature, min_p) are set to recommended defaults
Step 5: HuggingFace Hub Upload
Upload the exported model to HuggingFace Hub for sharing and deployment. This supports both merged SafeTensors and GGUF uploads. The upload process creates a repository, handles large file uploads via LFS, and publishes model cards with training metadata.
Upload options:
- push_to_hub_merged: upload merged SafeTensors model
- push_to_hub_gguf: upload GGUF quantized model
- Supports private and public repositories
- Handles authentication via HuggingFace token