Workflow:Ggml org Llama cpp HF to GGUF Model Conversion

From Leeroopedia
Knowledge Sources
Domains: LLMs, Model_Conversion, GGUF
Last Updated: 2026-02-14 22:00 GMT

Overview

End-to-end process for converting HuggingFace transformer models into the GGUF format used by llama.cpp for efficient inference.

Description

This workflow covers the complete pipeline for taking a pre-trained model from the HuggingFace Hub (stored as safetensors or PyTorch .bin weights) and converting it to the GGUF (GGML Universal File) format. The conversion maps architecture-specific tensor names and layouts to the standardized naming scheme used by llama.cpp, embeds tokenizer metadata, and writes the result as a single portable binary file. The converter supports 80+ model architectures, including LLaMA, Mistral, Qwen, Gemma, Phi, RWKV, Mamba, and DeepSeek. The output GGUF file can then be used directly with llama.cpp tools for inference, quantization, or serving.
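The tensor-name mapping described above can be illustrated with a small sketch. It covers only a hand-written subset of LLaMA-style names and is not the converter's own code; the real mapping tables live in gguf-py and cover many more architectures. The function name map_tensor_name and both dictionaries are hypothetical, introduced here for illustration.

```python
import re

# Illustrative subset of LLaMA-style mappings (HF module -> GGUF module).
# Hand-written for this sketch; the converter uses gguf-py's own tables.
_TOP_LEVEL = {
    "model.embed_tokens": "token_embd",
    "model.norm": "output_norm",
    "lm_head": "output",
}
_PER_LAYER = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
    "input_layernorm": "attn_norm",
    "post_attention_layernorm": "ffn_norm",
}
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.(.+)\.(weight|bias)$")
_TOP_RE = re.compile(r"^(.+)\.(weight|bias)$")


def map_tensor_name(hf_name):
    """Return the GGUF-style name for a known HF tensor name, or None."""
    match = _LAYER_RE.match(hf_name)
    if match:
        layer, module, suffix = match.groups()
        target = _PER_LAYER.get(module)
        return f"blk.{layer}.{target}.{suffix}" if target else None
    match = _TOP_RE.match(hf_name)
    if match:
        module, suffix = match.groups()
        target = _TOP_LEVEL.get(module)
        return f"{target}.{suffix}" if target else None
    return None


if __name__ == "__main__":
    for name in ("model.embed_tokens.weight",
                 "model.layers.0.self_attn.q_proj.weight"):
        print(name, "->", map_tensor_name(name))
```

Names that return None here are simply outside the sketch's subset; that says nothing about whether the actual converter supports them.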

Usage

Execute this workflow when you have a model on the HuggingFace Hub (or locally in HuggingFace format with config.json and safetensors/bin weight files) and need to use it with llama.cpp. This is the foundational step before quantization, inference, or deployment through any llama.cpp tool.

Execution Steps

Step 1: Environment Setup

Prepare a Python environment with the required dependencies. The conversion scripts depend on the gguf-py package (included in the repository), PyTorch, NumPy, SentencePiece, and the HuggingFace transformers library. Create a virtual environment and install dependencies from the provided requirements files.

Key considerations:

  • Python 3.11 or later is recommended
  • The gguf-py package must be installed from the repository's gguf-py directory
  • Some model architectures may require additional dependencies (e.g., tiktoken for certain tokenizers)
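A quick pre-flight check along these lines can confirm that the environment is ready before a long conversion starts. This is a minimal sketch: the import names in REQUIRED_MODULES are assumed to match the packages listed above, and the helper names are hypothetical.

```python
import importlib.util
import sys

# Import names assumed for the dependencies named in this step.
REQUIRED_MODULES = ["gguf", "torch", "numpy", "sentencepiece", "transformers"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_recent(minimum=(3, 11)):
    """True when the running interpreter meets the recommended version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    if not python_is_recent():
        print("Python", sys.version.split()[0], "is older than the recommended 3.11")
    for name in missing_modules(REQUIRED_MODULES):
        print("missing module:", name)
```

Optional dependencies such as tiktoken can be appended to the list when the target model's tokenizer needs them.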

Step 2: Obtain the Source Model

Download the pre-trained model from HuggingFace Hub using git-lfs or the HuggingFace CLI. The model directory must contain the configuration file (config.json), tokenizer files, and weight files (safetensors or pytorch bin format).

Key considerations:

  • Ensure git-lfs is installed for downloading large weight files
  • Some models have gated access requiring a HuggingFace token
  • Verify the model architecture is among the 80+ supported architectures
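After downloading, the directory contents can be sanity-checked with a short script such as the one below. It is a sketch under stated assumptions: the function name check_model_dir is hypothetical, weight files are assumed to sit at the top level of the directory, and the git-lfs check relies on pointer files starting with the standard "version https://git-lfs.github.com/spec/" line.

```python
import json
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/"


def check_model_dir(model_dir):
    """Return (architectures, problems) for a HuggingFace-format directory."""
    root = Path(model_dir)
    problems = []
    architectures = []

    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        architectures = config.get("architectures") or []
        if not architectures:
            problems.append("config.json has no 'architectures' entry")
    else:
        problems.append("config.json is missing")

    weights = sorted(root.glob("*.safetensors")) + sorted(root.glob("*.bin"))
    if not weights:
        problems.append("no .safetensors or .bin weight files found")

    # A clone made without git-lfs leaves small text pointers in place of weights.
    for weight_file in weights:
        with weight_file.open("rb") as fh:
            if fh.read(64).startswith(LFS_POINTER_PREFIX):
                problems.append(f"{weight_file.name} is a git-lfs pointer, not real weights")

    return architectures, problems


if __name__ == "__main__":
    archs, issues = check_model_dir(".")
    print("architectures:", archs)
    for issue in issues:
        print("problem:", issue)
```

The architectures value read from config.json is what to compare against the converter's list of supported architectures.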

Step 3: Inspect the Source Model

Examine the original model's tensor structure and metadata to verify compatibility and understand the architecture. This step uses utility scripts to list tensor names, shapes, and data types from the source model files.

Key considerations:

  • Check that the model architecture is recognized by the converter
  • Note the original precision (typically FP16 or BF16) for comparison
  • Identify any non-standard layers or custom architectures that may need special handling
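For safetensors weights, tensor names, shapes, and data types can be listed without loading any tensor data, because the file begins with an 8-byte little-endian header length followed by a JSON header. The sketch below uses only the standard library and is not one of the repository's utility scripts; the function name read_safetensors_header is hypothetical.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: (dtype, shape)} from a safetensors file header."""
    with open(path, "rb") as fh:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", fh.read(8))
        header = json.loads(fh.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return {name: (info["dtype"], info["shape"]) for name, info in header.items()}


if __name__ == "__main__":
    import sys
    for name, (dtype, shape) in sorted(read_safetensors_header(sys.argv[1]).items()):
        print(f"{name}\t{dtype}\t{shape}")
```

Sharded checkpoints store one header per shard, so the function would be called once for each weight file.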

Step 4: Run the Conversion Script

Execute the main conversion script (convert_hf_to_gguf.py) which reads the HuggingFace model files, maps tensor names to GGUF conventions, converts tokenizer and metadata, and writes the output GGUF file. The script automatically detects the model architecture from config.json and selects the appropriate conversion class.

Key considerations:

  • The output precision defaults to FP16 but can be set to FP32 or BF16 with the --outtype option
  • Vocabulary type (BPE, SentencePiece, WordPiece) is auto-detected
  • Chat templates embedded in the tokenizer config are preserved in the GGUF metadata
  • Large models may require significant RAM during conversion
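The invocation can be assembled programmatically, which helps when converting several models in a batch. The sketch below only builds the argument list; the paths are placeholders, the helper name build_convert_command is hypothetical, and the script is assumed to be run from the llama.cpp repository root.

```python
import sys

# Precisions named in this step; the script may accept further values.
ALLOWED_OUTTYPES = ("f32", "f16", "bf16")


def build_convert_command(model_dir, outfile, outtype="f16",
                          script="convert_hf_to_gguf.py"):
    """Return the argv list for one conversion run (nothing is executed)."""
    if outtype not in ALLOWED_OUTTYPES:
        raise ValueError(f"unsupported outtype: {outtype!r}")
    return [
        sys.executable, str(script), str(model_dir),
        "--outfile", str(outfile),
        "--outtype", outtype,
    ]


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    print(" ".join(build_convert_command("models/my-model", "my-model-f16.gguf")))
```

Passing the resulting list to subprocess.run(cmd, check=True) would run the conversion and raise an error if the script exits with a non-zero status.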

Step 5: Verify the Converted Model

Validate the converted GGUF file by inspecting its metadata and tensor structure, then run a test inference to compare outputs against the original model. The verification scripts generate logits from both the original and converted models on the same input and compare them.
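As a first sanity check, the fixed-size GGUF header can be read directly. The sketch below assumes a little-endian file of GGUF version 2 or later, where the magic bytes are followed by a 32-bit version and two 64-bit counts; it is not a replacement for the repository's own inspection tools, and the function name read_gguf_header is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return version, tensor count, and metadata key/value count."""
    with open(path, "rb") as fh:
        raw = fh.read(24)
    if len(raw) < 24 or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Layout after the magic: uint32 version, uint64 tensor count, uint64 KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensor_count": tensor_count,
            "metadata_kv_count": kv_count}


if __name__ == "__main__":
    import sys
    print(read_gguf_header(sys.argv[1]))
```

A tensor count that differs from the number of tensors seen in the source model is worth investigating, although some converters merge, split, or skip tensors by design.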

Key considerations:

  • Logit comparison should show near-zero divergence for FP16 conversion
  • Use the debug tool to dump embeddings and logits for detailed comparison
  • Check that special tokens (BOS, EOS, padding) are correctly mapped

Step 6: Upload or Distribute

Optionally upload the converted GGUF file to HuggingFace Hub for distribution. Utility scripts are provided for creating model repositories, uploading files, and organizing models into collections.

Key considerations:

  • GGUF files can be very large; consider quantizing before distribution
  • Use the gguf-split tool (built as llama-gguf-split) for files exceeding platform size limits
  • Include model card metadata for discoverability
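Whether a file needs splitting can be decided before upload. The sketch below is an assumption-laden illustration: the size limit must be supplied by the caller because it depends on the hosting platform, the default shard size of "40G" is an arbitrary placeholder, and the --split and --split-max-size flags should be confirmed against the tool's --help output for the installed build. Both helper names are hypothetical.

```python
import os


def exceeds_limit(path, limit_bytes):
    """True when the file is larger than the given platform limit."""
    return os.path.getsize(path) > limit_bytes


def build_split_command(gguf_path, out_prefix, max_size="40G",
                        tool="llama-gguf-split"):
    """Return the argv list for splitting a GGUF file (nothing is executed).

    The flags used here are assumed from the tool's usage text; verify them
    with `llama-gguf-split --help` before relying on this.
    """
    return [tool, "--split", "--split-max-size", max_size,
            str(gguf_path), str(out_prefix)]


if __name__ == "__main__":
    # Placeholder file name for illustration only.
    print(" ".join(build_split_command("my-model-f16.gguf", "my-model-f16")))
```

Quantizing first often brings the file under the limit, in which case no split is needed.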
