Workflow:Ollama Ollama Safetensors To GGUF Conversion

From Leeroopedia
Knowledge Sources
Domains: LLMs, Model_Conversion, Data_Engineering
Last Updated: 2026-02-14 22:00 GMT

Overview

An end-to-end process for converting a model from HuggingFace SafeTensors or PyTorch format into the GGUF format that Ollama loads, so that the model can run locally via Ollama.

Description

This workflow covers the conversion of externally trained models into GGUF, the binary model file format used by Ollama's inference engine. The conversion pipeline reads model weights from SafeTensors or PyTorch files, maps architecture-specific tensor names to GGUF conventions, extracts the tokenizer configuration (vocabulary, merges, special tokens, chat template), and writes model hyperparameters as GGUF metadata. It can optionally quantize the weights to a lower-precision format. Ollama includes 25+ architecture-specific converters covering LLaMA, Gemma, Qwen, DeepSeek, Mistral, Phi, BERT, and more.

Usage

Execute this workflow when you have a model in HuggingFace SafeTensors or PyTorch format (typically downloaded from HuggingFace Hub) and want to run it locally with Ollama. This is necessary for models not yet available in the Ollama library, custom fine-tuned models, or when you need a specific quantization level.

Execution Steps

Step 1: Model Architecture Detection

Read the model's configuration file (config.json) to determine its architecture type. The configuration specifies the model class (e.g., LlamaForCausalLM, Gemma2ForCausalLM, Qwen2ForCausalLM), which maps to one of the 25+ architecture-specific converters in Ollama's convert package. Each converter knows the tensor name mappings and hyperparameter extraction logic for its architecture.

Key considerations:

  • The architectures field in config.json determines which converter to use
  • Some architectures have sub-variants (e.g., Gemma 2 vs Gemma 3, dense vs MoE)
  • Multimodal models (vision + text) require additional projector and vision encoder handling
  • Unsupported architectures will produce an error
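
The dispatch described in this step can be sketched as a lookup keyed on the architectures field of config.json. This is a minimal Python illustration, not Ollama's implementation (which is written in Go): the CONVERTERS table and the detect_converter helper are hypothetical names, and the table lists only the three class names mentioned above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical registry: architecture class name -> converter label.
# The real set of supported architectures lives in Ollama's convert package.
CONVERTERS = {
    "LlamaForCausalLM": "llama",
    "Gemma2ForCausalLM": "gemma2",
    "Qwen2ForCausalLM": "qwen2",
}


def detect_converter(model_dir):
    """Return the converter label for the model described by config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    architectures = config.get("architectures") or []
    if not architectures:
        raise ValueError("config.json has no 'architectures' entry")
    name = architectures[0]
    if name not in CONVERTERS:
        # Mirrors the "unsupported architectures produce an error" rule.
        raise ValueError(f"unsupported architecture: {name}")
    return CONVERTERS[name]


# Usage: write a throwaway config.json and detect its converter.
with tempfile.TemporaryDirectory() as model_dir:
    sample = {"architectures": ["LlamaForCausalLM"]}
    (Path(model_dir) / "config.json").write_text(json.dumps(sample))
    print(detect_converter(model_dir))
```

Failing early on an unknown class name keeps the later steps from running against tensor names they cannot map.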

Step 2: Tensor Reading and Name Mapping

Read all tensor data from the SafeTensors or PyTorch files and map each tensor's name from the source format to the GGUF naming convention. Each architecture converter defines a tensor name mapping table that translates HuggingFace-style names (e.g., model.layers.0.self_attn.q_proj.weight) to GGUF names (e.g., blk.0.attn_q.weight). Some converters also perform tensor transformations such as splitting or merging attention heads.
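
As a rough illustration of such a mapping table, the Python sketch below applies ordered substring replacements to a HuggingFace-style name. The REPLACEMENTS list and the to_gguf_name function are hypothetical and cover only a few LLaMA-style tensors; the actual tables are defined per architecture and are more complete.

```python
# Illustrative subset of a LLaMA-style mapping (assumption: not the full table).
# Order matters: specific fragments are rewritten before the generic layer prefix.
REPLACEMENTS = [
    ("model.embed_tokens", "token_embd"),
    ("model.norm", "output_norm"),
    ("lm_head", "output"),
    ("self_attn.q_proj", "attn_q"),
    ("self_attn.k_proj", "attn_k"),
    ("self_attn.v_proj", "attn_v"),
    ("self_attn.o_proj", "attn_output"),
    ("model.layers", "blk"),
]


def to_gguf_name(name):
    """Translate a source tensor name by applying each replacement in order."""
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return name


print(to_gguf_name("model.layers.0.self_attn.q_proj.weight"))  # blk.0.attn_q.weight
print(to_gguf_name("model.embed_tokens.weight"))               # token_embd.weight
```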

Key considerations:

  • SafeTensors files support zero-copy memory-mapped reading for efficiency
  • PyTorch files are read via pickle deserialization
  • Multi-file models (split across multiple .safetensors files) are handled transparently
  • Tensor data types (float16, bfloat16, float32) are preserved during reading
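
For reference, a SafeTensors file begins with an 8-byte little-endian header length, followed by a JSON header that records each tensor's dtype, shape, and byte offsets into the data section. The sketch below parses that header from an in-memory blob. The read_safetensors_header helper is a hypothetical name and the sample tensor is invented for the example; it shows the file layout only, not how Ollama's reader is structured.

```python
import json
import struct


def read_safetensors_header(data):
    """Return (tensor table, offset where tensor bytes begin)."""
    (header_len,) = struct.unpack("<Q", data[:8])
    header = json.loads(data[8:8 + header_len])
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header, 8 + header_len


# Build a tiny in-memory file holding one float16 tensor of shape [2, 2].
sample_header = {
    "model.norm.weight": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]}
}
payload = json.dumps(sample_header).encode("utf-8")
blob = struct.pack("<Q", len(payload)) + payload + bytes(8)

tensors, data_start = read_safetensors_header(blob)
for name, info in tensors.items():
    begin, end = info["data_offsets"]  # relative to data_start
    raw = blob[data_start + begin:data_start + end]
    print(name, info["dtype"], info["shape"], len(raw), "bytes")
```

Because the header carries absolute sizes and offsets, a reader can memory-map the file and slice out each tensor without copying, which is what makes the zero-copy path possible.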

Step 3: Hyperparameter Extraction

Extract model hyperparameters from the configuration file and write them as GGUF key-value metadata. This includes architectural dimensions (hidden size, number of layers, number of attention heads, intermediate size), attention configuration (head dimensions, KV heads for GQA), positional encoding parameters (RoPE frequency base, scaling), and vocabulary size.

Key considerations:

  • Each architecture converter maps config.json fields to standardized GGUF keys
  • Some parameters require computation (e.g., GQA head counts from total heads)
  • Sliding window attention parameters are extracted when present
  • MoE-specific parameters (number of experts, top-k routing) are included for MoE architectures
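
A minimal sketch of this mapping for a LLaMA-style config is shown below. The llama_metadata function is a hypothetical helper; the keys follow the hierarchical GGUF naming convention but are only an illustrative subset, and the fallback values (KV heads defaulting to the total head count, a RoPE base of 10000.0) are assumptions made for the example rather than confirmed converter behavior.

```python
def llama_metadata(config):
    """Map selected config.json fields to GGUF-style metadata keys."""
    heads = config["num_attention_heads"]
    kv = {
        "general.architecture": "llama",
        "llama.block_count": config["num_hidden_layers"],
        "llama.embedding_length": config["hidden_size"],
        "llama.feed_forward_length": config["intermediate_size"],
        "llama.attention.head_count": heads,
        # Assumption: without an explicit GQA setting, every head has its own KV head.
        "llama.attention.head_count_kv": config.get("num_key_value_heads", heads),
        # Assumption: 10000.0 is used here as the default RoPE frequency base.
        "llama.rope.freq_base": config.get("rope_theta", 10000.0),
    }
    # Sliding window attention is only recorded when the config provides it.
    if config.get("sliding_window") is not None:
        kv["llama.attention.sliding_window"] = config["sliding_window"]
    return kv


example_config = {
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "intermediate_size": 11008,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}
for key, value in llama_metadata(example_config).items():
    print(key, "=", value)
```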

Step 4: Tokenizer Extraction

Parse the tokenizer configuration from the model directory. Ollama supports three tokenizer source formats: HuggingFace tokenizer.json (BPE), SentencePiece .model files, and HuggingFace tokenizer_config.json. The extraction captures the full vocabulary (tokens, scores, types), merge rules, special tokens (BOS, EOS, padding, unknown), pre-tokenization rules, and the chat template string.

Key considerations:

  • BPE tokenizers read from tokenizer.json with vocabulary and merge rules
  • SentencePiece tokenizers read from the binary .model protobuf file
  • Added tokens (special tokens added after initial training) are merged into the vocabulary
  • The chat template (Jinja2 format from HuggingFace) is stored for prompt formatting
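
The BPE path can be sketched as follows, assuming the usual tokenizer.json layout with a model.vocab map, a model.merges list, and a top-level added_tokens list. The load_bpe_vocab function and the sample document are hypothetical, and the sketch assumes token ids are dense so that sorting by id yields the vocabulary order.

```python
import json


def load_bpe_vocab(tokenizer_json):
    """Return (tokens ordered by id, merge rules, ids of special tokens)."""
    data = json.loads(tokenizer_json)
    model = data["model"]
    vocab = dict(model["vocab"])
    special_ids = set()
    # Added tokens are merged into the base vocabulary.
    for added in data.get("added_tokens", []):
        vocab[added["content"]] = added["id"]
        if added.get("special"):
            special_ids.add(added["id"])
    tokens = [tok for tok, _ in sorted(vocab.items(), key=lambda item: item[1])]
    # Merges appear either as "a b" strings or as ["a", "b"] pairs.
    merges = [m if isinstance(m, str) else " ".join(m) for m in model.get("merges", [])]
    return tokens, merges, special_ids


sample = json.dumps({
    "model": {"type": "BPE", "vocab": {"h": 0, "i": 1, "hi": 2}, "merges": ["h i"]},
    "added_tokens": [{"id": 3, "content": "<|eot|>", "special": True}],
})
tokens, merges, special_ids = load_bpe_vocab(sample)
print(tokens, merges, special_ids)
```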

Step 5: GGUF File Assembly

Assemble all extracted components into a single GGUF file. A GGUF file consists of a fixed header, metadata key-value pairs, a list of tensor descriptors (name, shape, type, and data offset), and finally the tensor data. Metadata includes the architecture type, hyperparameters, tokenizer configuration, and alignment information. Tensor data is written sequentially with alignment padding between tensors. The resulting file is a self-contained model that Ollama can load directly.

Key considerations:

  • GGUF uses a fixed header format with magic number, version, and counts
  • Metadata keys follow a hierarchical naming convention (general.architecture, llama.embedding_length)
  • Tensor data is aligned to 32-byte boundaries by default for efficient memory-mapped access
  • The output file can be further quantized to reduce size
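
The header layout and alignment rule can be illustrated with Python's struct module. This sketch assumes GGUF version 3 with little-endian encoding and handles only string and uint32 metadata values; build_header, gguf_kv, gguf_string, and padding are hypothetical helper names, and tensor descriptors are omitted for brevity.

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_VERSION = 3     # assumption: version 3 of the format
TYPE_UINT32 = 4      # metadata value type ids from the GGUF specification
TYPE_STRING = 8
ALIGNMENT = 32


def gguf_string(text):
    """Encode a string as a uint64 length followed by UTF-8 bytes."""
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


def gguf_kv(key, value):
    """Encode one metadata pair (only str and uint32 values in this sketch)."""
    if isinstance(value, str):
        return gguf_string(key) + struct.pack("<I", TYPE_STRING) + gguf_string(value)
    return gguf_string(key) + struct.pack("<II", TYPE_UINT32, value)


def padding(offset, align=ALIGNMENT):
    """Number of zero bytes needed to reach the next aligned offset."""
    return (align - offset % align) % align


def build_header(metadata, tensor_count=0):
    """Magic, version, tensor count, metadata count, then the metadata pairs."""
    out = GGUF_MAGIC + struct.pack("<IQQ", GGUF_VERSION, tensor_count, len(metadata))
    for key, value in metadata.items():
        out += gguf_kv(key, value)
    return out


header = build_header({"general.architecture": "llama", "llama.block_count": 32})
pad = padding(len(header))
print(len(header), "header bytes,", pad, "padding bytes before tensor data")
```

In a complete writer, the tensor descriptors would follow the metadata, and the same padding computation would be applied before the data section and after each tensor.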

Step 6: Optional Quantization

Optionally quantize the model weights from their original precision (typically float16 or bfloat16) to a lower-precision format. Ollama supports multiple quantization levels (Q4_0, Q4_K_M, Q5_K_M, Q8_0, etc.) that trade off model quality for reduced memory usage and faster inference. Quantization is applied per-tensor with type-specific handling for different layer types.

Key considerations:

  • Different tensor types (attention weights, FFN weights, embeddings) may use different quantization levels
  • K-quant formats (Q4_K_M, Q5_K_M) generally provide better quality than basic block quantization (e.g., Q4_0) at a similar bit width
  • Embedding and output tensors are typically kept at higher precision
  • 1-bit and 2-bit importance-based quantization (IQ) formats are available for extreme compression
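
To make the quality versus size trade-off concrete, the sketch below implements a simplified Q8_0-style block quantizer in pure Python: weights are grouped into blocks of 32, each block stores one scale and 32 signed 8-bit values. The function names are hypothetical, the layout follows the ggml Q8_0 block as commonly described (a float16 scale plus 32 int8 values), and Ollama's own quantizer and its per-tensor type selection are not reproduced here.

```python
import struct

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32 values


def quantize_q8_0(values):
    """Quantize a flat list of floats into (scale, int8 list) blocks."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        chunk = values[start:start + BLOCK_SIZE]
        scale = max(abs(v) for v in chunk) / 127.0
        if scale == 0.0:
            quants = [0] * BLOCK_SIZE
        else:
            # Python's round() rounds halves to even; reference code may differ.
            quants = [round(v / scale) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from the quantized blocks."""
    return [scale * q for scale, quants in blocks for q in quants]


def pack_block(scale, quants):
    """Serialize one block as a float16 scale followed by 32 signed bytes."""
    return struct.pack("<e32b", scale, *quants)


weights = [(i - 16) / 16.0 for i in range(64)]
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(len(blocks), "blocks,", len(pack_block(*blocks[0])), "bytes each, max error", worst)
```

Each packed block takes 34 bytes for 32 weights, compared with 64 bytes at float16, which is where the memory saving comes from. Lower-bit formats shrink the per-weight storage further at the cost of a larger reconstruction error.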
