Implementation: ggml-org/llama.cpp convert_hf_to_gguf_update.py
| Knowledge Sources | Details |
|---|---|
| Domains | Model_Conversion |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Developer tool that downloads tokenizer models from Hugging Face and generates the get_vocab_base_pre() function for convert_hf_to_gguf.py.
Description
This script downloads tokenizer files for specified models from Hugging Face (using optional auth tokens), analyzes the pre-tokenizer type (SPM, BPE, WPM, UGM), computes SHA256 hashes of the tokenizer vocabulary, and generates the lookup function that maps tokenizer hashes to pre-tokenizer identifiers. The TOKENIZER_TYPE enum defines the supported tokenizer families, and the script maintains the tokenizer identification system used during model conversion.
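The hash computation described above can be sketched as follows. This is a minimal illustration, not the script's exact code: the real tool encodes a fixed check string with each model's `AutoTokenizer` and hashes the resulting token IDs, whereas here a hypothetical token-ID list stands in for the tokenizer output.

```python
from hashlib import sha256

def tokenizer_fingerprint(token_ids: list[int]) -> str:
    # Hash the textual form of the token-ID list; identical tokenizer
    # behavior on the check text yields an identical fingerprint.
    return sha256(str(token_ids).encode()).hexdigest()

# Hypothetical token IDs standing in for AutoTokenizer.encode(check_text)
example_ids = [101, 2023, 2003, 1037, 3231, 102]
fingerprint = tokenizer_fingerprint(example_ids)
```

Because the fingerprint depends only on how the tokenizer splits the check text, two models sharing a pre-tokenizer configuration produce the same hash.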
Usage
Run this script as a contributor-facing utility when adding support for a new model's tokenizer. It updates the pre-tokenizer mapping in the main conversion script, ensuring the correct pre-tokenizer is identified and embedded in the GGUF output.
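The mapping the script maintains looks roughly like the sketch below. The hashes here are placeholders, not real values from convert_hf_to_gguf.py; the structure (compare a computed hash against known constants, fail loudly on an unknown tokenizer) mirrors what the generated lookup does.

```python
def get_vocab_base_pre(chkhsh: str) -> str:
    # Placeholder hashes; the real generated function embeds actual
    # SHA256 fingerprints computed by convert_hf_to_gguf_update.py.
    if chkhsh == "0" * 64:
        return "llama-spm"   # hypothetical SPM-style model
    if chkhsh == "1" * 64:
        return "llama-bpe"   # hypothetical BPE-style model
    # Unknown tokenizers must fail so users re-run the update script
    raise NotImplementedError(
        "unknown pre-tokenizer hash; re-run convert_hf_to_gguf_update.py"
    )
```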
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: convert_hf_to_gguf_update.py
- Lines: 1-480
Signature
class TOKENIZER_TYPE(IntEnum):
    SPM = auto()
    BPE = auto()
    WPM = auto()
    UGM = auto()

def download_file_with_auth(url, token, output_path):
    ...

def download_model(model_name, token, output_dir):
    ...

def get_existing_models(convert_py_text):
    ...
Import
import logging
import os
import pathlib
import re
import requests
import json
import shutil
import argparse
from hashlib import sha256
from enum import IntEnum, auto
from transformers import AutoTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --full | flag | No | Download full list of models (requires access to all) |
| --check-missing | flag | No | Only check for missing pre-tokenizer hashes |
| hf_token | str | No | Hugging Face authentication token, passed as a positional argument (also read from ~/.cache/huggingface/token) |
Outputs
| Name | Type | Description |
|---|---|---|
| updated convert_hf_to_gguf.py | file | The main conversion script with updated get_vocab_base_pre() function |
| downloaded tokenizers | directory | Tokenizer model files downloaded from Hugging Face |
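To update convert_hf_to_gguf.py without duplicating entries, the script needs to know which pre-tokenizer identifiers are already mapped; `get_existing_models` scans the conversion script's text for them. The regex below is an assumption about the pattern, chosen to match `res = "..."` assignments inside get_vocab_base_pre(), not the script's actual implementation.

```python
import re

def get_existing_models(convert_py_text: str) -> list[str]:
    # Collect every identifier assigned via `res = "..."` in the
    # conversion script's get_vocab_base_pre() body (assumed pattern)
    return re.findall(r'res = "([^"]+)"', convert_py_text)

sample = 'if chkhsh == "abc":\n    res = "llama-bpe"\n'
get_existing_models(sample)  # ['llama-bpe']
```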
Usage Examples
# Run with HuggingFace token to update tokenizer mappings
# python convert_hf_to_gguf_update.py hf_YOUR_TOKEN_HERE
# Check for missing tokenizer hashes only
# python convert_hf_to_gguf_update.py --check-missing
# Download full model list (requires access to all models)
# python convert_hf_to_gguf_update.py --full hf_YOUR_TOKEN_HERE