Implementation: ggml-org/llama.cpp convert_hf_to_gguf_update.py
| Knowledge Sources | Details |
|---|---|
| Domains | Model_Conversion |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Developer tool that downloads tokenizer models from Hugging Face and generates the get_vocab_base_pre() function for convert_hf_to_gguf.py.
Description
This script downloads tokenizer files for specified models from Hugging Face (using optional auth tokens), analyzes the pre-tokenizer type (SPM, BPE, WPM, UGM), computes SHA256 hashes of the tokenizer vocabulary, and generates the lookup function that maps tokenizer hashes to pre-tokenizer identifiers. The TOKENIZER_TYPE enum defines the supported tokenizer families, and the script maintains the tokenizer identification system used during model conversion.
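The hash computation described above can be sketched as follows. This is a minimal illustration, not the script's exact code: the real tool encodes a fixed check string with each model's `AutoTokenizer` and hashes the resulting token IDs, whereas here a hypothetical token-ID list stands in for the tokenizer output.

```python
from hashlib import sha256

def tokenizer_fingerprint(token_ids: list[int]) -> str:
    # Hash the textual form of the token-ID list; identical tokenizer
    # behavior on the check text yields an identical fingerprint.
    return sha256(str(token_ids).encode()).hexdigest()

# Hypothetical token IDs standing in for AutoTokenizer.encode(check_text)
example_ids = [101, 2023, 2003, 1037, 3231, 102]
fingerprint = tokenizer_fingerprint(example_ids)
```

Because the fingerprint depends only on how the tokenizer splits the check text, two models sharing a pre-tokenizer configuration produce the same hash.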
Usage
Run this script as a contributor-facing utility when adding support for a new model's tokenizer. It updates the pre-tokenizer mapping in the main conversion script, ensuring the correct pre-tokenizer is identified and embedded in the GGUF output.
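The mapping the script maintains looks roughly like the sketch below. The hashes here are placeholders, not real values from convert_hf_to_gguf.py; the structure (compare a computed hash against known constants, fail loudly on an unknown tokenizer) mirrors what the generated lookup does.

```python
def get_vocab_base_pre(chkhsh: str) -> str:
    # Placeholder hashes; the real generated function embeds actual
    # SHA256 fingerprints computed by convert_hf_to_gguf_update.py.
    if chkhsh == "0" * 64:
        return "llama-spm"   # hypothetical SPM-style model
    if chkhsh == "1" * 64:
        return "llama-bpe"   # hypothetical BPE-style model
    # Unknown tokenizers must fail so users re-run the update script
    raise NotImplementedError(
        "unknown pre-tokenizer hash; re-run convert_hf_to_gguf_update.py"
    )
```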
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: convert_hf_to_gguf_update.py
- Lines: 1-480
Signature
class TOKENIZER_TYPE(IntEnum):
    SPM = auto()
    BPE = auto()
    WPM = auto()
    UGM = auto()

def download_file_with_auth(url, token, output_path):
    ...

def download_model(model_name, token, output_dir):
    ...

def get_existing_models(convert_py_text):
    ...
Import
import logging
import os
import pathlib
import re
import requests
import json
import shutil
import argparse
from hashlib import sha256
from enum import IntEnum, auto
from transformers import AutoTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --full | flag | No | Download full list of models (requires access to all) |
| --check-missing | flag | No | Only check for missing pre-tokenizer hashes |
| hf_token | str | No | Hugging Face authentication token, passed as a positional argument (also read from ~/.cache/huggingface/token) |
Outputs
| Name | Type | Description |
|---|---|---|
| updated convert_hf_to_gguf.py | file | The main conversion script with updated get_vocab_base_pre() function |
| downloaded tokenizers | directory | Tokenizer model files downloaded from Hugging Face |
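To update convert_hf_to_gguf.py without duplicating entries, the script needs to know which pre-tokenizer identifiers are already mapped; `get_existing_models` scans the conversion script's text for them. The regex below is an assumption about the pattern, chosen to match `res = "..."` assignments inside get_vocab_base_pre(), not the script's actual implementation.

```python
import re

def get_existing_models(convert_py_text: str) -> list[str]:
    # Collect every identifier assigned via `res = "..."` in the
    # conversion script's get_vocab_base_pre() body (assumed pattern)
    return re.findall(r'res = "([^"]+)"', convert_py_text)

sample = 'if chkhsh == "abc":\n    res = "llama-bpe"\n'
get_existing_models(sample)  # ['llama-bpe']
```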
Usage Examples
# Run with HuggingFace token to update tokenizer mappings
# python convert_hf_to_gguf_update.py hf_YOUR_TOKEN_HERE
# Check for missing tokenizer hashes only
# python convert_hf_to_gguf_update.py --check-missing
# Download full model list (requires access to all models)
# python convert_hf_to_gguf_update.py --full hf_YOUR_TOKEN_HERE