Implementation:Ggml org Llama cpp Convert HF To GGUF Update

From Leeroopedia
Knowledge Sources
Domains Model_Conversion
Last Updated 2026-02-15 00:00 GMT

Overview

Developer tool that downloads tokenizer models from Hugging Face and generates the get_vocab_base_pre() function for convert_hf_to_gguf.py.

Description

This script downloads tokenizer files for the specified models from Hugging Face (optionally using an auth token), distinguishes the tokenizer type (SPM, BPE, WPM, UGM), and computes a SHA256 hash for each tokenizer. In the upstream script, this hash is taken over the token IDs produced by encoding a fixed test string, not over the raw vocabulary file. From these hashes, the script generates the lookup function that maps each hash to a pre-tokenizer identifier. The TOKENIZER_TYPE enum defines the supported tokenizer families, and the script maintains the tokenizer identification system used during model conversion.
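The hash-to-identifier lookup described above can be sketched as follows. This is a minimal illustration, assuming the hash is computed over the string form of a token-ID list. The helper names (compute_check_hash, lookup_pre_tokenizer), the token IDs, and the identifier "example-bpe" are all hypothetical and are not taken from the script.

```python
from hashlib import sha256


def compute_check_hash(token_ids):
    """Return the SHA256 hex digest of the string form of a token-ID list."""
    # Assumption: the list is hashed via its str() form, e.g. "[101, 7592]".
    return sha256(str(token_ids).encode()).hexdigest()


def lookup_pre_tokenizer(token_ids, table):
    """Map a token-ID list to a pre-tokenizer name, or None if unknown."""
    return table.get(compute_check_hash(token_ids))


# Hypothetical token IDs standing in for the output of tokenizer.encode(...)
ids_known = [101, 7592, 2088, 102]
ids_unknown = [101, 7592, 2088, 103]

# Hypothetical lookup table: hash -> pre-tokenizer identifier
known_hashes = {compute_check_hash(ids_known): "example-bpe"}

print(lookup_pre_tokenizer(ids_known, known_hashes))
print(lookup_pre_tokenizer(ids_unknown, known_hashes))
```

Because any difference in the token IDs yields a different digest, two tokenizers that split the same text differently map to different entries, which is what makes the hash usable as an identifier.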

Usage

Run this script as a contributor-facing utility when adding support for a new model's tokenizer. It updates the pre-tokenizer mapping in the main conversion script, ensuring the correct pre-tokenizer is identified and embedded in the GGUF output.
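To illustrate what "updating the mapping" can mean in practice, here is a simplified sketch that renders the source text of a hash lookup function from a list of (name, hash) pairs. The function render_get_vocab_base_pre, the generated layout, and the sample entries are assumptions made for illustration; the function actually generated for convert_hf_to_gguf.py may have a different signature and body.

```python
def render_get_vocab_base_pre(entries):
    """Render Python source for a hash -> pre-tokenizer name lookup.

    entries: list of (name, chkhsh) pairs. The layout is illustrative only.
    """
    lines = [
        "def get_vocab_base_pre(chkhsh):",
        "    res = None",
    ]
    for name, chkhsh in entries:
        lines.append(f'    if chkhsh == "{chkhsh}":')
        lines.append(f'        res = "{name}"')
    lines.append("    return res")
    return "\n".join(lines) + "\n"


# Hypothetical entries: placeholder names and placeholder 64-character hashes
entries = [("example-bpe", "a" * 64), ("example-spm", "b" * 64)]
source = render_get_vocab_base_pre(entries)

# Execute the generated source to confirm it is valid Python
namespace = {}
exec(source, namespace)
print(namespace["get_vocab_base_pre"]("a" * 64))
```

Generating the function as text means the conversion script stays a plain Python file with no runtime dependency on the update tool.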

Code Reference

Source Location

Signature

class TOKENIZER_TYPE(IntEnum):
    SPM = auto()
    BPE = auto()
    WPM = auto()
    UGM = auto()

def download_file_with_auth(url, token, output_path):
    ...

def download_model(model_name, token, output_dir):
    ...

def get_existing_models(convert_py_text):
    ...
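The body of get_existing_models is elided above. One plausible way to recover existing entries from the conversion script's text is a regular expression over the hash comparison blocks. The sketch below is an assumption, not the script's actual implementation: the function name get_existing_models_sketch, the pattern, and the sample text are all hypothetical.

```python
import re

# Assumed block shape: an `if chkhsh == "<64 hex chars>":` line, an optional
# comment line, then a `res = "<name>"` line.
ENTRY_PATTERN = re.compile(
    r'if chkhsh == "([0-9a-f]{64})":\s*(?:#[^\n]*\n\s*)?res = "([^"]+)"'
)


def get_existing_models_sketch(convert_py_text):
    """Return a dict mapping pre-tokenizer name -> hash found in the text."""
    return {name: chkhsh for chkhsh, name in ENTRY_PATTERN.findall(convert_py_text)}


# Hypothetical excerpt of a conversion script
sample_text = f'''
        if chkhsh == "{"a" * 64}":
            # ref: hypothetical source comment
            res = "example-bpe"
        if chkhsh == "{"b" * 64}":
            res = "example-spm"
'''

existing = get_existing_models_sketch(sample_text)
print(sorted(existing))
```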

Import

import logging
import os
import pathlib
import re
import requests
import json
import shutil
import argparse
from hashlib import sha256
from enum import IntEnum, auto
from transformers import AutoTokenizer

I/O Contract

Inputs

Name | Type | Required | Description
--full | flag | No | Download the full list of models (requires access to all of them)
--check-missing | flag | No | Only check for missing pre-tokenizer hashes
hf_token | str | No | Hugging Face authentication token (also read from ~/.cache/huggingface/token)
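The inputs above could be wired up with argparse roughly as follows. This is a sketch under stated assumptions: build_parser and resolve_token are hypothetical helper names, and the fallback order (command-line token first, then the token file) is an assumption based on the table rather than a confirmed detail of the script.

```python
import argparse
import pathlib


def build_parser():
    """Build a parser with the two flags and the optional positional token."""
    parser = argparse.ArgumentParser(description="Update pre-tokenizer hashes")
    parser.add_argument("--full", action="store_true",
                        help="download the full list of models")
    parser.add_argument("--check-missing", action="store_true",
                        help="only check for missing pre-tokenizer hashes")
    parser.add_argument("hf_token", nargs="?",
                        help="optional Hugging Face token")
    return parser


def resolve_token(cli_token, token_file=None):
    """Prefer the command-line token, else fall back to the cached token file."""
    if cli_token:
        return cli_token
    if token_file is None:
        token_file = pathlib.Path.home() / ".cache" / "huggingface" / "token"
    try:
        return token_file.read_text(encoding="utf-8").strip() or None
    except OSError:
        # Missing or unreadable token file: proceed without a token
        return None


args = build_parser().parse_args(["--check-missing"])
print(args.full, args.check_missing, args.hf_token)
```

Declaring hf_token with nargs="?" keeps it optional while still letting it be passed positionally, which matches the invocation style shown in the usage examples.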

Outputs

Name | Type | Description
updated convert_hf_to_gguf.py | file | The main conversion script with an updated get_vocab_base_pre() function
downloaded tokenizers | directory | Tokenizer model files downloaded from Hugging Face

Usage Examples

# Run with a Hugging Face token to update tokenizer mappings
python convert_hf_to_gguf_update.py hf_YOUR_TOKEN_HERE

# Check for missing tokenizer hashes only
python convert_hf_to_gguf_update.py --check-missing

# Download the full model list (requires access to all models)
python convert_hf_to_gguf_update.py --full hf_YOUR_TOKEN_HERE
