# Principle: ggml-org/llama.cpp Conversion Verification
| Field | Value |
|---|---|
| Principle Name | Conversion Verification |
| Category | Quality Assurance |
| Scope | Validating model conversion correctness via logit comparison |
| Status | Active |
## Overview

### Description
After converting a model from HuggingFace format to GGUF, it is essential to verify that the converted model produces outputs consistent with the original. Without verification, conversion errors -- such as incorrect tensor mapping, dtype truncation artifacts, or vocabulary mismatches -- can go undetected and result in degraded or nonsensical model behavior.
Conversion verification compares the outputs of the original PyTorch model against the converted llama.cpp model for the same input. The comparison operates at two levels:
- Token-level verification: Confirms that both models tokenize the same input prompt into identical token sequences. Token mismatches indicate vocabulary extraction errors during conversion.
- Logit-level verification: Compares the raw logit vectors (pre-softmax output scores) produced by both models for the same input tokens. Logit comparison is more sensitive than text generation comparison because it captures numerical differences before they are masked by argmax or sampling.
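Both checks can be sketched in a few lines, assuming the token IDs and logits from each model are already available as arrays (the function names and the `atol` threshold below are illustrative, not part of llama.cpp's actual scripts):

```python
import numpy as np

def verify_tokens(ref_tokens, test_tokens):
    """Token-level check: the two sequences must match exactly."""
    if list(ref_tokens) != list(test_tokens):
        # Report the first position where the sequences diverge.
        for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)):
            if a != b:
                return False, f"first mismatch at position {i}: {a} != {b}"
        return False, "sequences differ in length"
    return True, "token sequences identical"

def verify_logits(ref_logits, test_logits, atol=0.1):
    """Logit-level check: element-wise comparison of raw scores."""
    diff = np.abs(np.asarray(ref_logits) - np.asarray(test_logits))
    return float(diff.max()) <= atol, float(diff.max())
```

Note that `verify_tokens` should run first: if it fails, the logit arrays are not positionally aligned and `verify_logits` would compare unrelated positions.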
## Usage
Verification is the step immediately following conversion. The typical workflow is:
1. Run the original PyTorch model on a reference prompt and save the output logits and token IDs to binary files.
2. Run the converted llama.cpp model on the same prompt and save its logits and token IDs.
3. Run the comparison script to check token equality and analyze logit differences.
4. If the lightweight check passes, proceed to more rigorous statistical tests (e.g., Normalized Mean Squared Error).
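The save-and-reload half of this workflow can be sketched with raw NumPy binaries; the file naming and dump layout here are illustrative assumptions, not llama.cpp's actual on-disk format:

```python
import numpy as np

# Reference side: dump logits and token IDs as raw binaries.
def save_reference(logits, token_ids, prefix="ref"):
    np.asarray(logits, dtype=np.float32).tofile(f"{prefix}_logits.bin")
    np.asarray(token_ids, dtype=np.int32).tofile(f"{prefix}_tokens.bin")

# Comparison side: reload a dump for the lightweight checks.
# The vocab size is needed to recover the (positions, vocab) shape.
def load_dump(prefix, vocab_size):
    logits = np.fromfile(f"{prefix}_logits.bin", dtype=np.float32)
    tokens = np.fromfile(f"{prefix}_tokens.bin", dtype=np.int32)
    return logits.reshape(-1, vocab_size), tokens
```

Raw binaries keep the dumps trivially portable between the PyTorch and llama.cpp sides, at the cost of having to agree on dtype and vocabulary size out of band.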
## Theoretical Basis

### Why Logits, Not Generated Text?
Comparing generated text is insufficient for verifying conversion correctness for several reasons:
- Sampling stochasticity: Text generation typically involves sampling (temperature, top-k, top-p), which means two runs of the same model can produce different text. Logits are deterministic for a given input.
- Error amplification: A small logit difference at one position can cause a different token to be selected, which then shifts the entire generated sequence. Logit comparison detects errors before this cascade occurs.
- Sensitivity: Two models can produce the same argmax token while having very different probability distributions. Logit comparison captures these distributional differences.
### Metrics for Logit Comparison
The verification process uses several metrics, applied in stages from lightweight to rigorous:
Maximum absolute difference is the simplest metric:
max_diff = max(|pytorch_logits[i] - llamacpp_logits[i]|) for all i
This identifies the worst-case divergence. For float16 conversions, typical max differences are on the order of 0.01-0.1 due to the limited precision of half-precision arithmetic.
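The sketch below isolates one contributor to that difference: the rounding error from storing float32 values in float16. It deliberately shows only the storage roundtrip; a real conversion also accumulates arithmetic differences during inference, so observed max differences are typically larger than this baseline:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a vocabulary-sized logit vector.
logits = rng.normal(size=32000).astype(np.float32)

# Round-trip through float16 to simulate half-precision storage loss.
logits_f16 = logits.astype(np.float16).astype(np.float32)

max_diff = float(np.max(np.abs(logits - logits_f16)))
# For values of this magnitude, the roundtrip error is on the order of 1e-3.
```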
Top-k agreement checks whether the highest-ranked token predictions match between the two models. If the top 10 tokens are the same (even if their exact logit values differ slightly), the models are functionally equivalent for most generation scenarios.
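A top-k agreement check is a few lines of NumPy; this version is order-insensitive (set overlap), and the function name is illustrative:

```python
import numpy as np

def topk_agreement(ref_logits, test_logits, k=10):
    """Fraction of the reference model's top-k tokens that also
    appear in the test model's top-k (ignoring rank order)."""
    ref_top = set(np.argsort(ref_logits)[-k:])
    test_top = set(np.argsort(test_logits)[-k:])
    return len(ref_top & test_top) / k
```

A stricter variant could require identical rank order rather than mere set membership; set overlap is the more forgiving check and suffices for the "functionally equivalent" claim above.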
Normalized Mean Squared Error (NMSE) provides a single scalar that summarizes the overall agreement:
NMSE = mean((pytorch_logits - llamacpp_logits)^2) / var(pytorch_logits)
NMSE values below a threshold (typically 1e-6 for f16, 1e-4 for q8_0) indicate acceptable conversion fidelity.
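The NMSE formula translates directly to NumPy. Note that `np.var` computes the population variance by default, matching the formula above (this sketch assumes both logit arrays have the same shape):

```python
import numpy as np

def nmse(ref_logits, test_logits):
    """Mean squared error normalized by the variance of the reference."""
    ref = np.asarray(ref_logits, dtype=np.float64)
    test = np.asarray(test_logits, dtype=np.float64)
    return float(np.mean((ref - test) ** 2) / np.var(ref))
```

Normalizing by the reference variance makes the metric scale-invariant: a model whose logits span [-50, 50] is judged by the same threshold as one spanning [-5, 5].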
### Token Verification as a Prerequisite
Token comparison must pass before logit comparison is meaningful. If the two models produce different token sequences for the same input text, their logit vectors correspond to different positions in the generation process and cannot be meaningfully compared. Token mismatches typically indicate:
- Missing or incorrectly ordered vocabulary entries
- Differences in special token handling (BOS, EOS, padding)
- Normalization differences in the tokenizer (NFKC, whitespace handling)
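The most common of these, divergent BOS handling, is easy to distinguish from deeper vocabulary problems because it shifts the whole sequence by one. A hypothetical diagnostic (the `bos_id` value and the heuristic itself are illustrative):

```python
def diagnose_token_mismatch(ref_tokens, test_tokens, bos_id=1):
    """Heuristic classification of common tokenizer mismatch causes."""
    ref, test = list(ref_tokens), list(test_tokens)
    if ref == test:
        return "match"
    # Off-by-one with an identical tail usually means BOS was
    # prepended on one side only.
    if ref and ref[0] == bos_id and ref[1:] == test:
        return "reference adds BOS; converted model does not"
    if test and test[0] == bos_id and test[1:] == ref:
        return "converted model adds BOS; reference does not"
    return "vocabulary or normalization mismatch"
```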
### Version Sensitivity
The verification scripts include a transformers version check. If the installed transformers library version differs from the version specified in the model's config.json (transformers_version field), the reference logits may not match the model's published behavior. Version mismatches are flagged as a potential cause of verification failure.
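A minimal sketch of such a check (the function and the plain string comparison are illustrative assumptions; the actual scripts may parse and compare versions more carefully):

```python
import json

def check_transformers_version(config_path, installed_version):
    """Compare the installed transformers version against the one
    recorded in the model's config.json (transformers_version field)."""
    with open(config_path) as f:
        recorded = json.load(f).get("transformers_version")
    if recorded is None:
        return "config.json records no transformers_version"
    if installed_version != recorded:
        return (f"mismatch: installed {installed_version}, "
                f"model converted with {recorded}")
    return "versions match"
```

In practice the installed version comes from `transformers.__version__`; it is passed in as a parameter here to keep the sketch self-contained.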