Implementation:Turboderp org Exllamav2 Model Diff
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Model_Analysis |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
CLI tool for performing layer-by-layer comparison between two ExLlamaV2 models, computing relative Frobenius norm error, perplexity, top-K accuracy/agreement, KL divergence, and MSE metrics.
Description
model_diff.py is a standalone script that loads two models (A and B) and compares their hidden state representations and output distributions across evaluation data.
Command-line arguments:
- -ma / --model_a -- Path to the first (reference) model.
- -mb / --model_b -- Path to the second (comparison) model.
- -ed / --eval_dataset -- Path to a Parquet evaluation dataset.
- -er / --eval_rows -- Number of dataset rows to evaluate (default: 20).
- -el / --eval_length -- Maximum tokens per sample (default: 2048).
- -k / --keep_layers -- Number of initial layers for which model B uses model A's hidden states (default: 0), enabling layer-swap analysis.
- -tkm / --topk_max -- Largest K evaluated for the top-K metrics (default: 5).
Processing pipeline:
1. Both models are loaded lazily (load(lazy=True)) to minimize memory usage. Modules are loaded/unloaded one at a time.
2. Embeddings are computed for all evaluation rows through each model's embedding layer.
3. For each subsequent module (layer), the script:
- Loads the module weights for both models.
- Performs a forward pass through the module for each evaluation row.
- If keep_layers is set and the current layer index is within that range, model B receives model A's hidden state (layer swapping).
- Computes the relative Frobenius norm error (rfn_error) between the two hidden states, ||y - x||_F / ||x||_F, where x is model A's hidden state and y is model B's, averaged across all rows.
- Unloads the module to free memory.
4. After all layers, the script evaluates final outputs:
- Perplexity for both models using log-softmax and gather on target tokens, processed in chunks to manage memory.
- Top-K accuracy for K=1 through topk_max: the fraction of target tokens that appear in each model's top-K predictions.
- Top-K agreement: fraction of positions where models A and B produce identical top-K sets.
- KL divergence between the output probability distributions.
- MSE between the output probability distributions.
5. Results are printed in both CSV format and human-readable format.
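The per-layer comparison in step 3 can be sketched roughly as follows. This is a dependency-free illustration, not the script's code: hidden states are flat lists of floats, each "layer" is a plain callable, both models start from the same embedded rows, and keep_layers is read as "model B copies model A's state for the first keep_layers layers". The names rfn_error (as a function) and compare_layers are hypothetical.

```python
import math

def rfn_error(x, y):
    """Relative Frobenius norm error ||y - x||_F / ||x||_F for two flat lists."""
    diff = math.sqrt(sum((b - a) ** 2 for a, b in zip(x, y)))
    ref = math.sqrt(sum(a ** 2 for a in x))
    return diff / ref

def compare_layers(layers_a, layers_b, rows, keep_layers=0):
    """Return the mean rfn_error per layer across all rows.

    layers_a / layers_b: lists of callables mapping a hidden state to a new one
    (hypothetical stand-ins for the real modules).
    rows: initial hidden states, one per evaluation row.
    """
    states_a = [list(r) for r in rows]
    states_b = [list(r) for r in rows]
    per_layer = []
    for idx, (layer_a, layer_b) in enumerate(zip(layers_a, layers_b)):
        states_a = [layer_a(s) for s in states_a]
        if idx < keep_layers:
            # Assumed reading of keep_layers: B inherits A's state for this layer.
            states_b = [list(s) for s in states_a]
        else:
            states_b = [layer_b(s) for s in states_b]
        errors = [rfn_error(a, b) for a, b in zip(states_a, states_b)]
        per_layer.append(sum(errors) / len(errors))
    return per_layer
```

Under this reading, the first keep_layers entries of the result are zero, and any error that appears afterwards comes only from model B's remaining layers.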
Usage
This tool is used to evaluate the quality impact of quantization, pruning, or other model modifications by comparing a modified model against a reference. The layer-by-layer rfn_error shows where divergence accumulates, while the output metrics show the end-to-end impact on generation quality.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: model_diff.py
- Lines: 1-261
Signature
# CLI argument parser
import argparse

parser = argparse.ArgumentParser(
    description="Test layer-by-layer hidden state difference between two models"
)
parser.add_argument("-ed", "--eval_dataset", type=str)
parser.add_argument("-er", "--eval_rows", type=int, default=20)
parser.add_argument("-el", "--eval_length", type=int, default=2048)
parser.add_argument("-ma", "--model_a", type=str)
parser.add_argument("-mb", "--model_b", type=str)
parser.add_argument("-k", "--keep_layers", type=int, default=0)
parser.add_argument("-tkm", "--topk_max", type=int, default=5)

# Internal helper (body omitted here)
def ppl(input_ids_, logits_) -> tuple[float, int]:
    ...
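Only the signature of ppl is shown above. As a rough illustration of the log-softmax-and-gather perplexity computation described in the pipeline, here is a plain-Python sketch. It assumes the returned tuple is (sum of target log-probabilities, number of targets) and that logits at position t predict the token at position t + 1; neither detail is confirmed by the source. The names log_softmax, ppl_terms, and perplexity are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def ppl_terms(input_ids, logits):
    """Return (sum of target log-probs, target count) for one sequence.

    logits[t] is assumed to predict input_ids[t + 1].
    """
    logprob_sum = 0.0
    count = 0
    for t in range(len(input_ids) - 1):
        target = input_ids[t + 1]
        # "Gather": pick the log-probability assigned to the actual next token.
        logprob_sum += log_softmax(logits[t])[target]
        count += 1
    return logprob_sum, count

def perplexity(logprob_sum, count):
    """Perplexity is exp of the negative mean target log-probability."""
    return math.exp(-logprob_sum / count)
```

Returning the sum and count separately lets the terms from several chunks or rows be accumulated before the final exponentiation, which fits the chunked processing the pipeline describes.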
Import
# Script executed directly via CLI
python model_diff.py -ma /path/to/model_a -mb /path/to/model_b -ed eval_data.parquet
I/O Contract
| Argument | Type | Required | Description |
|---|---|---|---|
| -ma / --model_a | str | Yes | Path to the reference model directory |
| -mb / --model_b | str | Yes | Path to the comparison model directory |
| -ed / --eval_dataset | str | Yes | Path to Parquet evaluation dataset |
| -er / --eval_rows | int | No (default: 20) | Number of rows to evaluate |
| -el / --eval_length | int | No (default: 2048) | Maximum token count per sample |
| -k / --keep_layers | int | No (default: 0) | Number of initial layers for which B inherits A's hidden state |
| -tkm / --topk_max | int | No (default: 5) | Maximum K for top-K metrics |
| Output Metric | Description |
|---|---|
| rfn_error | Per-layer relative Frobenius norm: ‖B - A‖_F / ‖A‖_F |
| Perplexity (A, B) | Per-model perplexity on evaluation data |
| Top-K accuracy (A, B) | Fraction of targets in top-K predictions for K=1..topk_max |
| Top-K agreement | Fraction of positions with identical top-K sets across models |
| KL divergence | KL(A ‖ B) averaged over tokens and rows |
| MSE | Mean squared error between output probability distributions |
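The output metrics in the table can be illustrated with a small plain-Python sketch. It is not the script's implementation: it works on lists of per-position logits, treats "identical top-K sets" as order-insensitive set equality (an assumption), and adds a small eps inside the logarithms for numerical safety (also an assumption). All function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert one list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k):
    """Indices of the k largest logits, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

def topk_accuracy(logits_rows, targets, k):
    """Fraction of positions whose target token is among the top-k predictions."""
    hits = sum(1 for lg, t in zip(logits_rows, targets) if t in top_k(lg, k))
    return hits / len(targets)

def topk_agreement(logits_a, logits_b, k):
    """Fraction of positions where both models select the same top-k set."""
    same = sum(
        1 for a, b in zip(logits_a, logits_b)
        if set(top_k(a, k)) == set(top_k(b, k))  # order ignored (assumption)
    )
    return same / len(logits_a)

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))

def mse(p, q):
    """Mean squared error between two probability distributions."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)
```

For a per-dataset figure, the KL and MSE values would be computed per position on softmax(logits) from each model and then averaged over tokens and rows, as the table describes.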
Usage Examples
# Compare a quantized model against the original
# python model_diff.py \
# -ma /models/llama-7b \
# -mb /models/llama-7b-4bit-exl2 \
# -ed /data/wikitext-test.parquet \
# -er 50 \
# -el 2048 \
# -tkm 10
# Layer-swap analysis: keep first 5 layers from model A
# python model_diff.py \
# -ma /models/llama-7b \
# -mb /models/llama-7b-4bit-exl2 \
# -ed /data/wikitext-test.parquet \
# -k 5
Related Pages
- Turboderp_org_Exllamav2_FPx_Quantization -- Quantization utilities whose quality can be evaluated with this tool
- Turboderp_org_Exllamav2_Shard -- Model file management for large model comparisons