Implementation:Turboderp org Exllamav2 Model Diff

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Evaluation, Model_Analysis
Last Updated	2026-02-15 00:00 GMT

Overview

CLI tool for performing layer-by-layer comparison between two ExLlamaV2 models, computing relative Frobenius norm error, perplexity, top-K accuracy/agreement, KL divergence, and MSE metrics.

Description

model_diff.py is a standalone script that loads two models (A and B) and compares their hidden state representations and output distributions across evaluation data.

Command-line arguments:

-ma / --model_a -- Path to the first (reference) model.
-mb / --model_b -- Path to the second (comparison) model.
-ed / --eval_dataset -- Path to a Parquet evaluation dataset.
-er / --eval_rows -- Number of dataset rows to evaluate (default: 20).
-el / --eval_length -- Maximum tokens per sample (default: 2048).
-k / --keep_layers -- Number of initial layers where model B uses model A's hidden states (default: 0), enabling layer-swap analysis.
-tkm / --topk_max -- Largest K to test for the top-K metrics (default: 5); K runs from 1 through this value.

Processing pipeline:

1. Both models are loaded lazily (load(lazy=True)) to minimize memory usage. Modules are loaded/unloaded one at a time.

2. Embeddings are computed for all evaluation rows through each model's embedding layer.

3. For each subsequent module (layer), the script:

- Loads the module weights for both models.
- Performs a forward pass through the module for each evaluation row.
- If keep_layers is set and the current layer index is within that range, model B receives model A's hidden state (layer swapping).
- Computes the relative Frobenius norm error (rfn_error) between the two hidden states, ||y - x||_F / ||x||_F, where x is model A's hidden state and y is model B's, averaged across all rows.
- Unloads the module to free memory.
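The per-layer loop in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in model_diff.py: hidden states are plain nested lists rather than tensors, layers are ordinary Python callables, and the names frobenius_norm, compare_layers, scale_a and scale_b are hypothetical. Treating "within that range" as idx < keep_layers is also an assumption.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2-D list."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def rfn_error(x, y):
    """Relative Frobenius norm error ||y - x||_F / ||x||_F (x is the reference)."""
    diff = [[yv - xv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]
    return frobenius_norm(diff) / frobenius_norm(x)

def compare_layers(layers_a, layers_b, states, keep_layers=0):
    """Return the mean rfn_error per layer, averaged over all evaluation rows."""
    states_a = list(states)
    states_b = list(states)
    per_layer = []
    for idx, (layer_a, layer_b) in enumerate(zip(layers_a, layers_b)):
        total = 0.0
        for row in range(len(states)):
            states_a[row] = layer_a(states_a[row])
            if idx < keep_layers:
                # Layer swap (assumed boundary): B continues from A's state.
                states_b[row] = states_a[row]
            else:
                states_b[row] = layer_b(states_b[row])
            total += rfn_error(states_a[row], states_b[row])
        per_layer.append(total / len(states))
    return per_layer

# Hypothetical toy "layers": model A doubles every value, model B triples it.
def scale_a(m):
    return [[2.0 * v for v in row] for row in m]

def scale_b(m):
    return [[3.0 * v for v in row] for row in m]

if __name__ == "__main__":
    rows = [[[1.0, 0.0]]]  # one evaluation row, one position, two features
    print(compare_layers([scale_a, scale_a], [scale_b, scale_b], rows))
    print(compare_layers([scale_a, scale_a], [scale_b, scale_b], rows, keep_layers=1))
```

With keep_layers=1 the first layer reports zero error because model B is reset to model A's state, so any error seen afterwards comes only from the later layers.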

4. After all layers, the script evaluates final outputs:

- Perplexity for both models using log-softmax and gather on target tokens, processed in chunks to manage memory.
- Top-K accuracy for K=1 through topk_max: what fraction of target tokens appear in the top-K predictions.
- Top-K agreement: fraction of positions where models A and B produce identical top-K sets.
- KL divergence between the output probability distributions.
- MSE between the output probability distributions.
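The perplexity and top-K metrics above can be illustrated with a small dependency-free sketch. It is not the script's implementation: logits are plain lists, the function names are hypothetical, the chunk helper returning a (log-probability sum, token count) pair is an assumption suggested by the ppl signature, and top-K agreement is taken here to mean equality of unordered index sets.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def chunk_logprob(logits_rows, targets):
    """Sum of target log-probabilities for one chunk, plus its token count."""
    total = sum(log_softmax(l)[t] for l, t in zip(logits_rows, targets))
    return total, len(targets)

def perplexity(logits_rows, targets, chunk_size=2):
    """exp(-mean target log-probability), accumulated chunk by chunk."""
    logprob_sum, count = 0.0, 0
    for start in range(0, len(targets), chunk_size):
        end = start + chunk_size
        s, n = chunk_logprob(logits_rows[start:end], targets[start:end])
        logprob_sum += s
        count += n
    return math.exp(-logprob_sum / count)

def top_k(logits, k):
    """Indices of the k largest logits, as an unordered set."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])

def topk_accuracy(logits_rows, targets, k):
    """Fraction of positions whose target token is among the top-k predictions."""
    hits = sum(t in top_k(l, k) for l, t in zip(logits_rows, targets))
    return hits / len(targets)

def topk_agreement(logits_a, logits_b, k):
    """Fraction of positions where both models yield the same top-k set."""
    same = sum(top_k(a, k) == top_k(b, k) for a, b in zip(logits_a, logits_b))
    return same / len(logits_a)

if __name__ == "__main__":
    logits_a = [[0.1, 0.9, 0.0], [0.5, 0.2, 0.3]]
    logits_b = [[0.9, 0.1, 0.0], [0.6, 0.1, 0.3]]
    targets = [1, 2]
    for k in (1, 2):
        print(k, topk_accuracy(logits_a, targets, k), topk_agreement(logits_a, logits_b, k))
    print(perplexity(logits_a, targets))
```

Accumulating a sum and a count per chunk keeps only one chunk of log-probabilities in memory at a time while still producing the same perplexity as a single pass.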

5. Results are printed in both CSV format and human-readable format.

Usage

This tool is used to evaluate the quality impact of quantization, pruning, or other model modifications by comparing a modified model against a reference. The layer-by-layer rfn_error shows where divergence accumulates, while the output metrics show the end-to-end impact on generation quality.

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: model_diff.py
Lines: 1-261

Signature

# CLI argument parser
parser = argparse.ArgumentParser(
    description="Test layer-by-layer hidden state difference between two models"
)
parser.add_argument("-ed", "--eval_dataset", type=str)
parser.add_argument("-er", "--eval_rows", type=int, default=20)
parser.add_argument("-el", "--eval_length", type=int, default=2048)
parser.add_argument("-ma", "--model_a", type=str)
parser.add_argument("-mb", "--model_b", type=str)
parser.add_argument("-k", "--keep_layers", type=int, default=0)
parser.add_argument("-tkm", "--topk_max", type=int, default=5)

# Internal helper
def ppl(input_ids_, logits_) -> tuple[float, int]:
    ...

Import

# Script executed directly via CLI
python model_diff.py -ma /path/to/model_a -mb /path/to/model_b -ed eval_data.parquet

I/O Contract

Argument	Type	Required	Description
-ma / --model_a	str	Yes	Path to the reference model directory
-mb / --model_b	str	Yes	Path to the comparison model directory
-ed / --eval_dataset	str	Yes	Path to Parquet evaluation dataset
-er / --eval_rows	int	No (default: 20)	Number of rows to evaluate
-el / --eval_length	int	No (default: 2048)	Maximum token count per sample
-k / --keep_layers	int	No (default: 0)	Layers where B inherits A's state
-tkm / --topk_max	int	No (default: 5)	Maximum K for top-K metrics

Output Metric	Description
rfn_error	Per-layer relative Frobenius norm: ||B - A||_F / ||A||_F
Perplexity (A, B)	Per-model perplexity on evaluation data
Top-K accuracy (A, B)	Fraction of targets in top-K predictions for K=1..topk_max
Top-K agreement	Fraction of positions with identical top-K sets across models
KL divergence	KL(A || B) averaged over tokens and rows
MSE	Mean squared error between output probability distributions
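The two distribution-level metrics in the table can be written out for a single token position as below. This is a sketch under assumptions rather than the script's code: probabilities are plain lists, the function names are hypothetical, and averaging the squared error over vocabulary entries is one plausible reading of "MSE between output probability distributions".

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mse(p, q):
    """Mean squared difference between two probability vectors (assumed averaging)."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)

def mean_over_positions(metric, probs_a, probs_b):
    """Average a per-position metric over all token positions."""
    values = [metric(pa, pb) for pa, pb in zip(probs_a, probs_b)]
    return sum(values) / len(values)

if __name__ == "__main__":
    # Model A is the reference distribution, matching the KL(A || B) direction.
    probs_a = [softmax([1.0, 1.0]), softmax([2.0, 0.0])]
    probs_b = [softmax([1.0, 1.5]), softmax([2.0, 0.5])]
    print(mean_over_positions(kl_divergence, probs_a, probs_b))
    print(mean_over_positions(mse, probs_a, probs_b))
```

KL divergence is asymmetric, so swapping the two models changes the value; MSE is symmetric and bounded, which makes the two metrics complementary.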

Usage Examples

# Compare a quantized model against the original
python model_diff.py \
    -ma /models/llama-7b \
    -mb /models/llama-7b-4bit-exl2 \
    -ed /data/wikitext-test.parquet \
    -er 50 \
    -el 2048 \
    -tkm 10

# Layer-swap analysis: keep first 5 layers from model A
python model_diff.py \
    -ma /models/llama-7b \
    -mb /models/llama-7b-4bit-exl2 \
    -ed /data/wikitext-test.parquet \
    -k 5

Related Pages

Turboderp_org_Exllamav2_FPx_Quantization -- Quantization utilities whose quality can be evaluated with this tool
Turboderp_org_Exllamav2_Shard -- Model file management for large model comparisons
