Implementation:Turboderp org Exllamav2 Model Diff

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Model_Analysis
Last Updated: 2026-02-15 00:00 GMT

Overview

CLI tool for performing layer-by-layer comparison between two ExLlamaV2 models, computing relative Frobenius norm error, perplexity, top-K accuracy/agreement, KL divergence, and MSE metrics.

Description

model_diff.py is a standalone script that loads two models (A and B) and compares their hidden state representations and output distributions across evaluation data.

Command-line arguments:

  • -ma / --model_a -- Path to the first (reference) model.
  • -mb / --model_b -- Path to the second (comparison) model.
  • -ed / --eval_dataset -- Path to a Parquet evaluation dataset.
  • -er / --eval_rows -- Number of dataset rows to evaluate (default: 20).
  • -el / --eval_length -- Maximum tokens per sample (default: 2048).
  • -k / --keep_layers -- Number of initial layers where model B uses model A's hidden states (default: 0), enabling layer-swap analysis.
  • -tkm / --topk_max -- Largest K evaluated for the top-K metrics (default: 5).

Processing pipeline:

1. Both models are loaded lazily (load(lazy=True)) to minimize memory usage. Modules are loaded/unloaded one at a time.

2. Embeddings are computed for all evaluation rows through each model's embedding layer.

3. For each subsequent module (layer), the script:

    • Loads the module weights for both models.
    • Performs a forward pass through the module for each evaluation row.
    • If keep_layers is set and the current layer index is within that range, model B receives model A's hidden state (layer swapping).
    • Computes the relative Frobenius norm error (rfn_error) of model B's hidden state y against model A's hidden state x: ||y - x||_F / ||x||_F, averaged across all rows.
    • Unloads the module to free memory.

4. After all layers, the script evaluates final outputs:

    • Perplexity for both models using log-softmax and gather on target tokens, processed in chunks to manage memory.
    • Top-K accuracy for K=1 through topk_max: the fraction of target tokens that appear in each model's top-K predictions.
    • Top-K agreement: fraction of positions where models A and B produce identical top-K sets.
    • KL divergence between the output probability distributions.
    • MSE between the output probability distributions.

5. Results are printed in both CSV format and human-readable format.
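
The per-layer comparison in step 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the script's actual code: the real tool works on tensors and ExLlamaV2 modules, whereas here hidden states are nested lists, and compare_layers, scale, and the toy layers are hypothetical names introduced only for this example. The swap rule (idx < keep_layers, with B reusing A's output) is also an assumption about how the kept range is applied.

```python
import math


def frobenius_norm(state):
    """Frobenius norm of a hidden state stored as a list of rows."""
    return math.sqrt(sum(v * v for row in state for v in row))


def rfn_error(x, y):
    """Relative Frobenius norm error ||y - x||_F / ||x||_F."""
    diff = [[yv - xv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]
    return frobenius_norm(diff) / frobenius_norm(x)


def compare_layers(layers_a, layers_b, rows, keep_layers=0):
    """Return the mean rfn_error over all rows, one value per layer."""
    states_a = list(rows)
    states_b = list(rows)
    errors = []
    for idx, (layer_a, layer_b) in enumerate(zip(layers_a, layers_b)):
        states_a = [layer_a(s) for s in states_a]
        if idx < keep_layers:
            # Assumed swap rule: inside the kept range, B reuses A's output.
            states_b = list(states_a)
        else:
            states_b = [layer_b(s) for s in states_b]
        per_row = [rfn_error(a, b) for a, b in zip(states_a, states_b)]
        errors.append(sum(per_row) / len(per_row))
    return errors


def scale(factor):
    """Toy stand-in for a model layer: multiply every element by factor."""
    return lambda state: [[v * factor for v in row] for row in state]


rows = [[[1.0, 2.0], [3.0, 4.0]]]      # one evaluation row, 2 tokens x 2 dims
layers_a = [scale(2.0), scale(2.0)]    # stand-in for reference model A
layers_b = [scale(2.0), scale(2.2)]    # model B differs only in its second layer
errors = compare_layers(layers_a, layers_b, rows)
print(errors)  # first layer matches exactly; second is off by roughly 10%
```

In this toy setup the error appears only at the layer where the two models differ, which is the kind of localization the per-layer rfn_error is meant to provide.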

Usage

This tool is used to evaluate the quality impact of quantization, pruning, or other model modifications by comparing a modified model against a reference. The layer-by-layer rfn_error shows where divergence accumulates, while the output metrics show the end-to-end impact on generation quality.
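
As a rough illustration of the end-to-end output metrics, the sketch below computes KL(A || B) and MSE between two sets of per-token logits using only the standard library. The function names, the eps guard against log(0), and the choice to average over token positions are assumptions made for the example; the script's exact reduction may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) for two probability vectors; eps is an assumed guard."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def mse(p, q):
    """Mean squared error between two probability vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)


def compare_outputs(logits_a, logits_b):
    """Average KL(A || B) and MSE over all token positions."""
    kls, mses = [], []
    for la, lb in zip(logits_a, logits_b):
        pa, pb = softmax(la), softmax(lb)
        kls.append(kl_divergence(pa, pb))
        mses.append(mse(pa, pb))
    return sum(kls) / len(kls), sum(mses) / len(mses)


# Two token positions over a 3-token vocabulary (made-up values).
logits_a = [[2.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
logits_b = [[0.0, 1.0, 2.0], [0.5, 0.5, 0.5]]
kl, err = compare_outputs(logits_a, logits_b)
print(kl, err)  # both are positive because the first position disagrees
```

Both values are zero when the two models produce identical distributions, and KL is asymmetric in general, so the order of A and B matters.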

Code Reference

Source Location

Signature

# CLI argument parser
parser = argparse.ArgumentParser(
    description="Test layer-by-layer hidden state difference between two models"
)
parser.add_argument("-ed", "--eval_dataset", type=str)
parser.add_argument("-er", "--eval_rows", type=int, default=20)
parser.add_argument("-el", "--eval_length", type=int, default=2048)
parser.add_argument("-ma", "--model_a", type=str)
parser.add_argument("-mb", "--model_b", type=str)
parser.add_argument("-k", "--keep_layers", type=int, default=0)
parser.add_argument("-tkm", "--topk_max", type=int, default=5)

# Internal helper
def ppl(input_ids_, logits_) -> tuple[float, int]:
    ...

Import

# Script executed directly via CLI
python model_diff.py -ma /path/to/model_a -mb /path/to/model_b -ed eval_data.parquet

I/O Contract

Argument               Type  Required            Description
-ma / --model_a        str   Yes                 Path to the reference model directory
-mb / --model_b        str   Yes                 Path to the comparison model directory
-ed / --eval_dataset   str   Yes                 Path to Parquet evaluation dataset
-er / --eval_rows      int   No (default: 20)    Number of rows to evaluate
-el / --eval_length    int   No (default: 2048)  Maximum token count per sample
-k / --keep_layers     int   No (default: 0)     Layers where B inherits A's state
-tkm / --topk_max      int   No (default: 5)     Maximum K for top-K metrics

Output Metric          Description
rfn_error              Per-layer relative Frobenius norm: ||B - A||_F / ||A||_F
Perplexity (A, B)      Per-model perplexity on evaluation data
Top-K accuracy (A, B)  Fraction of targets in top-K predictions for K=1..topk_max
Top-K agreement        Fraction of positions with identical top-K sets across models
KL divergence          KL(A || B), averaged over tokens and rows
MSE                    Mean squared error between output probability distributions
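
The two top-K metrics can be sketched as follows. This is an illustrative example only: the helper names are hypothetical, agreement is computed on unordered sets as the description states, and breaking ties by lower index is an assumption that may not match the script.

```python
def top_k_ids(logits, k):
    """Indices of the k largest logits (ties broken by lower index)."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]


def top_k_accuracy(logits_seq, targets, k):
    """Fraction of positions whose target token is among the top-k."""
    hits = sum(1 for logits, target in zip(logits_seq, targets)
               if target in top_k_ids(logits, k))
    return hits / len(targets)


def top_k_agreement(logits_a, logits_b, k):
    """Fraction of positions where both models pick the same top-k set."""
    same = sum(1 for la, lb in zip(logits_a, logits_b)
               if set(top_k_ids(la, k)) == set(top_k_ids(lb, k)))
    return same / len(logits_a)


# Two positions over a 4-token vocabulary (made-up values).
logits_a = [[3.0, 2.0, 1.0, 0.0], [0.0, 1.0, 2.0, 3.0]]
logits_b = [[2.0, 3.0, 1.0, 0.0], [0.0, 3.0, 1.0, 2.0]]
targets = [0, 3]

for k in (1, 2):
    print(k,
          top_k_accuracy(logits_a, targets, k),
          top_k_accuracy(logits_b, targets, k),
          top_k_agreement(logits_a, logits_b, k))
```

In this toy data model B misses both targets at K=1 but recovers them at K=2, which shows why reporting a range of K values is more informative than top-1 alone.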

Usage Examples

# Compare a quantized model against the original
python model_diff.py \
    -ma /models/llama-7b \
    -mb /models/llama-7b-4bit-exl2 \
    -ed /data/wikitext-test.parquet \
    -er 50 \
    -el 2048 \
    -tkm 10

# Layer-swap analysis: keep first 5 layers from model A
python model_diff.py \
    -ma /models/llama-7b \
    -mb /models/llama-7b-4bit-exl2 \
    -ed /data/wikitext-test.parquet \
    -k 5
