Implementation:Ggml org Llama cpp Perplexity Result Analysis

From Leeroopedia
Aspect | Detail
Implementation Name | Perplexity Result Analysis
Doc Type | Pattern Doc
Domain | Model Perplexity Evaluation
Purpose | Result computation and reporting patterns for perplexity, HellaSwag accuracy, and KL divergence statistics
Related Workflow | Model_Perplexity_Evaluation

Overview

Description

This implementation documents the result computation and reporting patterns in the llama-perplexity tool. Three distinct result analysis patterns are covered:

  • Perplexity final estimate: Computes the exponential of average NLL with uncertainty bounds
  • HellaSwag confidence intervals: Reports accuracy with Wilson score 95% confidence intervals
  • KL divergence statistics: Comprehensive statistical analysis of distribution divergence between quantized and reference models

Usage

Results are computed inline during the evaluation loop and printed to stdout. The perplexity final estimate is printed after all chunks have been processed. HellaSwag and Winogrande print running accuracy after each task. KL divergence prints per-chunk statistics during computation and comprehensive summary statistics at the end.

Code Reference

Aspect | Detail
Source Location (PPL final) | tools/perplexity/perplexity.cpp:645-654
Source Location (HellaSwag CI) | tools/perplexity/perplexity.cpp:990-1003
Source Location (KL divergence) | tools/perplexity/perplexity.cpp:1840-1975
Import | #include "common.h", #include "llama.h"

Perplexity final estimate (perplexity.cpp:645-654):

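nll2 /= count;                      // mean of squared per-token NLL
nll /= count;                       // mean per-token NLL
const double ppl = exp(nll);        // perplexity = exp(mean NLL)
nll2 -= nll * nll;                  // variance of per-token NLL
if (nll2 > 0) {
    nll2 = sqrt(nll2/(count-1));    // standard error of the mean NLL
    LOG_INF("Final estimate: PPL = %.4lf +/- %.5lf\n", ppl, nll2*ppl);
} else {
    LOG_ERR("Unexpected negative standard deviation of log(prob)\n");
}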

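The reported uncertainty nll2*ppl is the standard error of the perplexity estimate, obtained by first-order error propagation through the exponential: sigma_PPL = PPL * sigma_NLL, where sigma_NLL = sqrt(var(NLL) / (N-1)) is the standard error of the mean NLL over N tokens.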
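
The error propagation above can be sketched as a standalone function. This is a minimal illustration, not code from llama.cpp: the name ppl_with_uncertainty and the choice to take a vector of per-token NLL values are assumptions made for the example, whereas the tool accumulates the sums incrementally.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical helper (not part of llama.cpp): returns {PPL, uncertainty}
// computed from per-token negative log-likelihood values.
static std::pair<double, double> ppl_with_uncertainty(const std::vector<double> & token_nll) {
    if (token_nll.empty()) {
        return {0.0, 0.0}; // nothing evaluated
    }
    const double count = static_cast<double>(token_nll.size());

    double nll  = 0.0; // running sum of NLL
    double nll2 = 0.0; // running sum of squared NLL
    for (const double v : token_nll) {
        nll  += v;
        nll2 += v * v;
    }
    nll  /= count;
    nll2 /= count;

    const double ppl = std::exp(nll);
    const double var = nll2 - nll * nll; // variance of per-token NLL
    if (count < 2.0 || var <= 0.0) {
        return {ppl, 0.0}; // uncertainty is undefined; report zero in this sketch
    }
    const double sigma_nll = std::sqrt(var / (count - 1.0)); // standard error of the mean NLL
    return {ppl, ppl * sigma_nll};                           // sigma_PPL = PPL * sigma_NLL
}
```

Calling ppl_with_uncertainty({1.0, 2.0, 3.0}) gives a perplexity of exp(2) and an uncertainty of exp(2) * sqrt(1/3), which matches the formula stated above.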

HellaSwag Wilson score confidence interval (perplexity.cpp:990-1003):

double freq = acc / double(i + 1);

const double za = 1.95996398454;  // z-score for 95% confidence

// Wilson score interval, more accurate than Wald interval
double z   = za * za / double(i + 1);
double cnf = z * sqrt(double(i + 1) * (4.0 * freq * (1 - freq) + z)) / (za + za);
double a   = (freq + z * 0.5 - cnf) / (1.0 + z);
double b   = (freq + z * 0.5 + cnf) / (1.0 + z);

// Print the accumulated accuracy mean x 100 and confidence interval
LOG("%zu\t%3.8lf%%\t[%3.4lf%%, %3.4lf%%]\n", i + 1, freq * 100.0, a * 100.0, b * 100.0);
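
The excerpt is algebraically the textbook Wilson interval: with p = freq and n = i + 1, the center is (p + za^2/(2n)) / (1 + za^2/n) and the half-width is za * sqrt(p(1-p)/n + za^2/(4n^2)) / (1 + za^2/n). A minimal sketch that wraps the same arithmetic in a reusable function follows; the name wilson_interval and its signature are assumptions for illustration, not part of llama.cpp.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>

// Hypothetical helper (not part of llama.cpp): Wilson score interval for
// `correct` successes out of `n` trials. Returns {lower, upper} as fractions.
static std::pair<double, double> wilson_interval(double correct, std::size_t n,
                                                 double za = 1.95996398454) {
    if (n == 0) {
        return {0.0, 1.0}; // no data: the interval is the whole range
    }
    const double nn   = static_cast<double>(n);
    const double freq = correct / nn;

    // Same arithmetic as the excerpt, with i + 1 replaced by n.
    const double z   = za * za / nn;
    const double cnf = z * std::sqrt(nn * (4.0 * freq * (1.0 - freq) + z)) / (za + za);
    const double a   = (freq + z * 0.5 - cnf) / (1.0 + z);
    const double b   = (freq + z * 0.5 + cnf) / (1.0 + z);
    return {a, b};
}
```

Unlike the Wald interval, the bounds stay inside [0, 1] even when the observed accuracy is exactly 0 or 1: a single correct task out of one yields an upper bound of 1 and a lower bound of 1 / (1 + za^2), roughly 0.207.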

KL divergence per-chunk reporting (perplexity.cpp:1849-1874):

LOG("chunk             PPL               ln(PPL(Q)/PPL(base))          "
    "KL Divergence              Dp RMS            Same top p\n");

// Per-chunk statistics
auto log_ppl = mean_and_uncertainty(kld.sum_nll, kld.sum_nll2, kld.count);
const double ppl_val = exp(log_ppl.first);
const double ppl_unc = ppl_val * log_ppl.second;
LOG("    %9.4lf +/- %9.4lf", ppl_val, ppl_unc);

auto log_ppl_base = mean_and_uncertainty(kld.sum_nll_base, kld.sum_nll_base2, kld.count);
const double log_ppl_ratio_val = log_ppl.first - log_ppl_base.first;
// ...

auto kl_div = mean_and_uncertainty(kld.sum_kld, kld.sum_kld2, kld.count);
LOG("    %10.5lf +/- %10.5lf", kl_div.first, kl_div.second);

const double p_diff_rms_val = sqrt(p_diff_mse.first);
LOG("    %6.3lf +/- %6.3lf %%", 100.0*p_diff_rms_val, 100.0*p_diff_rms_unc);

double p_top_val = 1.*kld.n_same_top/kld.count;
LOG("    %6.3lf +/- %6.3lf %%", 100.0*p_top_val, 100.0*p_top_unc);
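
The excerpt relies on mean_and_uncertainty(sum, sum2, count) returning a pair whose first element is the mean and whose second is its uncertainty, but the helper's body is not shown here. The sketch below is one plausible reading, chosen to be consistent with the final-estimate formula earlier on this page; the name mean_and_stderr is deliberately different, and the real helper may apply other guards for small counts.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for the helper used in the excerpt: turns running
// sums into {mean, standard error of the mean}.
static std::pair<double, double> mean_and_stderr(double sum, double sum2, std::size_t count) {
    if (count == 0) {
        return {0.0, 0.0};
    }
    const double n    = static_cast<double>(count);
    const double mean = sum / n;
    if (count < 2) {
        return {mean, 0.0}; // a single sample carries no spread information
    }
    const double var = sum2 / n - mean * mean; // population variance from the sums
    const double err = var > 0.0 ? std::sqrt(var / (n - 1.0)) : 0.0;
    return {mean, err};
}
```

Under this reading, the per-chunk PPL column is exp(mean) with uncertainty exp(mean) * err, exactly as ppl_val and ppl_unc are derived in the excerpt.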

KL divergence summary statistics (perplexity.cpp:1885-1975):

LOG("====== Perplexity statistics ======\n");
LOG("Mean PPL(Q)                   : %10.6lf +/- %10.6lf\n", ppl_val, ppl_unc);
LOG("Mean PPL(base)                : %10.6lf +/- %10.6lf\n", ppl_base_val, ppl_base_unc);
LOG("Cor(ln(PPL(Q)), ln(PPL(base))): %6.2lf%%\n", 100.0*log_ppl_cor);
LOG("Mean ln(PPL(Q)/PPL(base))     : %10.6lf +/- %10.6lf\n", log_ppl_ratio_val, log_ppl_ratio_unc);
LOG("Mean PPL(Q)/PPL(base)         : %10.6lf +/- %10.6lf\n", ppl_ratio_val, ppl_ratio_unc);
LOG("Mean PPL(Q)-PPL(base)         : %10.6lf +/- %10.6lf\n", ppl_diff_val, ppl_diff_unc);

LOG("====== KL divergence statistics ======\n");
LOG("Mean    KLD: %10.6lf +/- %10.6lf\n", kl_div.first, kl_div.second);
LOG("Maximum KLD: %10.6f\n", kld_values.back());
LOG("99.9%%   KLD: %10.6f\n", percentile(kld_values, 0.999f));
LOG("99.0%%   KLD: %10.6f\n", percentile(kld_values, 0.990f));
LOG("95.0%%   KLD: %10.6f\n", percentile(kld_values, 0.950f));
// ... additional percentiles ...
LOG("Minimum KLD: %10.6f\n", kld_values.front());

LOG("====== Token probability statistics ======\n");
LOG("Mean    Dp: %6.3lf +/- %5.3lf %%\n", 100.0*p_diff.first, 100.0*p_diff.second);
// ... percentile distribution ...
LOG("RMS Dp    : %6.3lf +/- %5.3lf %%\n", 100.0*p_diff_rms_val, 100.0*p_diff_rms_unc);
LOG("Same top p: %6.3lf +/- %5.3lf %%\n", 100.0*same_top_p, 100.0*same_top_unc);

I/O Contract

Perplexity final estimate:

Direction | Name | Type | Description
Input | nll | double | Accumulated negative log-likelihood sum
Input | nll2 | double | Accumulated squared NLL sum (for variance)
Input | count | int | Total number of evaluated tokens
Output | ppl | double | Final perplexity value: exp(nll/count)
Output | uncertainty | double | Standard error of the estimate: ppl * sqrt(var/(count-1))

HellaSwag confidence interval:

Direction | Name | Type | Description
Input | acc | double | Accumulated count of correct predictions
Input | i | size_t | Zero-based index of the current task (i + 1 tasks evaluated so far)
Output | freq | double | Accuracy ratio acc/(i+1)
Output | [a, b] | double, double | Lower and upper bounds of the Wilson score 95% confidence interval

KL divergence statistics:

Direction | Name | Type | Description
Input | kld | struct | Accumulator with sum_nll, sum_nll2, sum_nll_base, sum_kld, sum_p_diff, n_same_top, count
Input | kld_values | vector<float> | Per-token KL divergence values (sorted for percentiles)
Input | p_diff_values | vector<float> | Per-token probability difference values (sorted for percentiles)
Output | (stdout) | text | Three sections: perplexity statistics, KL divergence statistics, token probability statistics
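
To make the per-token entries of kld_values concrete, the sketch below computes a KL divergence for one token position from two logit vectors. It assumes the divergence is taken from the reference distribution to the quantized one, KL(base || Q) = sum_i p_base[i] * (log p_base[i] - log p_q[i]); the function names and the use of raw logit vectors are illustrative choices, not the tool's internal interface.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable log-softmax over a non-empty logit vector.
static std::vector<double> log_softmax(const std::vector<float> & logits) {
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (const float l : logits) {
        sum += std::exp(static_cast<double>(l) - max_logit);
    }
    const double log_z = max_logit + std::log(sum);
    std::vector<double> out;
    out.reserve(logits.size());
    for (const float l : logits) {
        out.push_back(static_cast<double>(l) - log_z);
    }
    return out;
}

// Hypothetical per-token KL(base || Q). Both vectors must be non-empty
// and have the same length (one logit per vocabulary entry).
static double kl_divergence(const std::vector<float> & base_logits,
                            const std::vector<float> & q_logits) {
    const std::vector<double> lp_base = log_softmax(base_logits);
    const std::vector<double> lp_q    = log_softmax(q_logits);
    double kld = 0.0;
    for (std::size_t i = 0; i < lp_base.size(); ++i) {
        kld += std::exp(lp_base[i]) * (lp_base[i] - lp_q[i]);
    }
    return kld;
}
```

Identical logits give a divergence of zero, and the value grows as the quantized model shifts probability mass away from tokens the reference model favors.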

Usage Examples

Example 1: Interpreting standard perplexity output

perplexity: calculating perplexity over 114 chunks, n_ctx=512, batch_size=512, n_seq=4
[1]8.2134,[2]7.9872,[3]7.6543,...,[114]6.3215
Final estimate: PPL = 6.3215 +/- 0.04821

This indicates:

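  • The model achieves a perplexity of 6.32 on the evaluation text (WikiText-2 in this example)
  • The bracketed values are running estimates after each chunk, so the last one matches the final estimate
  • The uncertainty is small (+/- 0.048, under 1% of the estimate), indicating sufficient evaluation data
  • Lower PPL is better; compare across quantization levels of the same model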

Example 2: Interpreting HellaSwag output

task    acc_norm    95% confidence interval
1       100.00000000%   [20.6549%, 100.0000%]
100     76.00000000%    [66.7677%, 83.3087%]
10042   76.12345678%    [75.2797%, 76.9472%]

This indicates:

  • After 10042 tasks, accuracy is 76.12%
  • The 95% confidence interval is [75.28%, 76.95%], a tight bound of about +/- 0.8 percentage points
  • The wide interval at task 1 demonstrates why many tasks are needed

Example 3: Interpreting KL divergence summary

====== Perplexity statistics ======
Mean PPL(Q)                   :   6.543210 +/-   0.048210
Mean PPL(base)                :   6.321540 +/-   0.045230
Mean PPL(Q)-PPL(base)         :   0.221670 +/-   0.012340

====== KL divergence statistics ======
Mean    KLD:   0.003421 +/-   0.000234
Maximum KLD:   2.345678
99.9%   KLD:   0.567890
Median  KLD:   0.001234

====== Token probability statistics ======
RMS Dp    :  1.234 +/- 0.056 %
Same top p: 98.765 +/- 0.123 %

This indicates:

  • Quantized model PPL is 0.22 higher than reference (small degradation)
  • Mean KL divergence is very low (0.003), but maximum is 2.35 (some tokens are severely affected)
  • 98.8% of tokens have the same top prediction, indicating high fidelity
  • RMS probability difference is only 1.2%, suggesting minimal practical impact
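
One detail worth noticing in the numbers above: the uncertainty on PPL(Q)-PPL(base) is 0.012, far below the roughly 0.066 that would result from combining 0.048 and 0.045 in quadrature. That is consistent with both models being scored on the same tokens, so their errors are strongly correlated and largely cancel in the difference. The sketch below illustrates the effect on the mean log-PPL difference; the struct and function names are assumptions for the example, and the tool's exact formula is not shown on this page.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct PairedStats {
    double mean_diff;       // mean(NLL_q) - mean(NLL_base), i.e. ln(PPL(Q)/PPL(base))
    double unc_independent; // standard error if the two runs were uncorrelated
    double unc_paired;      // standard error accounting for the covariance
};

// Hypothetical helper: both vectors hold per-token NLL for the same tokens
// and must have equal length of at least 2.
static PairedStats paired_log_ppl_diff(const std::vector<double> & nll_q,
                                       const std::vector<double> & nll_base) {
    const double n = static_cast<double>(nll_q.size());
    double sq = 0.0, sb = 0.0, sqq = 0.0, sbb = 0.0, sqb = 0.0;
    for (std::size_t i = 0; i < nll_q.size(); ++i) {
        sq  += nll_q[i];
        sb  += nll_base[i];
        sqq += nll_q[i] * nll_q[i];
        sbb += nll_base[i] * nll_base[i];
        sqb += nll_q[i] * nll_base[i];
    }
    const double mq    = sq / n;
    const double mb    = sb / n;
    const double var_q = sqq / n - mq * mq;
    const double var_b = sbb / n - mb * mb;
    const double cov   = sqb / n - mq * mb;

    PairedStats out;
    out.mean_diff       = mq - mb;
    out.unc_independent = std::sqrt(std::max(0.0, var_q + var_b) / (n - 1.0));
    // Var(q - base) = Var(q) + Var(base) - 2 Cov(q, base)
    out.unc_paired      = std::sqrt(std::max(0.0, var_q + var_b - 2.0 * cov) / (n - 1.0));
    return out;
}
```

When the quantized model's per-token NLL tracks the reference closely, the covariance term removes most of the variance, which is why a paired comparison can resolve small degradations that two independent PPL runs could not.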
