Implementation:Ggml org Llama cpp Perplexity Result Analysis

Aspect	Detail
Implementation Name	Perplexity Result Analysis
Doc Type	Pattern Doc
Domain	Model Perplexity Evaluation
Purpose	Result computation and reporting patterns for perplexity, HellaSwag accuracy, and KL divergence statistics
Related Workflow	Model_Perplexity_Evaluation

Overview

Description

This implementation documents the result computation and reporting patterns in the llama-perplexity tool. Three distinct result analysis patterns are covered:

Perplexity final estimate: Computes the exponential of average NLL with uncertainty bounds
HellaSwag confidence intervals: Reports accuracy with Wilson score 95% confidence intervals
KL divergence statistics: Comprehensive statistical analysis of distribution divergence between quantized and reference models

Usage

Results are computed inline during the evaluation loop and printed to stdout. The perplexity final estimate is printed after all chunks have been processed. HellaSwag and Winogrande print running accuracy after each task. KL divergence prints per-chunk statistics during computation and comprehensive summary statistics at the end.

Code Reference

Aspect	Detail
Source Location (PPL final)	`tools/perplexity/perplexity.cpp:645-654`
Source Location (HellaSwag CI)	`tools/perplexity/perplexity.cpp:990-1003`
Source Location (KL divergence)	`tools/perplexity/perplexity.cpp:1840-1975`
Import	`#include "common.h"`, `#include "llama.h"`

Perplexity final estimate (perplexity.cpp:645-654):

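nll2 /= count;                       // mean of squared per-token NLL
nll /= count;                        // mean per-token NLL
const double ppl = exp(nll);         // perplexity = exp(mean NLL)
nll2 -= nll * nll;                   // variance of NLL: E[x^2] - E[x]^2
if (nll2 > 0) {
    nll2 = sqrt(nll2/(count-1));     // standard error of the mean NLL
    LOG_INF("Final estimate: PPL = %.4lf +/- %.5lf\n", ppl, nll2*ppl);
} else {
    LOG_ERR("Unexpected negative standard deviation of log(prob)\n");
}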

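The uncertainty nll2*ppl is the standard error of the perplexity estimate (not the standard deviation of individual token values). It follows from first-order error propagation through the exponential: sigma_PPL = PPL * sigma_NLL, where sigma_NLL = sqrt(var(NLL) / (N-1)) is the standard error of the mean NLL over N tokens.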
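
The error propagation described above can be sketched as a standalone function. This is a minimal illustration rather than the tool's own code: `PplEstimate` and `estimate_ppl` are hypothetical names, and the input is assumed to be a vector of per-token negative log-likelihoods.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result type for this sketch (not part of llama.cpp).
struct PplEstimate {
    double ppl;          // exp(mean NLL)
    double uncertainty;  // ppl * standard error of the mean NLL
};

// Computes perplexity and its propagated uncertainty from per-token NLLs,
// following the same steps as the snippet above: mean, mean of squares,
// variance, standard error, then multiplication by PPL.
PplEstimate estimate_ppl(const std::vector<double> & token_nll) {
    assert(token_nll.size() > 1);  // count-1 appears in the denominator

    double nll  = 0.0;
    double nll2 = 0.0;
    for (double v : token_nll) {
        nll  += v;
        nll2 += v * v;
    }
    const double count = double(token_nll.size());
    nll  /= count;
    nll2 /= count;

    const double ppl = std::exp(nll);
    const double var = nll2 - nll * nll;  // variance of the per-token NLL
    const double se  = var > 0.0 ? std::sqrt(var / (count - 1.0)) : 0.0;
    return { ppl, ppl * se };
}
```

Unlike the original, this sketch returns a zero uncertainty instead of logging an error when the variance is not positive; that choice is an assumption made to keep the example self-contained.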

HellaSwag Wilson score confidence interval (perplexity.cpp:990-1003):

double freq = acc / double(i + 1);

const double za = 1.95996398454;  // z-score for 95% confidence

// Wilson score interval, more accurate than Wald interval
double z   = za * za / double(i + 1);
double cnf = z * sqrt(double(i + 1) * (4.0 * freq * (1 - freq) + z)) / (za + za);
double a   = (freq + z * 0.5 - cnf) / (1.0 + z);
double b   = (freq + z * 0.5 + cnf) / (1.0 + z);

// Print the accumulated accuracy mean x 100 and confidence interval
LOG("%zu\t%3.8lf%%\t[%3.4lf%%, %3.4lf%%]\n", i + 1, freq * 100.0, a * 100.0, b * 100.0);
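
The interval arithmetic above can be wrapped in a small reusable function. The sketch below keeps the same formula and constant; the name `wilson_interval` and its signature are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

// Wilson score interval for `correct` successes out of `n` trials.
// Returns (lower, upper) as fractions in [0, 1].
// Hypothetical helper; the arithmetic mirrors the snippet above.
std::pair<double, double> wilson_interval(double correct, std::size_t n,
                                          double za = 1.95996398454) {
    assert(n > 0);
    const double freq = correct / double(n);
    const double z    = za * za / double(n);  // za^2 / n
    const double cnf  = z * std::sqrt(double(n) * (4.0 * freq * (1.0 - freq) + z)) / (za + za);
    const double lo   = (freq + z * 0.5 - cnf) / (1.0 + z);
    const double hi   = (freq + z * 0.5 + cnf) / (1.0 + z);
    return { lo, hi };
}
```

Calling `wilson_interval(1.0, 1)` (one correct answer out of one task) gives a lower bound of roughly 0.2065 and an upper bound of 1.0, which shows how little a single task constrains the true accuracy.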

KL divergence per-chunk reporting (perplexity.cpp:1849-1874):

LOG("chunk             PPL               ln(PPL(Q)/PPL(base))          "
    "KL Divergence              Dp RMS            Same top p\n");

// Per-chunk statistics
auto log_ppl = mean_and_uncertainty(kld.sum_nll, kld.sum_nll2, kld.count);
const double ppl_val = exp(log_ppl.first);
const double ppl_unc = ppl_val * log_ppl.second;
LOG("    %9.4lf +/- %9.4lf", ppl_val, ppl_unc);

auto log_ppl_base = mean_and_uncertainty(kld.sum_nll_base, kld.sum_nll_base2, kld.count);
const double log_ppl_ratio_val = log_ppl.first - log_ppl_base.first;
// ...

auto kl_div = mean_and_uncertainty(kld.sum_kld, kld.sum_kld2, kld.count);
LOG("    %10.5lf +/- %10.5lf", kl_div.first, kl_div.second);

const double p_diff_rms_val = sqrt(p_diff_mse.first);
LOG("    %6.3lf +/- %6.3lf %%", 100.0*p_diff_rms_val, 100.0*p_diff_rms_unc);

double p_top_val = 1.*kld.n_same_top/kld.count;
LOG("    %6.3lf +/- %6.3lf %%", 100.0*p_top_val, 100.0*p_top_unc);
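
The `mean_and_uncertainty` helper called above is not shown in this excerpt. A plausible sketch, assuming it follows the same mean and standard-error pattern as the final perplexity estimate, is given below under a different name (`mean_and_stderr`) to make clear that it is illustrative and not copied from the source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

// Illustrative stand-in: turns a running sum and a running sum of squares
// into (mean, standard error of the mean). Assumed behavior only.
std::pair<double, double> mean_and_stderr(double sum, double sum2, std::size_t count) {
    assert(sum2 >= 0.0);  // a sum of squares cannot be negative
    if (count < 2) {
        // Too few samples for an uncertainty estimate.
        return { count == 1 ? sum : 0.0, 0.0 };
    }
    const double mean     = sum / double(count);
    const double variance = sum2 / double(count) - mean * mean;
    const double se       = variance > 0.0 ? std::sqrt(variance / double(count - 1)) : 0.0;
    return { mean, se };
}
```

Keeping only the sum, the sum of squares, and the count lets the statistics be updated chunk by chunk without storing every per-token value.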

KL divergence summary statistics (perplexity.cpp:1885-1975):

LOG("====== Perplexity statistics ======\n");
LOG("Mean PPL(Q)                   : %10.6lf +/- %10.6lf\n", ppl_val, ppl_unc);
LOG("Mean PPL(base)                : %10.6lf +/- %10.6lf\n", ppl_base_val, ppl_base_unc);
LOG("Cor(ln(PPL(Q)), ln(PPL(base))): %6.2lf%%\n", 100.0*log_ppl_cor);
LOG("Mean ln(PPL(Q)/PPL(base))     : %10.6lf +/- %10.6lf\n", log_ppl_ratio_val, log_ppl_ratio_unc);
LOG("Mean PPL(Q)/PPL(base)         : %10.6lf +/- %10.6lf\n", ppl_ratio_val, ppl_ratio_unc);
LOG("Mean PPL(Q)-PPL(base)         : %10.6lf +/- %10.6lf\n", ppl_diff_val, ppl_diff_unc);

LOG("====== KL divergence statistics ======\n");
LOG("Mean    KLD: %10.6lf +/- %10.6lf\n", kl_div.first, kl_div.second);
LOG("Maximum KLD: %10.6f\n", kld_values.back());
LOG("99.9%%   KLD: %10.6f\n", percentile(kld_values, 0.999f));
LOG("99.0%%   KLD: %10.6f\n", percentile(kld_values, 0.990f));
LOG("95.0%%   KLD: %10.6f\n", percentile(kld_values, 0.950f));
// ... additional percentiles ...
LOG("Minimum KLD: %10.6f\n", kld_values.front());

LOG("====== Token probability statistics ======\n");
LOG("Mean    Dp: %6.3lf +/- %5.3lf %%\n", 100.0*p_diff.first, 100.0*p_diff.second);
// ... percentile distribution ...
LOG("RMS Dp    : %6.3lf +/- %5.3lf %%\n", 100.0*p_diff_rms_val, 100.0*p_diff_rms_unc);
LOG("Same top p: %6.3lf +/- %5.3lf %%\n", 100.0*same_top_p, 100.0*same_top_unc);
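
The summary relies on a `percentile` helper over `kld_values`, whose body is not shown here. Since the maximum is read from `back()` and the minimum from `front()`, the vector is sorted in ascending order. The sketch below assumes linear interpolation between neighboring elements; both that choice and the name `percentile_sketch` are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Linear-interpolation percentile over an ascending-sorted vector.
// `fraction` is expected in [0, 1]; values outside are clamped to the ends.
// Hypothetical helper, not the tool's implementation.
float percentile_sketch(const std::vector<float> & sorted, float fraction) {
    assert(!sorted.empty());
    if (fraction <= 0.0f) {
        return sorted.front();
    }
    if (fraction >= 1.0f) {
        return sorted.back();
    }
    const float       pos = fraction * float(sorted.size() - 1);
    const std::size_t lo  = std::size_t(std::floor(pos));
    const std::size_t hi  = lo + 1 < sorted.size() ? lo + 1 : lo;
    const float       t   = pos - float(lo);  // weight of the upper neighbor
    return (1.0f - t) * sorted[lo] + t * sorted[hi];
}
```

High percentiles such as 99.9% are useful alongside the mean because a few badly affected tokens can hide behind a small average KL divergence.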

I/O Contract

Perplexity final estimate:

Direction	Name	Type	Description
Input	nll	`double`	Accumulated negative log-likelihood sum
Input	nll2	`double`	Accumulated squared NLL sum (for variance)
Input	count	`int`	Total number of evaluated tokens
Output	ppl	`double`	Final perplexity value: `exp(nll/count)`
Output	uncertainty	`double`	Standard error of the estimate: `ppl * sqrt(var/(count-1))`

HellaSwag confidence interval:

Direction	Name	Type	Description
Input	acc	`double`	Accumulated correct predictions count
Input	i	`size_t`	Zero-based index of the current task; `i + 1` tasks have been evaluated so far
Output	freq	`double`	Accuracy ratio `acc/(i+1)`
Output	[a, b]	`double, double`	Wilson score 95% confidence interval bounds

KL divergence statistics:

Direction	Name	Type	Description
Input	kld	`struct`	Accumulator with `sum_nll`, `sum_nll2`, `sum_nll_base`, `sum_nll_base2`, `sum_kld`, `sum_kld2`, `sum_p_diff`, `n_same_top`, `count`
Input	kld_values	`vector<float>`	Per-token KL divergence values (sorted for percentiles)
Input	p_diff_values	`vector<float>`	Per-token probability difference values (sorted for percentiles)
Output	(stdout)	text	Three sections: perplexity statistics, KL divergence statistics, token probability statistics
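
Each entry of `kld_values` measures how far the quantized model's next-token distribution is from the reference model's at one position. A minimal sketch of that quantity is shown below, assuming the conventional direction KL(base || Q) and two already-normalized probability vectors over the same vocabulary. The name `token_kld` is hypothetical, and the tool itself may work from log-probabilities instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// KL(base || q) = sum_i p_base[i] * log(p_base[i] / p_q[i]).
// Assumes p_q[i] > 0 wherever p_base[i] > 0; otherwise the result is infinite.
// Hypothetical helper for illustration only.
double token_kld(const std::vector<double> & p_base, const std::vector<double> & p_q) {
    assert(p_base.size() == p_q.size());
    double kld = 0.0;
    for (std::size_t i = 0; i < p_base.size(); ++i) {
        if (p_base[i] > 0.0) {  // terms with p_base[i] == 0 contribute nothing
            kld += p_base[i] * std::log(p_base[i] / p_q[i]);
        }
    }
    return kld;
}
```

The result is zero when both distributions are identical and grows as the quantized model shifts probability mass away from tokens the reference model favors.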

Usage Examples

Example 1: Interpreting standard perplexity output

perplexity: calculating perplexity over 114 chunks, n_ctx=512, batch_size=512, n_seq=4
[1]8.2134,[2]7.9872,[3]7.6543,...,[114]6.3215
Final estimate: PPL = 6.3215 +/- 0.04821

This indicates:

The model achieves perplexity 6.32 on the evaluation text (WikiText-2 in this example)
The uncertainty is small (+/- 0.048), indicating sufficient evaluation data
Lower PPL is better; compare across quantization levels of the same model

Example 2: Interpreting HellaSwag output

task    acc_norm    95% confidence interval
1       100.00000000%   [20.6549%, 100.0000%]
100     76.00000000%    [66.7677%, 83.3087%]
10042   76.12345678%    [75.2891%, 76.9578%]

This indicates:

After 10042 tasks, accuracy is 76.12%
95% confidence interval is [75.29%, 76.96%], a tight bound
The wide interval at task 1 demonstrates why many tasks are needed

Example 3: Interpreting KL divergence summary

====== Perplexity statistics ======
Mean PPL(Q)                   :   6.543210 +/-   0.048210
Mean PPL(base)                :   6.321540 +/-   0.045230
Mean PPL(Q)-PPL(base)         :   0.221670 +/-   0.012340

====== KL divergence statistics ======
Mean    KLD:   0.003421 +/-   0.000234
Maximum KLD:   2.345678
99.9%   KLD:   0.567890
Median  KLD:   0.001234

====== Token probability statistics ======
RMS Dp    :  1.234 +/- 0.056 %
Same top p: 98.765 +/- 0.123 %

This indicates:

Quantized model PPL is 0.22 higher than reference (small degradation)
Mean KL divergence is very low (0.003), but maximum is 2.35 (some tokens are severely affected)
98.8% of tokens have the same top prediction, indicating high fidelity
RMS probability difference is only 1.2%, suggesting minimal practical impact
