Implementation:Ggml org Llama cpp Perplexity Result Analysis
| Aspect | Detail |
|---|---|
| Implementation Name | Perplexity Result Analysis |
| Doc Type | Pattern Doc |
| Domain | Model Perplexity Evaluation |
| Purpose | Result computation and reporting patterns for perplexity, HellaSwag accuracy, and KL divergence statistics |
| Related Workflow | Model_Perplexity_Evaluation |
Overview
Description
This implementation documents the result computation and reporting patterns in the llama-perplexity tool. Three distinct result analysis patterns are covered:
- Perplexity final estimate: Computes the exponential of average NLL with uncertainty bounds
- HellaSwag confidence intervals: Reports accuracy with Wilson score 95% confidence intervals
- KL divergence statistics: Comprehensive statistical analysis of distribution divergence between quantized and reference models
Usage
Results are computed inline during the evaluation loop and printed to stdout. The perplexity final estimate is printed after all chunks have been processed. HellaSwag and Winogrande print running accuracy after each task. KL divergence prints per-chunk statistics during computation and comprehensive summary statistics at the end.
Code Reference
| Aspect | Detail |
|---|---|
| Source Location (PPL final) | tools/perplexity/perplexity.cpp:645-654 |
| Source Location (HellaSwag CI) | tools/perplexity/perplexity.cpp:990-1003 |
| Source Location (KL divergence) | tools/perplexity/perplexity.cpp:1840-1975 |
| Import | #include "common.h", #include "llama.h" |
Perplexity final estimate (perplexity.cpp:645-654):
nll2 /= count;
nll /= count;
const double ppl = exp(nll);
nll2 -= nll * nll;
if (nll2 > 0) {
nll2 = sqrt(nll2/(count-1));
LOG_INF("Final estimate: PPL = %.4lf +/- %.5lf\n", ppl, nll2*ppl);
} else {
LOG_ERR("Unexpected negative standard deviation of log(prob)\n");
}
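The uncertainty nll2*ppl is the standard error of the perplexity estimate, obtained by first-order error propagation through the exponential: sigma_PPL = PPL * sigma_NLL, where sigma_NLL = sqrt(var(NLL) / (N-1)) is the standard error of the mean NLL. If the computed variance is not positive, no estimate is printed and an error is logged instead.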
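The same computation can be sketched as a standalone function. This is a minimal illustration, not code from perplexity.cpp: the name `ppl_with_uncertainty` and the vector-of-NLLs interface are assumptions made for the example, whereas the tool accumulates `nll` and `nll2` incrementally during evaluation.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical helper (not part of perplexity.cpp): returns {PPL, uncertainty}
// from per-token negative log-likelihoods, following the steps of the excerpt.
static std::pair<double, double> ppl_with_uncertainty(const std::vector<double> & token_nll) {
    if (token_nll.empty()) {
        return {0.0, 0.0};
    }
    double nll  = 0.0; // running sum of NLL
    double nll2 = 0.0; // running sum of NLL^2
    for (double v : token_nll) {
        nll  += v;
        nll2 += v * v;
    }
    const double count = double(token_nll.size());
    nll  /= count;                          // mean NLL
    nll2 /= count;                          // mean of NLL^2
    const double ppl = std::exp(nll);       // perplexity = exp(mean NLL)
    const double var = nll2 - nll * nll;    // variance of NLL
    // Standard error of the mean NLL; reported as 0 when it cannot be computed.
    const double se = (var > 0 && count > 1) ? std::sqrt(var / (count - 1)) : 0.0;
    return {ppl, se * ppl};                 // propagate through exp(): sigma_PPL = PPL * sigma_NLL
}
```

Unlike the excerpt, this sketch returns an uncertainty of 0 rather than logging an error when the variance is not positive; that choice is only for the sake of a self-contained example.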
HellaSwag Wilson score confidence interval (perplexity.cpp:990-1003):
double freq = acc / double(i + 1);
const double za = 1.95996398454; // z-score for 95% confidence
// Wilson score interval, more accurate than Wald interval
double z = za * za / double(i + 1);
double cnf = z * sqrt(double(i + 1) * (4.0 * freq * (1 - freq) + z)) / (za + za);
double a = (freq + z * 0.5 - cnf) / (1.0 + z);
double b = (freq + z * 0.5 + cnf) / (1.0 + z);
// Print the accumulated accuracy mean x 100 and confidence interval
LOG("%zu\t%3.8lf%%\t[%3.4lf%%, %3.4lf%%]\n", i + 1, freq * 100.0, a * 100.0, b * 100.0);
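The interval arithmetic above can be isolated into a small function for experimentation. The following is a sketch under stated assumptions: `wilson_interval` is a hypothetical name, and it takes the number of evaluated tasks `n` directly rather than the loop index `i` used in the excerpt (so `n` corresponds to `i + 1`).

```cpp
#include <cmath>
#include <cstddef>
#include <utility>

// Hypothetical helper: Wilson score interval {lower, upper} for `correct`
// successes out of `n` trials (n must be > 0). `za` is the two-sided z-score;
// the default matches the 95% value used in the excerpt.
static std::pair<double, double> wilson_interval(double correct, size_t n,
                                                 double za = 1.95996398454) {
    const double freq = correct / double(n);   // observed accuracy
    const double z    = za * za / double(n);   // z^2 / n
    // Half-width before normalization: za * sqrt(p(1-p)/n + za^2/(4 n^2))
    const double cnf  = z * std::sqrt(double(n) * (4.0 * freq * (1 - freq) + z)) / (za + za);
    const double lower = (freq + z * 0.5 - cnf) / (1.0 + z);
    const double upper = (freq + z * 0.5 + cnf) / (1.0 + z);
    return {lower, upper};
}
```

Calling `wilson_interval(76, 100)` gives bounds that are not symmetric around 0.76: the interval center is pulled toward 0.5, which is what keeps the bounds inside [0, 1] even when the observed accuracy is 0% or 100%.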
KL divergence per-chunk reporting (perplexity.cpp:1849-1874):
LOG("chunk PPL ln(PPL(Q)/PPL(base)) "
"KL Divergence Dp RMS Same top p\n");
// Per-chunk statistics
auto log_ppl = mean_and_uncertainty(kld.sum_nll, kld.sum_nll2, kld.count);
const double ppl_val = exp(log_ppl.first);
const double ppl_unc = ppl_val * log_ppl.second;
LOG(" %9.4lf +/- %9.4lf", ppl_val, ppl_unc);
auto log_ppl_base = mean_and_uncertainty(kld.sum_nll_base, kld.sum_nll_base2, kld.count);
const double log_ppl_ratio_val = log_ppl.first - log_ppl_base.first;
// ...
auto kl_div = mean_and_uncertainty(kld.sum_kld, kld.sum_kld2, kld.count);
LOG(" %10.5lf +/- %10.5lf", kl_div.first, kl_div.second);
const double p_diff_rms_val = sqrt(p_diff_mse.first);
LOG(" %6.3lf +/- %6.3lf %%", 100.0*p_diff_rms_val, 100.0*p_diff_rms_unc);
double p_top_val = 1.*kld.n_same_top/kld.count;
LOG(" %6.3lf +/- %6.3lf %%", 100.0*p_top_val, 100.0*p_top_unc);
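The excerpt relies on `mean_and_uncertainty` without showing it. Judging from the call sites (a sum, a sum of squares, and a count in; a mean/uncertainty pair out), a helper of that shape could look like the sketch below. Both functions carry a `_sketch` suffix to make clear they are illustrative assumptions and not the code from perplexity.cpp; in particular, the binomial formula for the "Same top p" uncertainty is an assumption, since the excerpt elides how `p_top_unc` is computed.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>

// Sketch only: mean and standard error of the mean from running sums.
// sum   = sum of x_i, sum2 = sum of x_i^2, count = number of samples.
static std::pair<double, double> mean_and_uncertainty_sketch(double sum, double sum2, size_t count) {
    if (count < 1) {
        return {0.0, 0.0};
    }
    const double mean     = sum / double(count);
    const double variance = sum2 / double(count) - mean * mean;
    // Same form as the final PPL estimate: sqrt(var / (N - 1)).
    const double unc = (variance > 0 && count > 1) ? std::sqrt(variance / double(count - 1)) : 0.0;
    return {mean, unc};
}

// Sketch only (assumed formula): binomial standard error for a fraction
// such as n_same_top / count.
static double fraction_uncertainty_sketch(double fraction, size_t count) {
    return count > 1 ? std::sqrt(fraction * (1.0 - fraction) / double(count - 1)) : 0.0;
}
```

Keeping only running sums means the per-chunk line can be printed at any point in the evaluation without storing per-token values; the per-token vectors are needed only for the percentile statistics in the final summary.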
KL divergence summary statistics (perplexity.cpp:1885-1975):
LOG("====== Perplexity statistics ======\n");
LOG("Mean PPL(Q) : %10.6lf +/- %10.6lf\n", ppl_val, ppl_unc);
LOG("Mean PPL(base) : %10.6lf +/- %10.6lf\n", ppl_base_val, ppl_base_unc);
LOG("Cor(ln(PPL(Q)), ln(PPL(base))): %6.2lf%%\n", 100.0*log_ppl_cor);
LOG("Mean ln(PPL(Q)/PPL(base)) : %10.6lf +/- %10.6lf\n", log_ppl_ratio_val, log_ppl_ratio_unc);
LOG("Mean PPL(Q)/PPL(base) : %10.6lf +/- %10.6lf\n", ppl_ratio_val, ppl_ratio_unc);
LOG("Mean PPL(Q)-PPL(base) : %10.6lf +/- %10.6lf\n", ppl_diff_val, ppl_diff_unc);
LOG("====== KL divergence statistics ======\n");
LOG("Mean KLD: %10.6lf +/- %10.6lf\n", kl_div.first, kl_div.second);
LOG("Maximum KLD: %10.6f\n", kld_values.back());
LOG("99.9%% KLD: %10.6f\n", percentile(kld_values, 0.999f));
LOG("99.0%% KLD: %10.6f\n", percentile(kld_values, 0.990f));
LOG("95.0%% KLD: %10.6f\n", percentile(kld_values, 0.950f));
// ... additional percentiles ...
LOG("Minimum KLD: %10.6f\n", kld_values.front());
LOG("====== Token probability statistics ======\n");
LOG("Mean Dp: %6.3lf +/- %5.3lf %%\n", 100.0*p_diff.first, 100.0*p_diff.second);
// ... percentile distribution ...
LOG("RMS Dp : %6.3lf +/- %5.3lf %%\n", 100.0*p_diff_rms_val, 100.0*p_diff_rms_unc);
LOG("Same top p: %6.3lf +/- %5.3lf %%\n", 100.0*same_top_p, 100.0*same_top_unc);
I/O Contract
Perplexity final estimate:
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | nll | double | Accumulated negative log-likelihood sum |
| Input | nll2 | double | Accumulated squared NLL sum (for variance) |
| Input | count | int | Total number of evaluated tokens |
| Output | ppl | double | Final perplexity value: exp(nll/count) |
| Output | uncertainty | double | Standard error of the estimate: ppl * sqrt(var/(count-1)) |
HellaSwag confidence interval:
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | acc | double | Accumulated correct predictions count |
| Input | i | size_t | Zero-based index of the current task; i + 1 tasks have been evaluated so far |
| Output | freq | double | Accuracy ratio acc/(i+1) |
| Output | [a, b] | double, double | Wilson score 95% confidence interval bounds |
KL divergence statistics:
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | kld | struct | Accumulator with sum_nll, sum_nll2, sum_nll_base, sum_kld, sum_p_diff, n_same_top, count |
| Input | kld_values | vector<float> | Per-token KL divergence values (sorted for percentiles) |
| Input | p_diff_values | vector<float> | Per-token probability difference values (sorted for percentiles) |
| Output | (stdout) | text | Three sections: perplexity statistics, KL divergence statistics, token probability statistics |
Usage Examples
Example 1: Interpreting standard perplexity output
perplexity: calculating perplexity over 114 chunks, n_ctx=512, batch_size=512, n_seq=4
[1]8.2134,[2]7.9872,[3]7.6543,...,[114]6.3215
Final estimate: PPL = 6.3215 +/- 0.04821
This indicates:
- The model achieves perplexity 6.32 on the evaluation text (WikiText-2 in this example)
- The uncertainty is small (+/- 0.048, under 1% of the estimate), indicating sufficient evaluation data
- Lower PPL is better; compare across quantization levels of the same model
Example 2: Interpreting HellaSwag output
task acc_norm 95% confidence interval
1 100.00000000% [20.6549%, 100.0000%]
100 76.00000000% [66.7677%, 83.3087%]
10042 76.12345678% [75.2891%, 76.9578%]
This indicates:
- After 10042 tasks, accuracy is 76.12%
- 95% confidence interval is [75.29%, 76.96%], a tight bound
- The wide interval at task 1 demonstrates why many tasks are needed
Example 3: Interpreting KL divergence summary
====== Perplexity statistics ======
Mean PPL(Q) : 6.543210 +/- 0.048210
Mean PPL(base) : 6.321540 +/- 0.045230
Mean PPL(Q)-PPL(base) : 0.221670 +/- 0.012340
====== KL divergence statistics ======
Mean KLD: 0.003421 +/- 0.000234
Maximum KLD: 2.345678
99.9% KLD: 0.567890
Median KLD: 0.001234
====== Token probability statistics ======
RMS Dp : 1.234 +/- 0.056 %
Same top p: 98.765 +/- 0.123 %
This indicates:
- Quantized model PPL is 0.22 higher than reference (small degradation)
- Mean KL divergence is very low (0.003), but maximum is 2.35 (some tokens are severely affected)
- 98.8% of tokens have the same top prediction, indicating high fidelity
- RMS probability difference is only 1.2%, suggesting minimal practical impact