Principle:Ggml org Llama cpp Result Analysis

Aspect	Detail
Principle Name	Result Analysis
Domain	Model Perplexity Evaluation
Scope	Interpreting evaluation results: perplexity scores, accuracy with confidence intervals, KL divergence
Related Workflow	Model_Perplexity_Evaluation

Overview

Description

After evaluation is complete, the results must be properly analyzed and interpreted to draw meaningful conclusions about model quality. This involves understanding perplexity values and their uncertainty bounds, interpreting accuracy scores with confidence intervals for benchmark tasks, and analyzing KL divergence statistics for quantization quality assessment. Proper result analysis is essential for making informed decisions about model selection, quantization tradeoffs, and deployment readiness.

Usage

Result analysis is the final step in the evaluation workflow. The perplexity tool outputs results to stdout during and after computation. The user interprets these results to compare models, assess quantization quality, or validate model behavior.

Theoretical Basis

Perplexity Score Interpretation

Perplexity value:

The final perplexity is computed as:

PPL = exp(mean_NLL)

where mean_NLL = sum(NLL_i) / N is the average negative log-likelihood across all evaluated tokens. A perplexity of P means the model is, on average, as uncertain as if it were choosing uniformly among P options at each step.
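
As a minimal sketch of the formula above (not the tool's own code), the following computes perplexity from the probabilities a model assigned to each correct token. The function name `perplexity` and the sample inputs are hypothetical, chosen only to illustrate the "uniform choice among P options" reading.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability assigned to each correct token."""
    # Per-token negative log-likelihood, NLL_i = -ln(p_i).
    nll = [-math.log(p) for p in token_probs]
    mean_nll = sum(nll) / len(nll)
    return math.exp(mean_nll)

# Illustrative inputs (made up): a model that is always uniform over
# 8 options, and a model that gives the correct token probability 0.9.
uniform_ppl = perplexity([1 / 8] * 100)
confident_ppl = perplexity([0.9] * 100)
```

In this sketch the uniform case yields a perplexity of 8, matching the interpretation above, while confident and correct predictions pull the value toward 1.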

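Typical ranges (a rough guide only; perplexity values are comparable only between models evaluated with the same tokenizer, dataset, and context length):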

PPL < 5: Excellent model quality (large models on domain-matched text)
PPL 5-10: Good quality (typical for 7B-13B parameter models on WikiText-2)
PPL 10-20: Acceptable (smaller models or aggressive quantization)
PPL > 20: Poor quality (very small models, extreme quantization, or domain mismatch)

Uncertainty estimation:

The perplexity tool reports uncertainty as PPL +/- sigma_PPL, computed via:

sigma_NLL = sqrt(var(NLL) / (N-1))

sigma_PPL = PPL * sigma_NLL

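This is first-order error propagation through the exponential function: because PPL = exp(mean_NLL), a small error sigma_NLL in the mean translates to an error of PPL * sigma_NLL in the perplexity. The uncertainty shrinks roughly as 1/sqrt(N), so evaluating more tokens (larger datasets or more chunks) gives tighter bounds.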
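
The uncertainty formulas can be sketched as follows. This is an illustration under one assumption: var(NLL) is taken to be the plain variance (mean of squares minus squared mean) of the per-token values. The function name `ppl_with_uncertainty` and the sample data are hypothetical.

```python
import math

def ppl_with_uncertainty(nll):
    """Return (PPL, sigma_PPL) from a list of per-token NLL values."""
    n = len(nll)
    mean_nll = sum(nll) / n
    # Assumed definition of var(NLL): mean of squares minus squared mean.
    var_nll = sum(x * x for x in nll) / n - mean_nll ** 2
    sigma_nll = math.sqrt(max(var_nll, 0.0) / (n - 1))
    ppl = math.exp(mean_nll)
    # First-order propagation through exp(): sigma_PPL = PPL * sigma_NLL.
    return ppl, ppl * sigma_nll

# Made-up NLL values; the second call uses four times as many tokens.
ppl_small, sigma_small = ppl_with_uncertainty([1.0, 2.0, 3.0] * 100)
ppl_large, sigma_large = ppl_with_uncertainty([1.0, 2.0, 3.0] * 400)
```

With four times as many tokens drawn from the same distribution, the perplexity is unchanged and the reported uncertainty is about half as large.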

HellaSwag and Winogrande Accuracy

Accuracy metric:

For benchmark evaluations, accuracy is the fraction of tasks where the model's top prediction matches the gold label:

acc = n_correct / n_total

Wilson score confidence interval:

Rather than the simpler Wald interval, llama.cpp uses the Wilson score interval for 95% confidence, which provides more accurate coverage for small sample sizes and probabilities near 0 or 1. With p = acc and n = n_total:

z = 1.96 (z-score for 95% confidence)

z_adj = z^2 / n

cnf = z_adj * sqrt(n * (4p(1-p) + z_adj)) / (2z)

lower = (p + z_adj/2 - cnf) / (1 + z_adj)

upper = (p + z_adj/2 + cnf) / (1 + z_adj)

The confidence interval narrows roughly as 1/sqrt(n) as more tasks are evaluated, converging to a tight bound around the true accuracy.
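
The interval formulas above transcribe directly into a few lines of Python. This is a sketch of the arithmetic only, not the tool's source; the function name `wilson_interval` and the task counts are hypothetical.

```python
import math

def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval (lower, upper) for an accuracy estimate."""
    p = n_correct / n_total
    z_adj = z * z / n_total
    cnf = z_adj * math.sqrt(n_total * (4 * p * (1 - p) + z_adj)) / (2 * z)
    lower = (p + z_adj / 2 - cnf) / (1 + z_adj)
    upper = (p + z_adj / 2 + cnf) / (1 + z_adj)
    return lower, upper

# Made-up counts: the same 75% accuracy at two sample sizes.
few_lo, few_hi = wilson_interval(75, 100)
many_lo, many_hi = wilson_interval(7500, 10000)
# Edge case: zero correct answers still gives a usable interval.
zero_lo, zero_hi = wilson_interval(0, 50)
```

Unlike the Wald interval, the bounds stay inside [0, 1] even when the observed accuracy is exactly 0 or 1, which is the practical reason for preferring the Wilson form on small task sets.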

KL Divergence Analysis

KL divergence quantifies how much the output distribution of a quantized model diverges from a reference (typically FP16) model. The analysis provides multiple complementary statistics:

Perplexity statistics:

Mean PPL(Q): Perplexity of the quantized model, with uncertainty
Mean PPL(base): Perplexity of the reference model, with uncertainty
Cor(ln(PPL(Q)), ln(PPL(base))): Correlation between the two models' per-token negative log-likelihoods (the quantities averaged to obtain ln(PPL)), typically very high (>99%) for modest quantization
Mean ln(PPL(Q)/PPL(base)): Log-ratio of perplexities, indicating relative quality loss
Mean PPL(Q)-PPL(base): Absolute perplexity difference
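
A rough sketch of how the point estimates in this list relate to one another, assuming paired negative log-likelihood samples from both models over the same text. The function name `compare_perplexity`, the dictionary keys, and the sample values are hypothetical, and the uncertainty terms the tool reports are omitted.

```python
import math

def compare_perplexity(nll_q, nll_base):
    """Point estimates comparing a quantized model against a reference."""
    n = len(nll_q)
    mean_q = sum(nll_q) / n
    mean_b = sum(nll_base) / n
    # Pearson correlation of the paired samples.
    cov = sum(a * b for a, b in zip(nll_q, nll_base)) / n - mean_q * mean_b
    var_q = sum(a * a for a in nll_q) / n - mean_q ** 2
    var_b = sum(b * b for b in nll_base) / n - mean_b ** 2
    ppl_q, ppl_b = math.exp(mean_q), math.exp(mean_b)
    return {
        "ppl_q": ppl_q,
        "ppl_base": ppl_b,
        "cor": cov / math.sqrt(var_q * var_b),
        # ln(PPL(Q)/PPL(base)) is the difference of the mean NLLs.
        "log_ratio": mean_q - mean_b,
        "ppl_diff": ppl_q - ppl_b,
    }

# Made-up data: the quantized model is uniformly 0.1 nats worse.
base = [1.0, 2.0, 3.0, 4.0]
stats = compare_perplexity([x + 0.1 for x in base], base)
```

Because ln(PPL) is the mean NLL, the log-ratio is simply the difference of the two means, which makes it less sensitive to the absolute perplexity level than the raw difference PPL(Q)-PPL(base).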

KL divergence statistics:

Mean KLD: Average KL divergence across all tokens, measuring information loss from quantization
Percentile distribution: From minimum through 0.1%, 1%, 5%, 10%, median, 90%, 95%, 99%, 99.9%, to maximum KLD, revealing the distribution of per-token divergence
Median KLD: More robust than mean to outliers
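
The KLD statistics listed above can be illustrated with a small sketch. It assumes the divergence is taken from the reference distribution to the quantized one, KL(base || Q), measured in nats, and that the quantized probabilities are strictly positive (as softmax outputs are). The helper names and the sample values are hypothetical.

```python
import math

def kl_divergence(p_base, p_quant):
    """KL(base || quantized) for one token position, in nats."""
    return sum(p * math.log(p / q) for p, q in zip(p_base, p_quant) if p > 0)

def percentile(sorted_vals, frac):
    """Linear-interpolation percentile of a sorted list, frac in [0, 1]."""
    pos = frac * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def summarize_kld(klds):
    """Mean, median, and maximum of per-token KLD values."""
    s = sorted(klds)
    return {
        "mean": sum(s) / len(s),
        "median": percentile(s, 0.5),
        "max": s[-1],
    }

# Made-up per-token values: mostly tiny divergence plus one severe outlier.
summary = summarize_kld([0.001] * 999 + [2.0])
identical = kl_divergence([0.5, 0.5], [0.5, 0.5])
shifted = kl_divergence([0.9, 0.1], [0.5, 0.5])
```

In this made-up sample the single outlier roughly triples the mean while leaving the median untouched, which is why the two are reported side by side.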

Token probability statistics:

Mean delta-p: Average change in the probability assigned to the correct next token
RMS delta-p: Root mean square of probability changes; because each change is squared, positive and negative shifts do not cancel as they do in the mean
Same top p: Percentage of tokens where the quantized and reference models agree on the most likely next token
Percentile distribution: Detailed distribution of per-token probability changes
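
A minimal sketch of the token probability statistics, assuming delta-p is defined as the quantized model's probability for the correct token minus the reference model's. The sign convention, the function name `token_probability_stats`, and the inputs are assumptions made for illustration.

```python
import math

def token_probability_stats(p_true_base, p_true_quant, top_base, top_quant):
    """Return (mean delta-p, RMS delta-p, same-top-token fraction)."""
    n = len(p_true_base)
    # Assumed sign convention: quantized minus reference.
    deltas = [q - b for q, b in zip(p_true_quant, p_true_base)]
    mean_dp = sum(deltas) / n
    rms_dp = math.sqrt(sum(d * d for d in deltas) / n)
    same_top = sum(a == b for a, b in zip(top_base, top_quant)) / n
    return mean_dp, rms_dp, same_top

# Made-up data: one token gains 0.1 probability, another loses 0.1,
# and the two models agree on the top token for one of the two positions.
mean_dp, rms_dp, same_top = token_probability_stats(
    p_true_base=[0.5, 0.5],
    p_true_quant=[0.6, 0.4],
    top_base=[3, 7],
    top_quant=[3, 9],
)
```

Here the mean delta-p is approximately zero even though every token's probability moved by 0.1, while the RMS value registers the change, so the two figures should be read together.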

These statistics together paint a comprehensive picture of quantization quality. For example, a model might have a small mean KLD but a large 99.9th percentile KLD, indicating that quantization occasionally causes severe prediction errors even if the average case is good.
