Principle:Ggml org Llama cpp Perplexity Computation

Aspect	Detail
Principle Name	Perplexity Computation
Domain	Model Perplexity Evaluation
Scope	Theory of perplexity as a model quality metric: cross-entropy, negative log-likelihood, HellaSwag, and Winogrande evaluation methodologies
Related Workflow	Model_Perplexity_Evaluation
Core	Yes

Overview

Description

Perplexity is the primary metric for evaluating language model quality in llama.cpp. It quantifies how well a model predicts the next token in a sequence of natural language text. Lower perplexity indicates that the model assigns higher probability to the correct next token on average. In addition to standard perplexity, llama.cpp implements HellaSwag and Winogrande benchmarks that evaluate commonsense reasoning and coreference resolution through normalized log-probability scoring.

Usage

Perplexity computation is the core evaluation step that runs after the model has been loaded and the dataset has been read. The computation processes the dataset in fixed-size chunks, computing the model's predictions for each token and accumulating the negative log-likelihood. The final perplexity is the exponential of the average negative log-likelihood.

Theoretical Basis

Standard Perplexity

Definition:

Perplexity (PPL) is defined as the exponential of the average negative log-likelihood (NLL) of the model's predictions:

PPL = exp(NLL) = exp(-1/N * sum(log P(w_i | w_<i)))

where:

N is the total number of predicted tokens
w_i is the i-th token in the sequence
w_<i is the sequence of all tokens preceding w_i
P(w_i | w_<i) is the model's predicted probability of token w_i given its context
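
As a quick numeric check of the definition, the following minimal Python sketch computes perplexity from a list of per-token natural-log probabilities. The function name `perplexity` and the input format are illustrative assumptions, not part of llama.cpp.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for the predicted tokens.

    token_logprobs: natural-log probabilities log P(w_i | w_<i), one per token.
    """
    if not token_logprobs:
        raise ValueError("need at least one predicted token")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# Hypothetical model that gives the correct token probability 1/4 every time.
uniform_over_four = [math.log(0.25)] * 8
print(perplexity(uniform_over_four))
```

A model that assigns probability 1/4 to the correct token at every step has a perplexity of about 4, which is why perplexity is often read as an effective number of equally likely choices per token.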

Chunked Computation:

Since the evaluation dataset is much longer than the model's context window, the data is divided into non-overlapping chunks of size n_ctx. For each chunk, the model processes the entire window, but only the second half of the window contributes to the perplexity calculation. This half-window scoring (following the Hugging Face perplexity methodology) ensures that every evaluated token has at least n_ctx/2 tokens of prior context. It avoids the artificially high perplexity that would result from scoring the first tokens in a window, where little or no context is available.

For a context window of 512 tokens (zero-based positions 0 through 511), the model's outputs at positions 256 through 510 are scored, each against the token that follows it. This gives 255 predictions per chunk, each conditioned on between 257 and 511 tokens of context.
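
The chunk layout can be sketched in Python as follows. The helper names are hypothetical, positions are assumed to be zero-based, and the sketch assumes that the logits at position j are scored against the token at position j + 1 and that a trailing partial chunk is dropped.

```python
def split_chunks(tokens, n_ctx):
    """Split tokens into non-overlapping chunks of exactly n_ctx tokens."""
    n_chunk = len(tokens) // n_ctx  # assumption: leftover tokens are ignored
    return [tokens[i * n_ctx:(i + 1) * n_ctx] for i in range(n_chunk)]


def scored_positions(n_ctx):
    """Return (logit_position, target_position) pairs scored in one chunk."""
    first = n_ctx // 2  # only the second half of the window is scored
    return [(j, j + 1) for j in range(first, n_ctx - 1)]


chunks = split_chunks(list(range(1300)), n_ctx=512)
pairs = scored_positions(512)
print(len(chunks))                      # 2
print(len(pairs), pairs[0], pairs[-1])  # 255 (256, 257) (510, 511)
```

Because chunks do not overlap and only the second half of each is scored, roughly half of the dataset's tokens serve as context only and never contribute to the reported perplexity.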

BOS Token Handling:

A beginning-of-sequence (BOS) token is placed at the start of each chunk if the model's vocabulary requires it (checked via llama_vocab_get_add_bos()). The token is not inserted: the chunk's original first token is temporarily overwritten with the BOS token and restored after decoding, so the chunk length stays at n_ctx.
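
The replace-and-restore step can be sketched in Python with a context manager. This is an illustrative sketch only: `bos_at_chunk_start`, the token IDs, and the `add_bos` flag (standing in for the result of llama_vocab_get_add_bos()) are assumptions made for the example.

```python
from contextlib import contextmanager


@contextmanager
def bos_at_chunk_start(tokens, start, bos_id, add_bos):
    """Temporarily overwrite tokens[start] with BOS, then restore it."""
    original = tokens[start]
    if add_bos:
        tokens[start] = bos_id
    try:
        yield tokens
    finally:
        # Restore even if decoding raises, so later chunks see real data.
        tokens[start] = original


tokens = [10, 11, 12, 13, 14, 15]  # made-up token IDs
with bos_at_chunk_start(tokens, start=3, bos_id=1, add_bos=True):
    seen_by_model = tokens[3:6]    # what the decode step would receive
restored = tokens[3:6]
print(seen_by_model, restored)     # [1, 14, 15] [13, 14, 15]
```

Overwriting in place keeps every chunk at the same length, at the cost of never presenting the chunk's true first token to the model.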

Parallel Sequence Processing:

For efficiency, multiple chunks can be processed simultaneously as parallel sequences. With batch size B and context size C, the number of parallel sequences is n_seq = B / C (integer division). Each sequence is assigned its own sequence ID in the KV cache, so the chunks are evaluated independently of one another.
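
A minimal Python sketch of how chunks might be assigned to sequence IDs is shown below. The function `plan_parallel_chunks` is hypothetical, and clamping n_seq to at least 1 (for the case B < C) is an assumption of the sketch rather than something stated above.

```python
def plan_parallel_chunks(n_batch, n_ctx, n_chunks):
    """Group chunk indices into rounds of up to n_seq parallel sequences."""
    n_seq = max(1, n_batch // n_ctx)  # assumption: never fewer than one
    rounds = []
    for first_chunk in range(0, n_chunks, n_seq):
        in_round = min(n_seq, n_chunks - first_chunk)
        # Each entry is (sequence_id, chunk_index) for one decode round.
        rounds.append([(seq_id, first_chunk + seq_id)
                       for seq_id in range(in_round)])
    return n_seq, rounds


n_seq, rounds = plan_parallel_chunks(n_batch=2048, n_ctx=512, n_chunks=6)
print(n_seq)   # 4
print(rounds)  # [[(0, 0), (1, 1), (2, 2), (3, 3)], [(0, 4), (1, 5)]]
```

Sequence IDs are reused from round to round, which is only safe if the cache entries for those sequences are cleared between rounds.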

Uncertainty Estimation:

The uncertainty of the reported perplexity is estimated from the variance of the per-token negative log-likelihoods. With nll_i = -log P(w_i | w_<i), the standard error of the mean NLL is:

sigma_NLL = sqrt(var(nll_i) / (N-1))

sigma_PPL = PPL * sigma_NLL (via first-order error propagation through the exponential)
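
The two formulas above can be sketched in Python as follows. The function name and the choice to compute the variance as E[x^2] - E[x]^2 from running sums are assumptions of this sketch.

```python
import math


def ppl_with_uncertainty(token_nlls):
    """Return (PPL, sigma_PPL) from per-token negative log-likelihoods."""
    n = len(token_nlls)
    if n < 2:
        raise ValueError("need at least two tokens to estimate uncertainty")
    mean = sum(token_nlls) / n
    mean_sq = sum(x * x for x in token_nlls) / n
    # Clamp at zero to guard against tiny negative values from rounding.
    var = max(mean_sq - mean * mean, 0.0)
    sigma_nll = math.sqrt(var / (n - 1))
    ppl = math.exp(mean)
    return ppl, ppl * sigma_nll


ppl, sigma_ppl = ppl_with_uncertainty([0.0, 2.0])
print(ppl, sigma_ppl)
```

For the two-token example the mean NLL is 1 and the variance is 1, so both the perplexity and its uncertainty come out close to e. The error propagation is a linear approximation, so it is most trustworthy when sigma_NLL is small.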

HellaSwag Evaluation

Methodology:

HellaSwag evaluates commonsense reasoning through a 4-way multiple choice format. For each task:

The context and four possible endings are each tokenized
The common prefix shared by all four continuations is identified to save computation
All four sequences are evaluated in parallel, with the common prefix stored once in the KV cache and shared by all four
For each ending, the normalized log-probability (average log-prob per token) is computed
The ending with the highest normalized log-probability is selected as the model's answer
If the selected ending matches the gold label, it counts as a correct answer
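
The steps above can be sketched in Python as follows. The helper names and the toy token IDs and log-probabilities are made up for illustration, and the sketch assumes that normalization averages over the tokens that follow the common prefix.

```python
def common_prefix_len(seqs):
    """Length of the token prefix shared by every sequence in seqs."""
    shortest = min(len(s) for s in seqs)
    for i in range(shortest):
        if any(s[i] != seqs[0][i] for s in seqs):
            return i
    return shortest


def pick_ending(ending_logprobs):
    """Return (best_index, scores) using the mean log-prob per token.

    ending_logprobs: one list of per-token log-probs for each ending,
    covering only the tokens after the common prefix (an assumption).
    """
    scores = [sum(lp) / len(lp) for lp in ending_logprobs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


seqs = [[1, 2, 3, 7, 8], [1, 2, 3, 9], [1, 2, 3, 4, 5, 6], [1, 2, 3, 7, 9]]
prefix = common_prefix_len(seqs)
best, scores = pick_ending([[-1.0, -1.0], [-0.5, -0.7, -0.6],
                            [-2.0], [-0.9, -0.9, -0.9, -0.9]])
gold = 1                       # made-up gold label
print(prefix, best, best == gold)
```

Averaging per token rather than summing keeps longer endings from being penalized simply for containing more tokens.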

Accuracy with Confidence Intervals:

The accuracy is reported as a percentage along with a 95% Wilson score confidence interval, which is more accurate than the Wald interval for small sample sizes or extreme probabilities:

z = 1.96 (for 95% confidence)

z_adj = z^2 / n

cnf = z_adj * sqrt(n * (4 * p * (1-p) + z_adj)) / (2 * z)

CI = [(p + z_adj/2 - cnf) / (1 + z_adj), (p + z_adj/2 + cnf) / (1 + z_adj)]
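
The interval can be computed directly from the formulas above. The Python sketch below mirrors them term by term; the function name `wilson_interval` is an illustrative choice.

```python
import math


def wilson_interval(n_correct, n_total, z=1.96):
    """Return the (low, high) Wilson score interval for an accuracy."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    z_adj = z * z / n_total
    cnf = z_adj * math.sqrt(n_total * (4 * p * (1 - p) + z_adj)) / (2 * z)
    low = (p + z_adj / 2 - cnf) / (1 + z_adj)
    high = (p + z_adj / 2 + cnf) / (1 + z_adj)
    return low, high


low, high = wilson_interval(50, 100)
print(f"{100 * low:.2f}% .. {100 * high:.2f}%")
```

Unlike the Wald interval, the Wilson interval stays within [0, 1] even when the observed accuracy is 0% or 100%, and its center is pulled slightly toward 50%.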

Batched Processing:

Multiple HellaSwag tasks are batched together to maximize GPU utilization. Each task uses 4 sequence IDs (one per ending), and as many tasks as possible are packed into the available context window. The common prefix sharing reduces the total token count per task.
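
To illustrate the saving from prefix sharing and one possible packing strategy, here is a small Python sketch. Both helpers are hypothetical, and the greedy in-order packing against a fixed token budget is an assumption of the sketch, not a description of the actual scheduler.

```python
def task_token_cost(seq_lengths, prefix_len):
    """Tokens needed for one task when the common prefix is stored once."""
    return prefix_len + sum(n - prefix_len for n in seq_lengths)


def pack_tasks(costs, token_budget):
    """Greedily group consecutive task indices without exceeding the budget."""
    batches, current, used = [], [], 0
    for task_id, cost in enumerate(costs):
        if cost > token_budget:
            raise ValueError(f"task {task_id} does not fit in the budget")
        if used + cost > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(task_id)
        used += cost
    if current:
        batches.append(current)
    return batches


lengths = [14, 13, 16, 15]             # made-up lengths of four sequences
print(sum(lengths), task_token_cost(lengths, prefix_len=10))  # 58 28
print(pack_tasks([40, 50, 30, 60, 20], token_budget=100))
# [[0, 1], [2, 3], [4]]
```

In this toy case, sharing a 10-token prefix cuts the cost of the task from 58 tokens to 28, which lets more tasks fit into each batch.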

Winogrande Evaluation

Methodology:

Winogrande tasks present a sentence with a blank and two possible fill-in options. The evaluation:

Tokenizes both options within their sentence context
Identifies the common prefix between the two sequences
Evaluates both sequences, with the common prefix stored once in the KV cache and shared by both
Compares the log-probabilities of the differing portions
Selects the option with higher probability

The scoring uses the same batched evaluation approach as HellaSwag, with 2 sequences per task instead of 4.
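
A two-option comparison along these lines might look like the Python sketch below. The function names and toy values are made up, and averaging the log-probabilities over the differing portion (rather than summing them) is an assumption of the sketch.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_option(tokens_1, logprobs_1, tokens_2, logprobs_2):
    """Return (choice, prefix_len), where choice is 1 or 2.

    logprobs_k[i] is the log-probability the model gave to tokens_k[i].
    """
    base = shared_prefix_len(tokens_1, tokens_2)
    tail_1, tail_2 = logprobs_1[base:], logprobs_2[base:]
    if not tail_1 or not tail_2:
        raise ValueError("the two sequences must differ after the prefix")
    score_1 = sum(tail_1) / len(tail_1)  # assumption: mean, not sum
    score_2 = sum(tail_2) / len(tail_2)
    return (1 if score_1 >= score_2 else 2), base


choice, base = pick_option(
    [5, 6, 7, 20, 9, 10], [-1.0, -1.0, -1.0, -2.0, -0.5, -0.5],
    [5, 6, 7, 21, 22, 9, 10], [-1.0, -1.0, -1.0, -0.4, -0.6, -0.5, -0.5],
)
print(choice, base)  # 2 3
```

Only the tokens after the shared prefix affect the result, since the prefix contributes identical log-probabilities to both options.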
