Implementation:Ggml org Llama cpp Perplexity Evaluation

Field	Value
Implementation Name	Perplexity Evaluation
Doc Type	Wrapper Doc
Topic	Model Quantization
Workflow	Model_Quantization
Category	Quality Evaluation
Repository	Ggml_org_Llama_cpp

Overview

Description

The perplexity evaluation module implements two primary model quality assessment functions: perplexity() for measuring language modeling quality on raw text corpora, and hellaswag_score() for evaluating common-sense reasoning accuracy. These functions are used after quantization to validate that the quantized model maintains acceptable quality relative to the full-precision baseline.

The perplexity() function tokenizes a test corpus, processes it in context-sized chunks with batched decoding, and computes the exponential of the average negative log-likelihood. The hellaswag_score() function evaluates the model on the HellaSwag benchmark by comparing log-probabilities of candidate continuations and reporting accuracy with Wilson score confidence intervals.
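Written out, the quantity described above is the following, where N is the number of scored tokens and p(x_i | x_{<i}) is the probability the model assigns to token x_i given the tokens before it. The symbols are chosen here for illustration and do not appear in the source file.

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)
```

Lower values mean the model assigns higher probability to the test text, which is why an increase after quantization is read as a quality loss.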

Usage

Both functions are called by the llama-perplexity command-line tool, which loads a model and test data, then dispatches to the appropriate evaluation function based on the command-line arguments.

Code Reference

Source Location

perplexity function: tools/perplexity/perplexity.cpp (lines 441-659)
hellaswag_score function: tools/perplexity/perplexity.cpp (lines 741-1012)
results struct: tools/perplexity/perplexity.cpp (lines 25-30)

Signature

struct results_perplexity {
    std::vector<llama_token> tokens;
    double                   ppl_value;
    std::vector<float>       logits;
    std::vector<float>       probs;
};

// Compute perplexity over a text corpus
static results_perplexity perplexity(
    llama_context * ctx,
    const common_params & params,
    const int32_t n_ctx);

// Compute HellaSwag accuracy score
static void hellaswag_score(
    llama_context * ctx,
    const common_params & params);

Import

#include "common.h"
#include "llama.h"

I/O Contract

perplexity()

Direction	Type	Description
Input (ctx)	`llama_context *`	Initialized llama context with a loaded model
Input (params)	`const common_params &`	Runtime parameters including: `prompt` (the test text), `n_ctx` (context size), `n_batch` (batch size), `n_chunks` (max chunks to evaluate, -1 for all), `ppl_stride` (stride for v2 mode), `ppl_output_type` (output format), `logits_file` (optional logits dump path)
Input (n_ctx)	`int32_t`	Context window size used for chunking the test corpus
Output	`results_perplexity`	Struct containing: tokenized input, computed perplexity value, per-token logit history, per-token probability history
Side Effect	stdout	Prints per-chunk perplexity progress and final estimate with confidence interval
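
How `n_ctx`, `n_batch`, and `n_chunks` interact can be sketched as below. This is a minimal illustration, not the code from perplexity.cpp: `chunk_plan` and `plan_chunks` are hypothetical names, the remainder of the corpus that does not fill a whole chunk is assumed to be dropped, and the number of chunks per batch is assumed to be `n_batch / n_ctx` (which is consistent with a log line such as n_ctx=512, batch_size=2048, n_seq=4).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type for this sketch; not part of llama.cpp.
struct chunk_plan {
    int n_chunk; // number of chunks that will be evaluated
    int n_seq;   // number of chunks that fit into one batch
};

// n_tokens:       length of the tokenized corpus
// n_chunks_param: requested chunk limit, -1 meaning "all chunks"
static chunk_plan plan_chunks(int n_tokens, int n_ctx, int n_batch, int n_chunks_param) {
    assert(n_ctx > 0 && n_batch > 0);
    // Assumption: a trailing partial chunk is discarded.
    const int n_chunk_max = n_tokens / n_ctx;
    const int n_chunk = n_chunks_param < 0 ? n_chunk_max
                                           : std::min(n_chunks_param, n_chunk_max);
    // Assumption: at least one chunk is processed per batch.
    const int n_seq = std::max(1, n_batch / n_ctx);
    return { n_chunk, n_seq };
}
```

For a hypothetical 300,000-token corpus, `plan_chunks(300000, 512, 2048, 100)` yields 100 chunks with 4 chunks per batch, while passing -1 for the last argument yields all 585 full chunks.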

Processing steps:

Tokenizes the input text with BOS token handling
Validates that the corpus has at least 2 * n_ctx tokens
Divides tokens into n_ctx-sized chunks
For each chunk: clears the KV cache, decodes in batches, and collects logits only for the second half of the context window, so every scored token has at least n_ctx / 2 tokens of preceding context
Computes per-token negative log-likelihood using softmax over vocabulary logits
Accumulates total NLL and NLL-squared for mean and standard deviation
Returns exp(mean_nll) as the perplexity value and reports an uncertainty of +/- ppl * (standard error of the mean NLL)
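
The scoring and aggregation steps above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code from perplexity.cpp: `token_nll`, `ppl_estimate`, and `estimate_ppl` are hypothetical names, and the spread term is taken to be the standard error of the mean NLL, scaled by the perplexity (a first-order error propagation through exp).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Negative log-likelihood of token `target` given one row of vocabulary
// logits, using a numerically stable log-softmax.
static double token_nll(const std::vector<float> & logits, std::size_t target) {
    assert(target < logits.size());
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum_exp = 0.0;
    for (const float l : logits) {
        sum_exp += std::exp(double(l) - max_logit);
    }
    // -log softmax(target) = log(sum_exp) - (logits[target] - max_logit)
    return std::log(sum_exp) - (double(logits[target]) - max_logit);
}

// Hypothetical result type for this sketch.
struct ppl_estimate {
    double ppl;         // exp(mean NLL)
    double uncertainty; // ppl * standard error of the mean NLL
};

// Aggregates per-token NLL values from running sums of NLL and NLL squared.
static ppl_estimate estimate_ppl(const std::vector<double> & nlls) {
    assert(nlls.size() > 1);
    const double count = double(nlls.size());
    double sum = 0.0;
    double sum2 = 0.0;
    for (const double v : nlls) {
        sum  += v;
        sum2 += v * v;
    }
    const double mean = sum / count;
    // Clamp at zero to guard against tiny negative values from rounding.
    const double var = std::max(0.0, sum2 / count - mean * mean);
    const double sem = std::sqrt(var / (count - 1.0));
    const double ppl = std::exp(mean);
    return { ppl, sem * ppl };
}
```

As a sanity check, a model that spreads probability uniformly over a 4-token vocabulary has an NLL of log(4) per token and therefore a perplexity of 4.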

hellaswag_score()

Direction	Type	Description
Input (ctx)	`llama_context *`	Initialized llama context with a loaded model
Input (params)	`const common_params &`	Runtime parameters including: `prompt` (HellaSwag dataset in 6-lines-per-task format), `hellaswag_tasks` (number of tasks to evaluate), `n_batch` (batch size)
Output	void	Results printed to stdout
Side Effect	stdout	Prints the cumulative accuracy after each task, with Wilson score 95% confidence intervals
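
The 6-lines-per-task input format (context, gold label, then four endings) can be read with a parser along these lines. This is a hedged sketch: `hs_task` and `parse_hellaswag` are hypothetical names, the gold label is assumed to be a zero-based index in the range 0 to 3, and trailing lines that do not form a complete group are assumed to be ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for one parsed task.
struct hs_task {
    std::string context;
    std::size_t gold_ending_idx = 0;
    std::string ending[4];
};

// Splits the prompt into lines and groups them six at a time.
static std::vector<hs_task> parse_hellaswag(const std::string & prompt) {
    std::vector<std::string> lines;
    std::istringstream in(prompt);
    for (std::string line; std::getline(in, line); ) {
        lines.push_back(line);
    }

    std::vector<hs_task> tasks;
    // Assumption: an incomplete trailing group is skipped.
    for (std::size_t i = 0; i + 6 <= lines.size(); i += 6) {
        hs_task t;
        t.context         = lines[i];
        t.gold_ending_idx = std::stoul(lines[i + 1]); // throws on non-numeric input
        assert(t.gold_ending_idx < 4);
        for (std::size_t k = 0; k < 4; ++k) {
            t.ending[k] = lines[i + 2 + k];
        }
        tasks.push_back(t);
    }
    return tasks;
}
```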

Processing steps:

Parses the prompt into HellaSwag tasks (6 lines per task: context, gold label, 4 endings)
When fewer tasks are requested than the dataset contains, selects tasks in random order (fixed seed = 1, so runs are reproducible)
Tokenizes each task's context + ending combinations, computing common prefixes
Batches multiple tasks into a single decode pass, sharing common prefix tokens across 4 sequence IDs
Computes normalized log-probabilities for each ending
Selects the ending with the highest average log-probability as the prediction
Tracks accuracy and computes Wilson score confidence intervals
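
The last three steps can be sketched as follows. This is a minimal illustration rather than the code from perplexity.cpp: `pick_ending`, `interval`, and `wilson_interval` are hypothetical names, "normalized" is taken to mean the sum of token log-probabilities divided by the ending's token count, and z = 1.96 is assumed for the 95% level.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Returns the index of the ending with the highest length-normalized
// log-probability. ending_logprobs[k] holds the per-token log-probabilities
// of ending k.
static std::size_t pick_ending(const std::vector<std::vector<double>> & ending_logprobs) {
    assert(!ending_logprobs.empty());
    std::size_t best = 0;
    double best_avg = -std::numeric_limits<double>::infinity();
    for (std::size_t k = 0; k < ending_logprobs.size(); ++k) {
        assert(!ending_logprobs[k].empty());
        double sum = 0.0;
        for (const double lp : ending_logprobs[k]) {
            sum += lp;
        }
        const double avg = sum / double(ending_logprobs[k].size());
        if (avg > best_avg) {
            best_avg = avg;
            best     = k;
        }
    }
    return best;
}

// Hypothetical result type: bounds expressed as fractions in [0, 1].
struct interval {
    double lo;
    double hi;
};

// Wilson score interval for n_correct successes out of n trials.
static interval wilson_interval(std::size_t n_correct, std::size_t n, double z = 1.96) {
    assert(n > 0 && n_correct <= n);
    const double nn     = double(n);
    const double p      = double(n_correct) / nn;
    const double z2     = z * z;
    const double denom  = 1.0 + z2 / nn;
    const double center = (p + z2 / (2.0 * nn)) / denom;
    const double half   = z * std::sqrt(p * (1.0 - p) / nn + z2 / (4.0 * nn * nn)) / denom;
    return { center - half, center + half };
}
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative when only a handful of tasks have been scored, which suits a running accuracy that is printed from the first task onward.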

Usage Examples

Example 1: Measure perplexity on WikiText-2

# Assumes the WikiText-2 raw test set has already been downloaded to wikitext-2-raw/
# Run perplexity evaluation on the first 100 chunks
./llama-perplexity \
    -m model-q4_k_m.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512 \
    --chunks 100

Example output (values are illustrative):

perplexity: calculating perplexity over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
[1]6.8234,[2]7.1023,...,[100]6.9456
Final estimate: PPL = 6.9456 +/- 0.03211

Example 2: Compare quantized vs baseline perplexity

# Baseline (F16)
./llama-perplexity -m model-f16.gguf -f wiki.test.raw
# Output: PPL = 6.2100

# Quantized (Q4_K_M)
./llama-perplexity -m model-q4_k_m.gguf -f wiki.test.raw
# Output: PPL = 6.3854
# Delta: +0.1754
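
The delta in this comparison is plain arithmetic, and it is often easier to judge as a percentage of the baseline. A small sketch, where `ppl_delta` and `compare_ppl` are hypothetical helper names and not part of llama-perplexity:

```cpp
#include <cassert>

// Hypothetical result type for this sketch.
struct ppl_delta {
    double absolute;     // quantized - baseline
    double relative_pct; // increase as a percentage of the baseline
};

// Compares a quantized model's perplexity against the baseline's.
static ppl_delta compare_ppl(double baseline, double quantized) {
    assert(baseline > 0.0);
    const double diff = quantized - baseline;
    return { diff, 100.0 * diff / baseline };
}
```

With the figures above, `compare_ppl(6.2100, 6.3854)` gives an absolute delta of about 0.1754, which is roughly a 2.8% increase over the F16 baseline. Both runs must use the same test file and context size for the comparison to be meaningful.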

Example 3: HellaSwag evaluation

./llama-perplexity \
    -m model-q4_k_m.gguf \
    --hellaswag \
    --hellaswag-tasks 400 \
    -f hellaswag_val_full.txt

Example output (values are illustrative):

task    acc_norm    95% confidence interval
1       100.00000000%   [20.2442%, 100.0000%]
2       50.00000000%    [14.5765%, 85.4235%]
...
400     78.50000000%    [74.1234%, 82.3456%]

Example 4: Save logits for detailed analysis

./llama-perplexity \
    -m model-q4_k_m.gguf \
    -f wiki.test.raw \
    --logits-file logits-q4km.bin

Related Pages

Principle:Ggml_org_Llama_cpp_Quantization_Validation
