Implementation:Ggml org Llama cpp Perplexity Evaluation
| Field | Value |
|---|---|
| Implementation Name | Perplexity Evaluation |
| Doc Type | Wrapper Doc |
| Topic | Model Quantization |
| Workflow | Model_Quantization |
| Category | Quality Evaluation |
| Repository | Ggml_org_Llama_cpp |
Overview
Description
The perplexity evaluation module implements two primary model quality assessment functions: perplexity() for measuring language modeling quality on raw text corpora, and hellaswag_score() for evaluating common-sense reasoning accuracy. These functions are used after quantization to validate that the quantized model maintains acceptable quality relative to the full-precision baseline.
The perplexity() function tokenizes a test corpus, processes it in context-sized chunks with batched decoding, and computes the exponential of the average negative log-likelihood. The hellaswag_score() function evaluates the model on the HellaSwag benchmark by comparing log-probabilities of candidate continuations and reporting accuracy with Wilson score confidence intervals.
Usage
Both functions are called by the llama-perplexity command-line tool, which loads a model and test data, then dispatches to the appropriate evaluation function based on the command-line arguments.
Code Reference
Source Location
- perplexity function: tools/perplexity/perplexity.cpp (lines 441-659)
- hellaswag_score function: tools/perplexity/perplexity.cpp (lines 741-1012)
- results struct: tools/perplexity/perplexity.cpp (lines 25-30)
Signature
struct results_perplexity {
    std::vector<llama_token> tokens;
    double                   ppl_value;
    std::vector<float>       logits;
    std::vector<float>       probs;
};

// Compute perplexity over a text corpus
static results_perplexity perplexity(
    llama_context * ctx,
    const common_params & params,
    const int32_t n_ctx);

// Compute HellaSwag accuracy score
static void hellaswag_score(
    llama_context * ctx,
    const common_params & params);
Import
#include "common.h"
#include "llama.h"
I/O Contract
perplexity()
| Direction | Type | Description |
|---|---|---|
| Input (ctx) | llama_context * | Initialized llama context with a loaded model |
| Input (params) | const common_params & | Runtime parameters including: prompt (the test text), n_ctx (context size), n_batch (batch size), n_chunks (max chunks to evaluate, -1 for all), ppl_stride (stride for v2 mode), ppl_output_type (output format), logits_file (optional logits dump path) |
| Input (n_ctx) | int32_t | Context window size used for chunking the test corpus |
| Output | results_perplexity | Struct containing: tokenized input, computed perplexity value, per-token logit history, per-token probability history |
| Side Effect | stdout | Prints per-chunk perplexity progress and final estimate with confidence interval |
Processing steps:
- Tokenizes the input text with BOS token handling
- Validates that the corpus has at least 2 * n_ctx tokens
- Divides tokens into n_ctx-sized chunks
- For each chunk: clears KV cache, decodes in batches, collects logits for the second half of the context window
- Computes per-token negative log-likelihood using softmax over vocabulary logits
- Accumulates total NLL and NLL-squared for mean and standard deviation
- Returns exp(mean_nll) as the perplexity value, reported with an uncertainty of +/- stddev * ppl, where stddev is the standard deviation of the mean NLL estimate (its standard error)
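The NLL and accumulation steps above can be sketched in isolation as follows. This is a minimal illustration and not the code in perplexity.cpp: the names token_nll, ppl_estimate, and estimate_ppl are hypothetical, and the uncertainty is assumed to be the standard error of the mean NLL scaled by the perplexity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: negative log-likelihood of `target` under a softmax
// over one row of vocabulary logits. Subtracting the maximum logit keeps
// the exponentials from overflowing.
// Precondition: logits is non-empty and target < logits.size().
double token_nll(const std::vector<float> & logits, std::size_t target) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum_exp = 0.0;
    for (const float l : logits) {
        sum_exp += std::exp(static_cast<double>(l - max_logit));
    }
    // -log softmax(target) = log(sum_exp) - (logits[target] - max_logit)
    return std::log(sum_exp) - static_cast<double>(logits[target] - max_logit);
}

struct ppl_estimate {
    double ppl;         // exp(mean NLL)
    double uncertainty; // standard error of the mean NLL, scaled by ppl (assumption)
};

// Hypothetical helper: turn per-token NLL values into a perplexity estimate
// by accumulating the sum and the sum of squares.
// Precondition: nlls is non-empty.
ppl_estimate estimate_ppl(const std::vector<double> & nlls) {
    double nll  = 0.0;
    double nll2 = 0.0;
    for (const double v : nlls) {
        nll  += v;
        nll2 += v * v;
    }
    const double count = static_cast<double>(nlls.size());
    const double mean  = nll / count;
    const double var   = nll2 / count - mean * mean;
    const double ppl   = std::exp(mean);

    double uncertainty = 0.0;
    if (count > 1 && var > 0) {
        uncertainty = std::sqrt(var / (count - 1)) * ppl;
    }
    return { ppl, uncertainty };
}
```

With a uniform distribution over four tokens, token_nll returns log(4) for every target, so estimate_ppl yields a perplexity of 4: the model is as uncertain as a fair four-way choice.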
hellaswag_score()
| Direction | Type | Description |
|---|---|---|
| Input (ctx) | llama_context * | Initialized llama context with a loaded model |
| Input (params) | const common_params & | Runtime parameters including: prompt (HellaSwag dataset in 6-lines-per-task format), hellaswag_tasks (number of tasks to evaluate), n_batch (batch size) |
| Output | void | Results printed to stdout |
| Side Effect | stdout | Prints the cumulative accuracy after each task with Wilson score 95% confidence intervals |
Processing steps:
- Parses the prompt into HellaSwag tasks (6 lines per task: context, gold label, 4 endings)
- Optionally randomizes task order (deterministic seed = 1)
- Tokenizes each task's context + ending combinations, computing common prefixes
- Batches multiple tasks into a single decode pass, sharing common prefix tokens across 4 sequence IDs
- Computes normalized log-probabilities for each ending
- Selects the ending with the highest average log-probability as the prediction
- Tracks accuracy and computes Wilson score confidence intervals
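The last three steps can be illustrated with a small standalone sketch. The helper names pick_ending and wilson_interval are hypothetical, and z = 1.96 is an assumed critical value for a 95% interval; the source does not state which constant the tool uses.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: index of the ending with the highest length-normalized
// log-probability. sum_logprob[i] is the summed log-probability of ending i's
// tokens and n_tokens[i] is the number of tokens scored for that ending.
// Precondition: both vectors are non-empty, equal in size, and n_tokens[i] > 0.
std::size_t pick_ending(const std::vector<double> & sum_logprob,
                        const std::vector<std::size_t> & n_tokens) {
    std::size_t best = 0;
    double best_avg = sum_logprob[0] / static_cast<double>(n_tokens[0]);
    for (std::size_t i = 1; i < sum_logprob.size(); ++i) {
        const double avg = sum_logprob[i] / static_cast<double>(n_tokens[i]);
        if (avg > best_avg) {
            best_avg = avg;
            best = i;
        }
    }
    return best;
}

// Wilson score interval for `correct` successes out of `total` trials.
// Returns {lower, upper} as fractions in [0, 1]. Precondition: total > 0.
std::pair<double, double> wilson_interval(std::size_t correct, std::size_t total,
                                          double z = 1.96 /* assumed 95% value */) {
    const double n      = static_cast<double>(total);
    const double p      = static_cast<double>(correct) / n;
    const double z2     = z * z;
    const double denom  = 1.0 + z2 / n;
    const double center = (p + z2 / (2.0 * n)) / denom;
    const double half   = z / denom * std::sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n));
    return { center - half, center + half };
}
```

Dividing by the token count matters because endings differ in length: a short ending with summed log-probability -6 over 2 tokens (average -3) loses to a longer one with -9 over 6 tokens (average -1.5), even though its raw sum is higher. The Wilson interval stays inside [0, 1] even when the observed accuracy is 0% or 100%, which is why it behaves sensibly for the first few tasks.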
Usage Examples
Example 1: Measure perplexity on WikiText-2
# Download WikiText-2 test set
# Run perplexity evaluation
./llama-perplexity \
-m model-q4_k_m.gguf \
-f wikitext-2-raw/wiki.test.raw \
-c 512 \
--chunks 100
Expected output:
perplexity: calculating perplexity over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
[1]6.8234,[2]7.1023,...,[100]6.9456
Final estimate: PPL = 6.9456 +/- 0.03211
Example 2: Compare quantized vs baseline perplexity
# Baseline (F16)
./llama-perplexity -m model-f16.gguf -f wiki.test.raw
# Output: PPL = 6.2100
# Quantized (Q4_K_M)
./llama-perplexity -m model-q4_k_m.gguf -f wiki.test.raw
# Output: PPL = 6.3854
# Delta: +0.1754
Example 3: HellaSwag evaluation
./llama-perplexity \
-m model-q4_k_m.gguf \
--hellaswag \
--hellaswag-tasks 400 \
-f hellaswag_val_full.txt
Expected output:
task acc_norm 95% confidence interval
1 100.00000000% [20.2442%, 100.0000%]
2 50.00000000% [14.5765%, 85.4235%]
...
400 78.50000000% [74.1234%, 82.3456%]
Example 4: Save logits for detailed analysis
./llama-perplexity \
-m model-q4_k_m.gguf \
-f wiki.test.raw \
--logits-file logits-q4km.bin