Implementation:Ggml org Llama cpp Perplexity Function
| Aspect | Detail |
|---|---|
| Implementation Name | Perplexity Function |
| Doc Type | API Doc |
| Domain | Model Perplexity Evaluation |
| Purpose | Core computation functions for perplexity, HellaSwag, and Winogrande evaluation |
| Related Workflow | Model_Perplexity_Evaluation |
| Core | Yes |
Overview
Description
This implementation documents the three primary evaluation functions in the llama-perplexity tool:
- perplexity(): Computes standard perplexity (PPL) over a text dataset by processing it in fixed-size chunks and taking the exponential of the average negative log-likelihood.
- hellaswag_score(): Evaluates commonsense reasoning by scoring 4-way multiple-choice tasks from the HellaSwag dataset.
- winogrande_score(): Evaluates coreference resolution by scoring 2-way fill-in tasks from the Winogrande dataset.
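Written out, the quantity that perplexity() reports is the following, where N is the number of scored tokens and p(x_i | x_{<i}) is the probability the model assigns to the correct next token. The symbols are notation chosen here for illustration, not identifiers from the source.

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)
```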
Usage
main() calls one of these functions according to the evaluation mode selected via CLI arguments; only one runs per invocation. Each function receives the loaded llama_context and the parsed common_params, and perplexity() additionally takes the context size n_ctx.
Code Reference
| Aspect | Detail |
|---|---|
| Source Location (perplexity) | tools/perplexity/perplexity.cpp:441-659 |
| Source Location (hellaswag) | tools/perplexity/perplexity.cpp:741-1012 |
| Source Location (dispatch) | tools/perplexity/perplexity.cpp:2051-2061 |
| Import | #include "common.h", #include "llama.h" |
perplexity() signature:
static results_perplexity perplexity(llama_context * ctx,
                                     const common_params & params,
                                     const int32_t n_ctx);
hellaswag_score() signature:
static void hellaswag_score(llama_context * ctx, const common_params & params);
winogrande_score() signature:
static void winogrande_score(llama_context * ctx, const common_params & params);
Return type for perplexity:
struct results_perplexity {
    std::vector<llama_token> tokens;
    double                   ppl_value;
    std::vector<float>       logits;
    std::vector<float>       probs;
};
Dispatch logic (main()):
if (params.hellaswag) {
    hellaswag_score(ctx, params);
} else if (params.winogrande) {
    winogrande_score(ctx, params);
} else if (params.multiple_choice) {
    multiple_choice_score(ctx, params);
} else if (params.kl_divergence) {
    kl_divergence(ctx, params);
} else {
    results = perplexity(ctx, params, n_ctx);
}
Perplexity core loop (perplexity.cpp:539-642, abbreviated):
// Evaluate over the second half of each context window
const int first = n_ctx/2;
for (int i = 0; i < n_chunk; i += n_seq) {
    const int start = i * n_ctx;
    const int end   = start + n_ctx;
    // Clear KV cache for fresh context
    llama_memory_clear(llama_get_memory(ctx), true);
    // Process in batches
    for (int j = 0; j < num_batches; ++j) {
        // Fill batch with tokens, enable logits for positions >= first
        // ...
        llama_decode(ctx, batch);
    }
    // Accumulate NLL from logits
    // process_logits() computes log-probability of correct next token
    // and adds to running nll and nll2 accumulators
    count += n_ctx - first - 1;
    // Report intermediate PPL: exp(nll / count)
}
// Final PPL computation
nll /= count;
const double ppl = exp(nll);
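The accumulation step that the abbreviated loop leaves as comments can be illustrated with a minimal sketch. It is not the code from perplexity.cpp: token_log_prob and nll_accumulator are hypothetical names, and the error estimate is an assumption about how the "+/-" figure in the tool's output could be derived from nll and nll2 (standard error of the mean NLL, propagated through exp).

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Log-probability of token `target` under a softmax over `logits`.
// The maximum logit is subtracted first so std::exp cannot overflow.
double token_log_prob(const std::vector<float> & logits, int target) {
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum_exp = 0.0;
    for (const float l : logits) {
        sum_exp += std::exp(double(l) - max_logit);
    }
    return double(logits[target]) - max_logit - std::log(sum_exp);
}

// Hypothetical accumulator; the field names follow nll, nll2 and count above.
struct nll_accumulator {
    double nll   = 0.0; // running sum of -log p
    double nll2  = 0.0; // running sum of (log p)^2
    int    count = 0;   // number of scored tokens

    void add(double log_prob) {
        nll  -= log_prob;
        nll2 += log_prob * log_prob;
        ++count;
    }

    // Returns {ppl, err}. Requires count > 0.
    // err is the standard error of the mean NLL scaled by ppl (first-order
    // error propagation through exp); it is 0 when it cannot be estimated.
    std::pair<double, double> estimate() const {
        const double mean = nll / count;
        const double ppl  = std::exp(mean);
        const double var  = nll2 / count - mean * mean;
        const double err  = (count > 1 && var > 0.0) ? std::sqrt(var / (count - 1)) * ppl : 0.0;
        return {ppl, err};
    }
};
```

For each scored position, the caller would pass the logits row and the id of the token that actually follows to token_log_prob, feed the result to add(), and call estimate() once all chunks are processed.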
HellaSwag core loop (perplexity.cpp:876-1007, abbreviated):
for (size_t i0 = 0; i0 < hs_task_count; i0++) {
    // Batch multiple tasks, sharing common prefix with 4 sequences each
    // Each ending gets its own sequence ID: s0+0, s0+1, s0+2, s0+3
    // ...
    // Decode all batched tasks
    decode_helper(ctx, batch, batch_logits, n_batch, n_vocab);
    // Compute normalized log-probability for each ending
    for (size_t i = i0; i < i1; ++i) {
        for (int s = 0; s < 4; ++s) {
            hs_cur.ending_logprob[s] /= hs_cur.ending_logprob_count[s];
        }
        // Select ending with maximum normalized log-prob
        // Compare with gold ending, accumulate accuracy
        if (ending_logprob_max_idx == hs_cur.gold_ending_idx) {
            acc += 1.0;
        }
        // Wilson score confidence interval
        double z   = za * za / double(i + 1);
        double cnf = z * sqrt(double(i+1) * (4.0*freq*(1-freq) + z)) / (za + za);
        double a   = (freq + z*0.5 - cnf) / (1.0 + z);
        double b   = (freq + z*0.5 + cnf) / (1.0 + z);
    }
}
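The confidence-interval lines at the end of the loop can be isolated into a standalone function for experimentation. The sketch below reuses the same expressions; wilson_interval is a hypothetical name, and the default za of 1.95996398454 (the two-sided 95% normal quantile) is an assumption, since the excerpt does not show where za is defined.

```cpp
#include <cmath>
#include <utility>

// Wilson score interval for `correct` successes out of `n` trials.
// za is the two-sided normal quantile (assumed default: 95%).
// Returns {lower, upper} as fractions in [0, 1]. Requires n > 0.
std::pair<double, double> wilson_interval(double correct, int n, double za = 1.95996398454) {
    const double freq = correct / double(n);
    const double z    = za * za / double(n);
    const double cnf  = z * std::sqrt(double(n) * (4.0 * freq * (1.0 - freq) + z)) / (za + za);
    const double a    = (freq + z * 0.5 - cnf) / (1.0 + z);
    const double b    = (freq + z * 0.5 + cnf) / (1.0 + z);
    return {a, b};
}
```

Unlike the plain normal approximation, this interval stays inside [0, 1] even when the running accuracy is 0% or 100%, which matters for the first few tasks of a run.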
I/O Contract
perplexity():
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | ctx | llama_context * | Inference context with loaded model |
| Input | params | const common_params & | Parameters including prompt (loaded dataset text), n_chunks, n_batch, ppl_stride, logits_file |
| Input | n_ctx | int32_t | Context window size for evaluation |
| Output | (return) | results_perplexity | Contains tokens, final PPL value, per-token logits, and per-token probabilities |
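How n_ctx, n_batch, and n_chunks determine the shape of the evaluation loop can be sketched as below. This is an assumption-laden reading of the loop variables n_chunk, n_seq, and num_batches, not a copy of the source; chunk_plan and plan_chunks are hypothetical names, and a negative chunk request is assumed to mean "use every available chunk".

```cpp
#include <algorithm>

// Hypothetical summary of the loop dimensions used by the perplexity loop.
struct chunk_plan {
    int n_chunk;     // number of n_ctx-sized chunks to evaluate
    int n_seq;       // chunks processed together per outer-loop iteration
    int num_batches; // batches needed to cover one chunk of n_ctx tokens
};

chunk_plan plan_chunks(int n_tokens, int n_ctx, int n_batch, int n_chunks_requested) {
    const int n_chunk_max = n_tokens / n_ctx; // only whole chunks are scored
    chunk_plan p;
    p.n_chunk     = n_chunks_requested < 0 ? n_chunk_max
                                           : std::min(n_chunks_requested, n_chunk_max);
    p.n_seq       = std::max(1, n_batch / n_ctx);    // at least one sequence
    p.num_batches = (n_ctx + n_batch - 1) / n_batch; // ceiling division
    return p;
}
```

Under these assumptions, a batch size that is a multiple of the context size lets several chunks be decoded in parallel, while a batch size smaller than the context splits each chunk across multiple llama_decode calls.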
hellaswag_score():
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | ctx | llama_context * | Inference context with loaded model |
| Input | params | const common_params & | Parameters including prompt (loaded HellaSwag data), hellaswag_tasks, n_batch |
| Output | (stdout) | text | Running accuracy (acc_norm) after each task, with a 95% Wilson score confidence interval |
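The selection rule behind acc_norm (choose the ending whose length-normalized log-probability is highest) can be shown in isolation. This is a minimal sketch with a hypothetical function name, pick_ending; ties are resolved in favor of the lower index, which is an assumption rather than documented behavior.

```cpp
#include <array>
#include <cstddef>

// Returns the index (0..3) of the ending with the highest length-normalized
// log-probability. logprob_sum[s] is the summed log-probability of ending s;
// token_count[s] is the number of tokens it was summed over (must be > 0).
std::size_t pick_ending(const std::array<double, 4> & logprob_sum,
                        const std::array<std::size_t, 4> & token_count) {
    std::size_t best       = 0;
    double      best_score = logprob_sum[0] / double(token_count[0]);
    for (std::size_t s = 1; s < 4; ++s) {
        const double score = logprob_sum[s] / double(token_count[s]);
        if (score > best_score) {
            best_score = score;
            best       = s;
        }
    }
    return best;
}
```

Dividing by the token count keeps short endings from winning simply because they accumulate fewer negative log-probability terms.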
winogrande_score():
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | ctx | llama_context * | Inference context with loaded model |
| Input | params | const common_params & | Parameters including prompt (loaded Winogrande data), winogrande_tasks, n_batch |
| Output | (stdout) | text | Per-task accuracy and confidence intervals |
Usage Examples
Example 1: Standard perplexity evaluation
# Compute perplexity over WikiText-2 with 512 context window
./llama-perplexity -m model.gguf \
-f wikitext-2-raw/wiki.test.raw \
--ctx-size 512 \
--batch-size 2048 \
-ngl 99
# Output:
# perplexity: calculating perplexity over 114 chunks, n_ctx=512, batch_size=2048, n_seq=4
# [1]8.2134,[2]7.9872,...,[114]6.3215
# Final estimate: PPL = 6.3215 +/- 0.04821
Example 2: HellaSwag evaluation
./llama-perplexity -m model.gguf \
-f hellaswag_val_full.txt \
--hellaswag \
--hellaswag-tasks 10042 \
-ngl 99
# Output:
# task acc_norm 95% confidence interval
# 1 100.00000000% [20.2434%, 100.0000%]
# 2 50.00000000% [13.5695%, 86.4305%]
# ...
# 10042 76.12345678% [75.2891%, 76.9578%]
Example 3: Winogrande evaluation
./llama-perplexity -m model.gguf \
-f winogrande-debiased-eval.csv \
--winogrande \
--winogrande-tasks 1267 \
-ngl 99