Implementation:Ggml org Llama cpp Perplexity Function

Aspect	Detail
Implementation Name	Perplexity Function
Doc Type	API Doc
Domain	Model Perplexity Evaluation
Purpose	Core computation functions for perplexity, HellaSwag, and Winogrande evaluation
Related Workflow	Model_Perplexity_Evaluation
Core	Yes

Overview

Description

This page documents the three primary evaluation functions in the llama-perplexity tool:

perplexity(): Computes standard perplexity (PPL) over a text dataset by processing it in fixed-size chunks and computing the exponential of the average negative log-likelihood
hellaswag_score(): Evaluates commonsense reasoning by scoring 4-way multiple choice tasks from the HellaSwag dataset
winogrande_score(): Evaluates coreference resolution by scoring 2-way fill-in tasks from the Winogrande dataset
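
For reference, the quantity that perplexity() reports can be written as a formula. The notation is introduced here for illustration and is not taken from the source: N is the number of scored tokens and p(x_i | x_<i) is the probability the model assigns to token i given the tokens before it.

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)
```

A model that assigns uniform probability over V candidate tokens at every position has PPL = V, so lower values indicate a better fit to the text.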

Usage

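main() calls exactly one evaluation function per invocation, chosen by the evaluation mode given on the command line. Each function receives the loaded llama_context and the parsed common_params. perplexity() additionally takes the context size n_ctx and returns a results_perplexity, whereas hellaswag_score() and winogrande_score() return nothing and print their results.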

Code Reference

Aspect	Detail
Source Location (perplexity)	`tools/perplexity/perplexity.cpp:441-659`
Source Location (hellaswag)	`tools/perplexity/perplexity.cpp:741-1012`
Source Location (dispatch)	`tools/perplexity/perplexity.cpp:2051-2061`
Import	`#include "common.h"`, `#include "llama.h"`

perplexity() signature:

static results_perplexity perplexity(llama_context * ctx,
                                      const common_params & params,
                                      const int32_t n_ctx);

hellaswag_score() signature:

static void hellaswag_score(llama_context * ctx, const common_params & params);

winogrande_score() signature:

static void winogrande_score(llama_context * ctx, const common_params & params);

Return type for perplexity:

struct results_perplexity {
    std::vector<llama_token> tokens;
    double                   ppl_value;
    std::vector<float>       logits;
    std::vector<float>       probs;
};

Dispatch logic (main()):

if (params.hellaswag) {
    hellaswag_score(ctx, params);
} else if (params.winogrande) {
    winogrande_score(ctx, params);
} else if (params.multiple_choice) {
    multiple_choice_score(ctx, params);
} else if (params.kl_divergence) {
    kl_divergence(ctx, params);
} else {
    results = perplexity(ctx, params, n_ctx);
}

Perplexity core loop (perplexity.cpp:539-642, abbreviated):

// Evaluate over the second half of each context window
const int first = n_ctx/2;

for (int i = 0; i < n_chunk; i += n_seq) {
    const int start = i * n_ctx;
    const int end   = start + n_ctx;

    // Clear KV cache for fresh context
    llama_memory_clear(llama_get_memory(ctx), true);

    // Process in batches
    for (int j = 0; j < num_batches; ++j) {
        // Fill batch with tokens, enable logits for positions >= first
        // ...
        llama_decode(ctx, batch);
    }

    // Accumulate NLL from logits
    // process_logits() computes log-probability of correct next token
    // and adds to running nll and nll2 accumulators
    count += n_ctx - first - 1;

    // Report intermediate PPL: exp(nll / count)
}

// Final PPL computation
nll /= count;
const double ppl = exp(nll);
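
The accumulation step that the loop above abbreviates can be sketched as a standalone program fragment. This is a minimal illustration and not the code in perplexity.cpp: the names `NllAccumulator`, `token_nll`, `accumulate`, `finalize`, and `PplEstimate` are hypothetical, and the error estimate (standard error of the mean NLL, scaled by the PPL) is an assumption about how a second accumulator such as nll2 would typically be used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Running sums over all scored tokens (hypothetical helper type).
struct NllAccumulator {
    double nll   = 0.0;  // sum of negative log-likelihoods
    double nll2  = 0.0;  // sum of squared negative log-likelihoods
    int    count = 0;    // number of scored tokens
};

// Negative log-likelihood of `target` under softmax(logits).
// Subtracting the maximum logit first keeps exp() from overflowing.
double token_nll(const std::vector<float> & logits, std::size_t target) {
    assert(!logits.empty() && target < logits.size());
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum_exp = 0.0;
    for (const float l : logits) {
        sum_exp += std::exp(static_cast<double>(l - max_logit));
    }
    return std::log(sum_exp) - static_cast<double>(logits[target] - max_logit);
}

// Add one scored token to the running sums.
void accumulate(NllAccumulator & acc, const std::vector<float> & logits, std::size_t target) {
    const double v = token_nll(logits, target);
    acc.nll   += v;
    acc.nll2  += v * v;
    acc.count += 1;
}

struct PplEstimate {
    double ppl;  // exp(mean NLL)
    double err;  // assumed uncertainty: standard error of the mean NLL times ppl
};

PplEstimate finalize(const NllAccumulator & acc) {
    assert(acc.count > 1);
    const double mean = acc.nll / acc.count;
    const double var  = acc.nll2 / acc.count - mean * mean;
    const double ppl  = std::exp(mean);
    const double sem  = var > 0.0 ? std::sqrt(var / (acc.count - 1)) : 0.0;
    return { ppl, sem * ppl };
}
```

Calling `accumulate` once per scored position and `finalize` at the end reproduces the `nll /= count; exp(nll)` pattern shown above. Keeping the squared sum alongside the plain sum is what allows an uncertainty to be reported without storing every per-token value.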

HellaSwag core loop (perplexity.cpp:876-1007, abbreviated):

for (size_t i0 = 0; i0 < hs_task_count; i0++) {
    // Batch multiple tasks, sharing common prefix with 4 sequences each
    // Each ending gets its own sequence ID: s0+0, s0+1, s0+2, s0+3
    // ...

    // Decode all batched tasks
    decode_helper(ctx, batch, batch_logits, n_batch, n_vocab);

    // Compute normalized log-probability for each ending
    for (size_t i = i0; i < i1; ++i) {
        for (int s = 0; s < 4; ++s) {
            hs_cur.ending_logprob[s] /= hs_cur.ending_logprob_count[s];
        }

        // Select ending with maximum normalized log-prob
        // Compare with gold ending, accumulate accuracy
        if (ending_logprob_max_idx == hs_cur.gold_ending_idx) {
            acc += 1.0;
        }

        // Wilson score confidence interval
        double z   = za * za / double(i + 1);
        double cnf = z * sqrt(double(i+1) * (4.0*freq*(1-freq) + z)) / (za + za);
        double a   = (freq + z*0.5 - cnf) / (1.0 + z);
        double b   = (freq + z*0.5 + cnf) / (1.0 + z);
    }
}
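
The two calculations in the loop above, choosing the ending with the highest length-normalized log-probability and computing the Wilson score interval, can be isolated into small functions. This is a sketch under stated assumptions: `pick_ending`, `Interval`, and `wilson_interval` are hypothetical names, and the default `za` is the standard-normal quantile for a 95% interval, which is assumed here to be what the tool uses given its "95% confidence interval" column.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the ending with the highest length-normalized log-probability.
// logprob_sum[s] is the summed log-probability of ending s and
// token_count[s] is the number of tokens it was summed over.
std::size_t pick_ending(const std::vector<double> & logprob_sum,
                        const std::vector<std::size_t> & token_count) {
    assert(!logprob_sum.empty() && logprob_sum.size() == token_count.size());
    assert(token_count[0] > 0);
    std::size_t best = 0;
    double best_score = logprob_sum[0] / double(token_count[0]);
    for (std::size_t s = 1; s < logprob_sum.size(); ++s) {
        assert(token_count[s] > 0);
        const double score = logprob_sum[s] / double(token_count[s]);
        if (score > best_score) {  // ties keep the earlier ending
            best_score = score;
            best = s;
        }
    }
    return best;
}

struct Interval {
    double lo;
    double hi;
};

// Wilson score interval for `correct` successes out of `n` tasks, using the
// same algebra as the listing above. za defaults to the 95% normal quantile
// (an assumption, see the lead-in).
Interval wilson_interval(std::size_t correct, std::size_t n, double za = 1.95996398454) {
    assert(n > 0 && correct <= n);
    const double freq = double(correct) / double(n);
    const double z    = za * za / double(n);
    const double cnf  = z * std::sqrt(double(n) * (4.0 * freq * (1.0 - freq) + z)) / (za + za);
    return { (freq + z * 0.5 - cnf) / (1.0 + z),
             (freq + z * 0.5 + cnf) / (1.0 + z) };
}
```

Dividing by the token count keeps longer endings from being penalized merely for having more terms in the sum. The Wilson interval stays inside [0, 1] even when the observed accuracy is 0% or 100%, which is why it behaves sensibly after only a handful of tasks.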

I/O Contract

perplexity():

Direction	Name	Type	Description
Input	ctx	`llama_context *`	Inference context with loaded model
Input	params	`const common_params &`	Parameters including `prompt` (loaded dataset text), `n_chunks`, `n_batch`, `ppl_stride`, `logits_file`
Input	n_ctx	`int32_t`	Context window size for evaluation
Output	(return)	`results_perplexity`	Contains tokens, final PPL value, per-token logits, and per-token probabilities

hellaswag_score():

Direction	Name	Type	Description
Input	ctx	`llama_context *`	Inference context with loaded model
Input	params	`const common_params &`	Parameters including `prompt` (loaded HellaSwag data), `hellaswag_tasks`, `n_batch`
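Output	(stdout)	text	Running accuracy after each task (`acc_norm`, the fraction of tasks so far whose highest-scoring ending by length-normalized log-probability matches the gold ending) and its Wilson score confidence interval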

winogrande_score():

Direction	Name	Type	Description
Input	ctx	`llama_context *`	Inference context with loaded model
Input	params	`const common_params &`	Parameters including `prompt` (loaded Winogrande data), `winogrande_tasks`, `n_batch`
Output	(stdout)	text	Per-task accuracy and confidence intervals

Usage Examples

Example 1: Standard perplexity evaluation

# Compute perplexity over WikiText-2 with 512 context window
./llama-perplexity -m model.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    --ctx-size 512 \
    --batch-size 2048 \
    -ngl 99

# Output:
# perplexity: calculating perplexity over 114 chunks, n_ctx=512, batch_size=2048, n_seq=4
# [1]8.2134,[2]7.9872,...,[114]6.3215
# Final estimate: PPL = 6.3215 +/- 0.04821
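
The chunk and sequence counts in that log line follow from simple integer arithmetic, sketched below. This is an illustration and not the tool's code: `ChunkPlan`, `plan_chunks`, and `scored_tokens` are hypothetical names. The relation n_seq = max(1, n_batch / n_ctx) is an assumption that matches the example (a batch size of 2048 with n_ctx = 512 gives n_seq = 4), and the per-chunk token count assumes each chunk contributes n_ctx - first - 1 scored positions, as the `count` update in the core loop suggests.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct ChunkPlan {
    int n_chunk;  // number of full n_ctx-sized chunks to evaluate
    int n_seq;    // chunks decoded together per outer-loop step
    int first;    // first scored position within each chunk (n_ctx / 2)
};

// n_chunks_limit < 0 means "use every full chunk" (hypothetical convention).
ChunkPlan plan_chunks(std::size_t n_tokens, int n_ctx, int n_batch, int n_chunks_limit = -1) {
    assert(n_ctx > 1 && n_batch > 0);
    // Only complete chunks are evaluated; a trailing partial chunk is dropped.
    int n_chunk = static_cast<int>(n_tokens / static_cast<std::size_t>(n_ctx));
    if (n_chunks_limit >= 0) {
        n_chunk = std::min(n_chunk, n_chunks_limit);
    }
    // Assumed relation: as many whole chunks as fit in one batch, at least one.
    const int n_seq = std::max(1, n_batch / n_ctx);
    return { n_chunk, n_seq, n_ctx / 2 };
}

// Number of tokens that contribute to the NLL under the assumption above.
int scored_tokens(const ChunkPlan & plan, int n_ctx) {
    return plan.n_chunk * (n_ctx - plan.first - 1);
}
```

For a hypothetical input of 58368 tokens, `plan_chunks(58368, 512, 2048)` yields 114 chunks processed 4 at a time, with scoring starting at position 256 of each chunk. Scoring only the second half of each window means every scored token has at least n_ctx / 2 tokens of preceding context.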

Example 2: HellaSwag evaluation

./llama-perplexity -m model.gguf \
    -f hellaswag_val_full.txt \
    --hellaswag \
    --hellaswag-tasks 10042 \
    -ngl 99

# Output:
# task  acc_norm  95% confidence interval
# 1     100.00000000%  [20.6549%, 100.0000%]
# 2     50.00000000%   [9.4531%, 90.5469%]
# ...
# 10042 76.12345678%   [75.2891%, 76.9578%]

Example 3: Winogrande evaluation

./llama-perplexity -m model.gguf \
    -f winogrande-debiased-eval.csv \
    --winogrande \
    --winogrande-tasks 1267 \
    -ngl 99
