Implementation:Ggml org Llama cpp Perplexity Function

From Leeroopedia
Aspect | Detail
Implementation Name | Perplexity Function
Doc Type | API Doc
Domain | Model Perplexity Evaluation
Purpose | Core computation functions for perplexity, HellaSwag, and Winogrande evaluation
Related Workflow | Model_Perplexity_Evaluation
Core | Yes

Overview

Description

This implementation documents the three primary evaluation functions in the llama-perplexity tool:

  • perplexity(): Computes standard perplexity (PPL) over a text dataset by splitting it into fixed-size chunks of n_ctx tokens and taking the exponential of the average negative log-likelihood per scored token
  • hellaswag_score(): Evaluates commonsense reasoning by scoring 4-way multiple-choice tasks from the HellaSwag dataset
  • winogrande_score(): Evaluates coreference resolution by scoring 2-way fill-in tasks from the Winogrande dataset
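
As a rough illustration of the perplexity definition in the first bullet, the sketch below computes PPL from a list of per-token log-probabilities. The helper name perplexity_from_logprobs is hypothetical and is not part of llama.cpp; it assumes natural-log probabilities of the correct next tokens have already been obtained.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not from perplexity.cpp): perplexity from the
// natural-log probabilities a model assigned to each correct next token.
double perplexity_from_logprobs(const std::vector<double> & logprobs) {
    assert(!logprobs.empty());
    double nll = 0.0;  // summed negative log-likelihood
    for (double lp : logprobs) {
        nll -= lp;
    }
    // PPL = exp(average negative log-likelihood)
    return std::exp(nll / double(logprobs.size()));
}
```

A model that assigns probability 0.25 to every correct token would score a perplexity of 4 under this definition, which matches the intuition of "choosing among four equally likely tokens".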

Usage

These functions are called from main() based on the evaluation mode selected via CLI arguments. Only one function runs per invocation. Each function receives the loaded llama_context and the parsed common_params.

Code Reference

Aspect | Detail
Source Location (perplexity) | tools/perplexity/perplexity.cpp:441-659
Source Location (hellaswag) | tools/perplexity/perplexity.cpp:741-1012
Source Location (dispatch) | tools/perplexity/perplexity.cpp:2051-2061
Import | #include "common.h", #include "llama.h"

perplexity() signature:

static results_perplexity perplexity(llama_context * ctx,
                                      const common_params & params,
                                      const int32_t n_ctx);

hellaswag_score() signature:

static void hellaswag_score(llama_context * ctx, const common_params & params);

winogrande_score() signature:

static void winogrande_score(llama_context * ctx, const common_params & params);

Return type for perplexity:

struct results_perplexity {
    std::vector<llama_token> tokens;
    double                   ppl_value;
    std::vector<float>       logits;
    std::vector<float>       probs;
};

Dispatch logic (main()):

if (params.hellaswag) {
    hellaswag_score(ctx, params);
} else if (params.winogrande) {
    winogrande_score(ctx, params);
} else if (params.multiple_choice) {
    multiple_choice_score(ctx, params);
} else if (params.kl_divergence) {
    kl_divergence(ctx, params);
} else {
    results = perplexity(ctx, params, n_ctx);
}

Perplexity core loop (perplexity.cpp:539-642, abbreviated):

// Evaluate over the second half of each context window
const int first = n_ctx/2;

for (int i = 0; i < n_chunk; i += n_seq) {
    const int start = i * n_ctx;
    const int end   = start + n_ctx;

    // Clear KV cache for fresh context
    llama_memory_clear(llama_get_memory(ctx), true);

    // Process in batches
    for (int j = 0; j < num_batches; ++j) {
        // Fill batch with tokens, enable logits for positions >= first
        // ...
        llama_decode(ctx, batch);
    }

    // Accumulate NLL from logits
    // process_logits() computes log-probability of correct next token
    // and adds to running nll and nll2 accumulators
    count += n_ctx - first - 1;

    // Report intermediate PPL: exp(nll / count)
}

// Final PPL computation
nll /= count;
const double ppl = exp(nll);
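
The comments in the loop above say that process_logits() turns logits into the log-probability of the correct next token and feeds the running nll and nll2 accumulators. A minimal sketch of that idea follows. The names token_logprob, ppl_estimate, and estimate_ppl are hypothetical, not functions from perplexity.cpp, and the error term (standard error of the mean NLL, scaled by PPL) is an assumption about how a "+/-" figure could be derived from the two accumulators.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Log-probability of token `tok` under softmax(logits), using the
// max-subtraction trick so the exponentials cannot overflow.
double token_logprob(const std::vector<float> & logits, int tok) {
    assert(tok >= 0 && tok < (int) logits.size());
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum_exp = 0.0;
    for (float l : logits) {
        sum_exp += std::exp(double(l) - max_logit);
    }
    return double(logits[tok]) - max_logit - std::log(sum_exp);
}

struct ppl_estimate {
    double ppl;  // exp(mean NLL)
    double err;  // assumed uncertainty: stderr of mean NLL times ppl
};

// Hypothetical accumulator logic: nll sums -log p, nll2 sums (log p)^2.
ppl_estimate estimate_ppl(const std::vector<double> & logprobs) {
    assert(logprobs.size() > 1);
    double nll = 0.0, nll2 = 0.0;
    for (double lp : logprobs) {
        nll  += -lp;
        nll2 += lp * lp;
    }
    const double count = double(logprobs.size());
    nll /= count;
    nll2 = nll2 / count - nll * nll;  // variance of the per-token NLL
    const double ppl = std::exp(nll);
    const double err = nll2 > 0.0 ? std::sqrt(nll2 / (count - 1.0)) * ppl : 0.0;
    return { ppl, err };
}
```

Keeping both the sum and the sum of squares means the uncertainty can be reported after every chunk without storing per-token values.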

HellaSwag core loop (perplexity.cpp:876-1007, abbreviated):

for (size_t i0 = 0; i0 < hs_task_count; i0++) {
    // Batch multiple tasks, sharing common prefix with 4 sequences each
    // Each ending gets its own sequence ID: s0+0, s0+1, s0+2, s0+3
    // ...

    // Decode all batched tasks
    decode_helper(ctx, batch, batch_logits, n_batch, n_vocab);

    // Compute normalized log-probability for each ending
    for (size_t i = i0; i < i1; ++i) {
        for (int s = 0; s < 4; ++s) {
            hs_cur.ending_logprob[s] /= hs_cur.ending_logprob_count[s];
        }

        // Select ending with maximum normalized log-prob
        // Compare with gold ending, accumulate accuracy
        if (ending_logprob_max_idx == hs_cur.gold_ending_idx) {
            acc += 1.0;
        }

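        // Wilson score confidence interval
        // (freq is the running accuracy acc / (i + 1); za is the two-sided 95% normal quantile, about 1.96)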
        double z   = za * za / double(i + 1);
        double cnf = z * sqrt(double(i+1) * (4.0*freq*(1-freq) + z)) / (za + za);
        double a   = (freq + z*0.5 - cnf) / (1.0 + z);
        double b   = (freq + z*0.5 + cnf) / (1.0 + z);
    }
}
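
The selection step in the loop above (divide each ending's summed log-probability by its token count, then take the maximum) can be sketched in isolation. The struct ending_scores and the function best_ending are hypothetical stand-ins for the per-task state, not types from perplexity.cpp.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the per-task state in the HellaSwag loop.
struct ending_scores {
    std::array<double, 4> logprob_sum;  // summed log-probs of each ending's tokens
    std::array<size_t, 4> token_count;  // number of tokens scored per ending
};

// Index of the ending with the highest mean per-token log-probability.
size_t best_ending(const ending_scores & s) {
    size_t best = 0;
    double best_score = 0.0;
    for (size_t k = 0; k < 4; ++k) {
        assert(s.token_count[k] > 0);
        const double score = s.logprob_sum[k] / double(s.token_count[k]);
        if (k == 0 || score > best_score) {
            best = k;
            best_score = score;
        }
    }
    return best;
}
```

Normalizing by length matters because a longer ending accumulates more negative log-probability simply by having more tokens; comparing raw sums would favor short endings.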

I/O Contract

perplexity():

Direction | Name | Type | Description
Input | ctx | llama_context * | Inference context with loaded model
Input | params | const common_params & | Parameters including prompt (loaded dataset text), n_chunks, n_batch, ppl_stride, logits_file
Input | n_ctx | int32_t | Context window size for evaluation
Output | (return) | results_perplexity | Contains tokens, final PPL value, per-token logits, and per-token probabilities
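
To make the relationship between n_ctx, n_batch, and the loop counters more concrete, here is a small sketch of the arithmetic. The struct chunk_plan and the function plan_chunks are hypothetical. The values of first and the per-window scored count follow the core loop excerpt; the formulas for n_chunk (token count divided by n_ctx) and n_seq (n_batch divided by n_ctx, at least 1) are assumptions that are consistent with the n_seq=4 reported for a 512-token context and a 2048-token batch.

```cpp
#include <algorithm>
#include <cassert>

struct chunk_plan {
    int n_chunk;  // full context windows available in the dataset (assumed formula)
    int n_seq;    // windows decoded together per outer-loop step (assumed formula)
    int first;    // first scored position inside each window
    int scored;   // tokens that contribute to the NLL per window
};

// Hypothetical helper, not part of perplexity.cpp.
chunk_plan plan_chunks(int n_tokens, int n_ctx, int n_batch) {
    assert(n_tokens >= 0 && n_ctx > 1 && n_batch > 0);
    chunk_plan p;
    p.n_chunk = n_tokens / n_ctx;
    p.n_seq   = std::max(1, n_batch / n_ctx);
    p.first   = n_ctx / 2;               // score only the second half of the window
    p.scored  = n_ctx - p.first - 1;     // matches count += n_ctx - first - 1
    return p;
}
```

With n_ctx = 512, each window contributes 255 scored tokens, so every scored token has at least 256 tokens of preceding context.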

hellaswag_score():

Direction | Name | Type | Description
Input | ctx | llama_context * | Inference context with loaded model
Input | params | const common_params & | Parameters including prompt (loaded HellaSwag data), hellaswag_tasks, n_batch
Output | (stdout) | text | Running acc_norm after each task, with a 95% Wilson score confidence interval
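
The Wilson score interval reported on stdout can be reproduced outside the tool. The sketch below follows the same algebra as the HellaSwag loop excerpt; the names interval and wilson_95 are hypothetical, and the constant za is the two-sided 95% normal quantile.

```cpp
#include <cassert>
#include <cmath>

struct interval {
    double lo;
    double hi;
};

// Wilson score interval for `correct` successes out of `n` trials.
// Hypothetical helper; mirrors the formula in the HellaSwag loop excerpt.
interval wilson_95(double correct, double n) {
    assert(n > 0.0 && correct >= 0.0 && correct <= n);
    const double za   = 1.95996398454;  // two-sided 95% normal quantile
    const double freq = correct / n;    // observed accuracy
    const double z    = za * za / n;
    const double cnf  = z * std::sqrt(n * (4.0 * freq * (1.0 - freq) + z)) / (za + za);
    return { (freq + z * 0.5 - cnf) / (1.0 + z),
             (freq + z * 0.5 + cnf) / (1.0 + z) };
}
```

Unlike the simple normal approximation, this interval stays inside [0, 1] even for very small task counts: one correct answer out of one task gives roughly [20.65%, 100%] rather than a zero-width interval.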

winogrande_score():

Direction | Name | Type | Description
Input | ctx | llama_context * | Inference context with loaded model
Input | params | const common_params & | Parameters including prompt (loaded Winogrande data), winogrande_tasks, n_batch
Output | (stdout) | text | Per-task accuracy and confidence intervals

Usage Examples

Example 1: Standard perplexity evaluation

# Compute perplexity over WikiText-2 with 512 context window
./llama-perplexity -m model.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    --ctx-size 512 \
    --batch-size 2048 \
    -ngl 99

# Output (illustrative values):
# perplexity: calculating perplexity over 114 chunks, n_ctx=512, batch_size=2048, n_seq=4
# [1]8.2134,[2]7.9872,...,[114]6.3215
# Final estimate: PPL = 6.3215 +/- 0.04821

Example 2: HellaSwag evaluation

./llama-perplexity -m model.gguf \
    -f hellaswag_val_full.txt \
    --hellaswag \
    --hellaswag-tasks 10042 \
    -ngl 99

# Output (illustrative values):
# task  acc_norm  95% confidence interval
# 1     100.00000000%  [20.6549%, 100.0000%]
# 2     50.00000000%   [9.4531%, 90.5469%]
# ...
# 10042 76.12345678%   [75.2891%, 76.9578%]

Example 3: Winogrande evaluation

./llama-perplexity -m model.gguf \
    -f winogrande-debiased-eval.csv \
    --winogrande \
    --winogrande-tasks 1267 \
    -ngl 99
