Implementation:Ggml org Llama cpp Perplexity Evaluation

From Leeroopedia
Field | Value
Implementation Name | Perplexity Evaluation
Doc Type | Wrapper Doc
Topic | Model Quantization
Workflow | Model_Quantization
Category | Quality Evaluation
Repository | Ggml_org_Llama_cpp

Overview

Description

The perplexity evaluation module implements two primary model quality assessment functions: perplexity() for measuring language modeling quality on raw text corpora, and hellaswag_score() for evaluating common-sense reasoning accuracy. These functions are used after quantization to validate that the quantized model maintains acceptable quality relative to the full-precision baseline.

The perplexity() function tokenizes a test corpus, processes it in context-sized chunks with batched decoding, and computes the exponential of the average negative log-likelihood. The hellaswag_score() function evaluates the model on the HellaSwag benchmark by comparing log-probabilities of candidate continuations and reporting accuracy with Wilson score confidence intervals.
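
For reference, "the exponential of the average negative log-likelihood" corresponds to the standard textbook definition of perplexity. The symbols N (number of scored tokens) and x_i (the i-th token) are notation introduced here for illustration and do not appear in the source code:

```latex
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)
```

A model that assigns every token a probability of 1/V over a vocabulary of size V has a perplexity of exactly V, which gives a rough intuition for the scale of the reported values.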

Usage

Both functions are called by the llama-perplexity command-line tool, which loads a model and test data, then dispatches to the appropriate evaluation function based on the command-line arguments.

Code Reference

Source Location

  • perplexity function: tools/perplexity/perplexity.cpp (lines 441-659)
  • hellaswag_score function: tools/perplexity/perplexity.cpp (lines 741-1012)
  • results struct: tools/perplexity/perplexity.cpp (lines 25-30)

Signature

struct results_perplexity {
    std::vector<llama_token> tokens;
    double                   ppl_value;
    std::vector<float>       logits;
    std::vector<float>       probs;
};

// Compute perplexity over a text corpus
static results_perplexity perplexity(
    llama_context * ctx,
    const common_params & params,
    const int32_t n_ctx);

// Compute HellaSwag accuracy score
static void hellaswag_score(
    llama_context * ctx,
    const common_params & params);

Import

#include "common.h"
#include "llama.h"

I/O Contract

perplexity()

Direction | Type | Description
Input (ctx) | llama_context * | Initialized llama context with a loaded model
Input (params) | const common_params & | Runtime parameters including: prompt (the test text), n_ctx (context size), n_batch (batch size), n_chunks (max chunks to evaluate, -1 for all), ppl_stride (stride for v2 mode), ppl_output_type (output format), logits_file (optional logits dump path)
Input (n_ctx) | int32_t | Context window size used for chunking the test corpus
Output | results_perplexity | Struct containing: tokenized input, computed perplexity value, per-token logit history, per-token probability history
Side Effect | stdout | Prints per-chunk perplexity progress and final estimate with confidence interval

Processing steps:

  1. Tokenizes the input text with BOS token handling
  2. Validates that the corpus has at least 2 * n_ctx tokens
  3. Divides tokens into n_ctx-sized chunks
  4. For each chunk: clears KV cache, decodes in batches, collects logits for the second half of the context window
  5. Computes per-token negative log-likelihood using softmax over vocabulary logits
  6. Accumulates total NLL and NLL-squared for mean and standard deviation
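  7. Returns exp(mean_nll) as the perplexity value and reports an uncertainty of +/- err * ppl, where err is the standard error of the mean NLL derived from the accumulated NLL and NLL-squared sums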
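
Steps 5 to 7 can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the code from perplexity.cpp: the names token_nll, estimate_ppl, and ppl_estimate are hypothetical, and the error term uses the usual standard-error-of-the-mean formula, which may differ in detail from what the tool does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: negative log-likelihood of `target` under a softmax
// over `logits`. Subtracting the max logit keeps exp() from overflowing.
inline double token_nll(const std::vector<float> & logits, int target) {
    assert(!logits.empty());
    assert(target >= 0 && target < int(logits.size()));
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum_exp = 0.0;
    for (const float l : logits) {
        sum_exp += std::exp(double(l) - max_logit);
    }
    return -(double(logits[target]) - max_logit - std::log(sum_exp));
}

// Hypothetical result type: perplexity plus the +/- term.
struct ppl_estimate {
    double ppl;
    double err;
};

// Accumulates NLL and NLL^2, then converts them to a perplexity and an
// uncertainty (standard error of the mean NLL, scaled by the perplexity).
inline ppl_estimate estimate_ppl(const std::vector<double> & nlls) {
    assert(!nlls.empty());
    double sum = 0.0;
    double sum2 = 0.0;
    for (const double v : nlls) {
        sum  += v;
        sum2 += v * v;
    }
    const double n    = double(nlls.size());
    const double mean = sum / n;
    const double var  = sum2 / n - mean * mean;
    const double ppl  = std::exp(mean);
    const double sem  = (n > 1 && var > 0) ? std::sqrt(var / (n - 1)) : 0.0;
    return { ppl, sem * ppl };
}
```

Feeding estimate_ppl the NLL values of tokens drawn from a uniform 4-word vocabulary yields a perplexity of 4 with a near-zero error term, which is a quick sanity check for the arithmetic.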

hellaswag_score()

Direction | Type | Description
Input (ctx) | llama_context * | Initialized llama context with a loaded model
Input (params) | const common_params & | Runtime parameters including: prompt (HellaSwag dataset in 6-lines-per-task format), hellaswag_tasks (number of tasks to evaluate), n_batch (batch size)
Output | void | No return value; results are printed to stdout
Side Effect | stdout | Prints the running accuracy after each task with Wilson score 95% confidence intervals

Processing steps:

  1. Parses the prompt into HellaSwag tasks (6 lines per task: context, gold label, 4 endings)
  2. Optionally randomizes task order (deterministic seed = 1)
  3. Tokenizes each task's context + ending combinations, computing common prefixes
  4. Batches multiple tasks into a single decode pass, sharing common prefix tokens across 4 sequence IDs
  5. Computes normalized log-probabilities for each ending
  6. Selects the ending with the highest average log-probability as the prediction
  7. Tracks accuracy and computes Wilson score confidence intervals
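
Steps 5 and 6 can be illustrated with a small sketch. It assumes the per-token log-probabilities of each ending are already available; the helper names mean_logprob and pick_ending are hypothetical and are not taken from perplexity.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: average log-probability over one ending's tokens,
// so that long endings are not penalized simply for having more tokens.
inline double mean_logprob(const std::vector<double> & token_logprobs) {
    assert(!token_logprobs.empty());
    double sum = 0.0;
    for (const double lp : token_logprobs) {
        sum += lp;
    }
    return sum / double(token_logprobs.size());
}

// Returns the index of the ending with the highest length-normalized score.
inline std::size_t pick_ending(const std::vector<std::vector<double>> & endings) {
    assert(!endings.empty());
    std::size_t best = 0;
    double best_score = mean_logprob(endings[0]);
    for (std::size_t i = 1; i < endings.size(); ++i) {
        const double score = mean_logprob(endings[i]);
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```

Normalizing by length matters: a four-token ending at -0.5 per token has a lower total log-probability than a one-token ending at -0.6, yet it wins on the average and is the one selected here.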

Usage Examples

Example 1: Measure perplexity on WikiText-2

# Assumes the WikiText-2 test set has already been downloaded to wikitext-2-raw/
# Run perplexity evaluation on the first 100 chunks
./llama-perplexity \
    -m model-q4_k_m.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512 \
    --chunks 100

Example output (values are illustrative and depend on the model):

perplexity: calculating perplexity over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
[1]6.8234,[2]7.1023,...,[100]6.9456
Final estimate: PPL = 6.9456 +/- 0.03211

Example 2: Compare quantized vs baseline perplexity

# Baseline (F16)
./llama-perplexity -m model-f16.gguf -f wiki.test.raw
# Output: PPL = 6.2100

# Quantized (Q4_K_M)
./llama-perplexity -m model-q4_k_m.gguf -f wiki.test.raw
# Output: PPL = 6.3854
# Delta: +0.1754
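
When comparing runs, it can help to report the relative increase alongside the absolute delta. The sketch below shows the arithmetic; compare_ppl and ppl_delta are hypothetical names introduced for illustration, not part of the tool.

```cpp
#include <cassert>

// Hypothetical result type: absolute and relative perplexity change.
struct ppl_delta {
    double absolute;      // quantized - baseline
    double relative_pct;  // change as a percentage of the baseline
};

// Compares a quantized model's perplexity against the baseline's.
inline ppl_delta compare_ppl(double baseline, double quantized) {
    assert(baseline > 0.0);
    const double diff = quantized - baseline;
    return { diff, 100.0 * diff / baseline };
}
```

With the illustrative values above, compare_ppl(6.2100, 6.3854) gives an absolute delta of 0.1754 and a relative increase of about 2.82%.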

Example 3: HellaSwag evaluation

./llama-perplexity \
    -m model-q4_k_m.gguf \
    --hellaswag \
    --hellaswag-tasks 400 \
    -f hellaswag_val_full.txt

Example output (running accuracy after each task; values are illustrative and depend on the model):

task    acc_norm    95% confidence interval
1       100.00000000%   [20.2442%, 100.0000%]
2       50.00000000%    [14.5765%, 85.4235%]
...
400     78.50000000%    [74.1234%, 82.3456%]
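
The confidence interval column can be approximated with the textbook Wilson score interval, sketched below. This is an assumption-laden illustration: wilson_interval and interval are hypothetical names, z = 1.96 is the usual 95% constant, and the tool's exact constants or rounding may differ, so the numbers need not match the output above digit for digit.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: lower and upper bound as fractions in [0, 1].
struct interval {
    double lo;
    double hi;
};

// Textbook Wilson score interval for `correct` successes out of `total`.
inline interval wilson_interval(int correct, int total, double z = 1.96) {
    assert(total > 0 && correct >= 0 && correct <= total);
    const double n      = double(total);
    const double p      = double(correct) / n;
    const double z2     = z * z;
    const double denom  = 1.0 + z2 / n;
    const double center = (p + z2 / (2.0 * n)) / denom;
    const double half   = z * std::sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom;
    return { center - half, center + half };
}
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and remains wide after only a few tasks, which is why the first rows of the output show very broad bounds.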

Example 4: Save logits for detailed analysis

./llama-perplexity \
    -m model-q4_k_m.gguf \
    -f wiki.test.raw \
    --kl-divergence-base logits-q4km.bin
