Principle:Ggml org Llama cpp Evaluation Dataset Acquisition

Aspect	Detail
Principle Name	Evaluation Dataset Acquisition
Domain	Model Perplexity Evaluation
Scope	Standard evaluation datasets for language model quality assessment
Related Workflow	Model_Perplexity_Evaluation

Overview

Description

Language model evaluation requires standardized benchmark datasets that provide a consistent basis for measuring model quality. llama.cpp supports three primary evaluation benchmarks: WikiText-2 for perplexity measurement, HellaSwag for commonsense reasoning accuracy, and Winogrande for coreference resolution accuracy. Each dataset must be downloaded and formatted before it can be used with the perplexity evaluation tool.

Usage

Dataset acquisition is the first step in any model evaluation workflow. The user runs a provided shell script to download the appropriate dataset file, which is then passed to the llama-perplexity tool via the -f flag. Different datasets are used for different evaluation modes:

WikiText-2: Used for standard perplexity (PPL) computation and KL divergence analysis
HellaSwag: Used with the --hellaswag flag for commonsense reasoning evaluation
Winogrande: Used with the --winogrande flag for coreference resolution evaluation
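
The mapping from dataset to evaluation mode can be sketched as a small helper that assembles the command line. This is an illustrative sketch, not part of llama.cpp: the function name, the mode labels, and the file paths in the usage note are hypothetical, and the -m flag for the model path is an assumption not stated above (only -f, --hellaswag, and --winogrande are).

```python
from typing import Dict, List


def build_perplexity_command(
    model_path: str,
    dataset_path: str,
    mode: str = "perplexity",
    binary: str = "llama-perplexity",
) -> List[str]:
    """Return an argv list for one evaluation mode (hypothetical helper)."""
    # Extra flags per mode; plain perplexity needs none beyond -f.
    mode_flags: Dict[str, List[str]] = {
        "perplexity": [],
        "hellaswag": ["--hellaswag"],
        "winogrande": ["--winogrande"],
    }
    if mode not in mode_flags:
        raise ValueError(f"unknown mode: {mode!r}")
    # -m (model path) is an assumption; -f carries the dataset file.
    return [binary, "-m", model_path, "-f", dataset_path] + mode_flags[mode]
```

The returned list can be handed to subprocess.run(cmd, check=True) once the dataset file has been downloaded, for example build_perplexity_command("model.gguf", "wiki.test.raw") with placeholder paths.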

Theoretical Basis

WikiText-2:

WikiText-2 is a language modeling benchmark derived from verified Good and Featured articles on Wikipedia. The full dataset contains approximately 2 million tokens of natural language text, most of them in the training split; evaluation uses the smaller test split. The dataset is used to compute perplexity, which measures how well the model predicts the next token in a sequence: it is the exponential of the average negative log-likelihood per token, so lower perplexity indicates better language modeling quality. The raw version (wiki.test.raw) is preferred because it preserves the original text without the tokenization artifacts of the pre-tokenized release, allowing llama.cpp's own tokenizer to process it directly.
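
As a minimal sketch of the perplexity definition, the function below takes the natural-log probability that a model assigned to each observed token and returns the exponential of the mean negative log-likelihood. The function name and input layout are illustrative assumptions; llama.cpp computes the same quantity internally over chunks of the tokenized file.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 1/4 to every observed token has a perplexity of 4, which matches the intuition of choosing uniformly among four candidates at each step.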

HellaSwag:

HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a benchmark for evaluating commonsense reasoning. Each task presents a context (an activity description) and four possible continuations, only one of which is correct (the "gold" ending). The model must assign the highest normalized log-probability to the correct continuation. The dataset contains 10,042 validation tasks. The evaluation computes acc_norm (accuracy with normalized log-probabilities) along with 95% confidence intervals using Wilson score intervals.
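
The two steps described above, picking an ending by normalized log-probability and attaching a Wilson score interval to the resulting accuracy, can be sketched as follows. This is a sketch under stated assumptions: normalizing by the number of tokens in each ending is one common choice and may differ from what llama.cpp does, z = 1.96 is the usual value for a 95% interval, and both function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pick_ending(ending_log_probs: Sequence[float], ending_lengths: Sequence[int]) -> int:
    """Index of the ending with the highest length-normalized log-probability."""
    # Assumption: normalize each summed log-probability by the ending's token count.
    normalized = [lp / n for lp, n in zip(ending_log_probs, ending_lengths)]
    return max(range(len(normalized)), key=normalized.__getitem__)


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion correct / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half
```

A task is counted as correct when pick_ending returns the gold index; acc_norm is then correct / total, and wilson_interval(correct, total) gives the bounds around it. Unlike the simple normal approximation, the Wilson interval stays within [0, 1] even when accuracy is near 0 or 1.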

Winogrande:

Winogrande is a benchmark for commonsense coreference resolution, inspired by the original Winograd Schema Challenge. Each task presents a sentence with a blank and two possible fill-in options. The model must assign higher probability to the correct option. This tests the model's ability to understand context-dependent pronoun resolution and world knowledge.
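
The comparison between the two options can be sketched as below. Everything here is illustrative: the "_" blank marker, the helper names, and the score callable (standing in for whatever log-probability the model assigns to a completed sentence) are assumptions, and the exact scoring llama.cpp applies to each candidate is not described here.

```python
from typing import Callable


def fill_blank(sentence: str, option: str, blank: str = "_") -> str:
    """Substitute one candidate into the single blank of a task sentence."""
    if sentence.count(blank) != 1:
        raise ValueError("expected exactly one blank in the sentence")
    return sentence.replace(blank, option)


def choose_option(
    sentence: str,
    option1: str,
    option2: str,
    score: Callable[[str], float],
) -> int:
    """Return 1 or 2 for whichever filled-in sentence scores higher."""
    s1 = score(fill_blank(sentence, option1))
    s2 = score(fill_blank(sentence, option2))
    # Ties go to option 1 in this sketch.
    return 1 if s1 >= s2 else 2
```

Accuracy is the fraction of tasks for which the returned index matches the labeled answer.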

Why standardized datasets matter:

Reproducibility: Using the same datasets allows comparison across different models, quantization levels, and inference engines
Statistical validity: Large datasets provide sufficient sample sizes for meaningful confidence intervals
Community alignment: These benchmarks are widely used in the LLM community (e.g., by EleutherAI's lm-evaluation-harness), enabling cross-framework comparison

The llama.cpp evaluation pipeline preprocesses the HellaSwag data into a 6-lines-per-task format:

Context string (activity_label: ctx)
Gold ending index (0-3)
Ending 0
Ending 1
Ending 2
Ending 3
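
A reader for this layout can be sketched in a few lines. The class and function names are hypothetical, and the sketch assumes the file contains nothing but complete six-line tasks with no blank separator lines.

```python
from dataclasses import dataclass
from typing import List

LINES_PER_TASK = 6  # context, gold index, and four endings


@dataclass
class HellaSwagTask:
    context: str
    gold: int
    endings: List[str]


def parse_hellaswag(text: str) -> List[HellaSwagTask]:
    """Split preprocessed HellaSwag text into tasks of six lines each."""
    lines = text.splitlines()
    if len(lines) % LINES_PER_TASK != 0:
        raise ValueError("line count is not a multiple of 6")
    tasks = []
    for start in range(0, len(lines), LINES_PER_TASK):
        context, gold_str, *endings = lines[start:start + LINES_PER_TASK]
        gold = int(gold_str)
        if not 0 <= gold <= 3:
            raise ValueError(f"gold index out of range: {gold}")
        tasks.append(HellaSwagTask(context, gold, endings))
    return tasks
```

Validating the line count and the gold index up front catches a truncated or misaligned download before any model time is spent on it.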
