Principle:Ggml org Llama cpp Evaluation Dataset Acquisition
| Aspect | Detail |
|---|---|
| Principle Name | Evaluation Dataset Acquisition |
| Domain | Model Perplexity Evaluation |
| Scope | Standard evaluation datasets for language model quality assessment |
| Related Workflow | Model_Perplexity_Evaluation |
Overview
Description
Language model evaluation requires standardized benchmark datasets that provide a consistent basis for measuring model quality. llama.cpp supports three primary evaluation benchmarks: WikiText-2 for perplexity measurement, HellaSwag for commonsense reasoning accuracy, and Winogrande for coreference resolution accuracy. Each dataset must be downloaded and formatted before it can be used with the perplexity evaluation tool.
Usage
Dataset acquisition is the first step in any model evaluation workflow. The user runs a provided shell script to download the appropriate dataset file, which is then passed to the llama-perplexity tool via the -f flag. Different datasets are used for different evaluation modes:
- WikiText-2: Used for standard perplexity (PPL) computation and KL divergence analysis
- HellaSwag: Used with the --hellaswag flag for commonsense reasoning evaluation
- Winogrande: Used with the --winogrande flag for coreference resolution evaluation
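The mapping from dataset to evaluation mode can be sketched as a small helper that assembles the argument list for llama-perplexity. This is an illustrative sketch only: the -f, --hellaswag and --winogrande flags come from the text above, while the -m model flag, the helper name build_perplexity_command, and the file names in the usage example are assumptions made for the example.

```python
from typing import List, Optional


def build_perplexity_command(
    model_path: str,
    dataset_path: str,
    mode: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for llama-perplexity (hypothetical helper).

    mode=None        -> plain perplexity run (e.g. on WikiText-2)
    mode="hellaswag" -> adds --hellaswag
    mode="winogrande"-> adds --winogrande
    """
    # -m (model file) is assumed here; -f (dataset file) is named in the text.
    argv = ["llama-perplexity", "-m", model_path, "-f", dataset_path]
    if mode is None:
        return argv
    if mode not in ("hellaswag", "winogrande"):
        raise ValueError(f"unknown evaluation mode: {mode!r}")
    argv.append(f"--{mode}")
    return argv


if __name__ == "__main__":
    # Placeholder paths; substitute the files produced by the download scripts.
    print(" ".join(build_perplexity_command("model.gguf", "wiki.test.raw")))
    print(" ".join(build_perplexity_command("model.gguf", "hellaswag.txt", "hellaswag")))
```

The returned list can be passed to subprocess.run unchanged, which avoids shell quoting problems when paths contain spaces.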
Theoretical Basis
WikiText-2:
WikiText-2 is a language modeling benchmark derived from verified Good and Featured articles on Wikipedia. It contains approximately 2 million tokens of natural language text. The dataset is used to compute perplexity, which measures how well the model predicts the next token in a sequence. Lower perplexity indicates better language modeling quality. The raw version (wiki.test.raw) is preferred because it preserves the original text without tokenization artifacts, allowing llama.cpp's own tokenizer to process it directly.
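As a minimal sketch of the quantity being measured, perplexity is the exponential of the mean negative log-likelihood per token. The function below assumes the per-token natural-log probabilities have already been obtained from a model; it does not reproduce how llama.cpp chunks the text or batches evaluation.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of each observed token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every observed token
    # is as uncertain as a uniform choice among four options.
    print(perplexity([math.log(0.25)] * 4))
```

A perplexity of 1 would mean every token was predicted with certainty; larger values mean the model spreads probability over more alternatives.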
HellaSwag:
HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a benchmark for evaluating commonsense reasoning. Each task presents a context (an activity description) and four possible continuations, only one of which is correct (the "gold" ending). The model must assign the highest normalized log-probability to the correct continuation. The dataset contains 10,042 validation tasks. The evaluation computes acc_norm (accuracy with normalized log-probabilities) along with 95% confidence intervals using Wilson score intervals.
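The scoring rule and the confidence interval can be sketched as follows. The sketch assumes that "normalized" means the summed log-probability of an ending divided by its token count; other harnesses normalize differently (for example by byte length), so treat this as one possible reading. The Wilson score interval uses the standard formula with z = 1.96 for a 95% interval.

```python
import math
from typing import Sequence, Tuple


def pick_ending(ending_logprobs: Sequence[Sequence[float]]) -> int:
    """Index of the ending with the highest mean per-token log-probability.

    Assumption: length normalization divides by the number of tokens.
    """
    scores = [sum(lp) / len(lp) for lp in ending_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Ending 1 has the lowest summed log-prob but the best per-token average.
    endings = [[-1.0], [-0.6, -0.6, -0.6], [-2.0, -2.0], [-1.5, -1.5]]
    print(pick_ending(endings))
    print(wilson_interval(50, 100))
```

Normalizing by length keeps longer endings from being penalized simply for containing more tokens, which is why the choice in the example differs from what the raw sums would give.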
Winogrande:
Winogrande is a benchmark for commonsense coreference resolution, inspired by the original Winograd Schema Challenge. Each task presents a sentence with a blank and two possible fill-in options. The model must assign higher probability to the correct option. This tests the model's ability to understand context-dependent pronoun resolution and world knowledge.
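The two-way choice can be illustrated with a short sketch. It assumes the blank is written as an underscore and that some scoring function returns a log-probability-like score for a completed sentence; the names fill_blank, choose_option and the toy scorer are hypothetical and do not describe how llama.cpp scores the options internally.

```python
from typing import Callable


def fill_blank(sentence: str, option: str, blank: str = "_") -> str:
    """Substitute the single blank marker in the sentence with an option."""
    if sentence.count(blank) != 1:
        raise ValueError("sentence must contain exactly one blank")
    return sentence.replace(blank, option)


def choose_option(
    sentence: str,
    option1: str,
    option2: str,
    score: Callable[[str], float],
) -> int:
    """Return 1 or 2, whichever option yields the higher-scoring sentence."""
    s1 = score(fill_blank(sentence, option1))
    s2 = score(fill_blank(sentence, option2))
    return 1 if s1 >= s2 else 2


if __name__ == "__main__":
    # Toy scorer standing in for a model: shorter text scores higher.
    toy_score = lambda text: -float(len(text))
    sentence = "The trophy did not fit in the suitcase because the _ was too big."
    print(choose_option(sentence, "trophy", "suitcase", toy_score))
```

In a real evaluation the scorer would be the model's log-probability for the completed sentence (or for the part following the blank), and accuracy is the fraction of tasks where the chosen option matches the labeled answer.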
Why standardized datasets matter:
- Reproducibility: Using the same datasets allows comparison across different models, quantization levels, and inference engines
- Statistical validity: Large datasets provide sufficient sample sizes for meaningful confidence intervals
- Community alignment: These benchmarks are widely used in the LLM community (e.g., by EleutherAI's lm-evaluation-harness), enabling cross-framework comparison
The llama.cpp evaluation pipeline preprocesses the HellaSwag data into a 6-lines-per-task format:
- Context string (activity_label: ctx)
- Gold ending index (0-3)
- Ending 0
- Ending 1
- Ending 2
- Ending 3
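A reader for the 6-lines-per-task layout listed above could look like the sketch below. The HellaSwagTask class and parse_tasks function are hypothetical names for illustration, not part of llama.cpp, and the sample text is made up.

```python
from dataclasses import dataclass
from typing import List

LINES_PER_TASK = 6  # context, gold index, four endings


@dataclass
class HellaSwagTask:
    context: str
    gold: int
    endings: List[str]


def parse_tasks(text: str) -> List[HellaSwagTask]:
    """Split preprocessed text into tasks of six consecutive lines each."""
    lines = text.splitlines()
    if len(lines) % LINES_PER_TASK != 0:
        raise ValueError("line count is not a multiple of 6")
    tasks = []
    for start in range(0, len(lines), LINES_PER_TASK):
        context, gold_line, *endings = lines[start:start + LINES_PER_TASK]
        gold = int(gold_line)
        if not 0 <= gold <= 3:
            raise ValueError(f"gold index out of range: {gold}")
        tasks.append(HellaSwagTask(context, gold, endings))
    return tasks


if __name__ == "__main__":
    sample = "\n".join([
        "Baking: A person cracks two eggs into a bowl.",
        "2",
        "then throws the bowl away.",
        "paints the bowl blue.",
        "whisks the eggs with a fork.",
        "reads a newspaper.",
    ])
    task = parse_tasks(sample)[0]
    print(task.gold, task.endings[task.gold])
```

Validating the line count and gold index up front makes a truncated or misaligned download fail immediately rather than silently shifting every later task.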