Principle:Ggml org Llama cpp Evaluation Parameter Configuration
| Aspect | Detail |
|---|---|
| Principle Name | Evaluation Parameter Configuration |
| Domain | Model Perplexity Evaluation |
| Scope | Configuring evaluation parameters: chunk size, task selection, stride |
| Related Workflow | Model_Perplexity_Evaluation |
Overview
Description
Perplexity evaluation in llama.cpp is highly configurable through command-line arguments that control evaluation mode, dataset processing parameters, and output format. Proper parameter configuration is essential for obtaining meaningful, reproducible evaluation results. The parameters determine which benchmark is run, how much of the dataset is processed, and how the results are computed and reported.
Usage
Parameters are specified as command-line arguments to the llama-perplexity tool. They are parsed by the common argument parser and stored in a common_params structure. The evaluation mode is selected by mutually exclusive flags (--hellaswag, --winogrande, --multiple-choice, --kl-divergence), with standard perplexity being the default when none of these flags is set.
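As a rough illustration of the mode selection described above, the sketch below assembles an argument list for llama-perplexity. The helper `build_args` and the file names are hypothetical; the mode flags are the ones named in the text, and `-m` / `-f` are assumed to be the usual model and input-file options.

```python
# Hypothetical helper: builds an argument list for llama-perplexity.
# Mode flags are taken from the text; paths are placeholders.
MODE_FLAGS = {
    "perplexity": None,                     # default mode: no mode flag at all
    "hellaswag": "--hellaswag",
    "winogrande": "--winogrande",
    "multiple-choice": "--multiple-choice",
    "kl-divergence": "--kl-divergence",     # also needs a reference logits file
}

def build_args(model, data_file, mode="perplexity"):
    """Return an argv list selecting exactly one evaluation mode."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown evaluation mode: {mode}")
    args = ["llama-perplexity", "-m", model, "-f", data_file]
    flag = MODE_FLAGS[mode]
    if flag is not None:
        args.append(flag)
    return args

print(build_args("model.gguf", "wiki.test.raw"))
print(build_args("model.gguf", "hellaswag_val.txt", mode="hellaswag"))
```

Taking the mode as a single value rather than several booleans makes it impossible to request two modes at once, which mirrors the mutually exclusive flags.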
Theoretical Basis
Evaluation Mode Selection:
The choice of evaluation mode determines what quality metric is computed:
- Standard perplexity (default): Measures how well the model predicts the next token in natural text. This is the most common evaluation metric for language models.
- HellaSwag: Measures commonsense reasoning by evaluating the model's ability to choose the most plausible continuation of a story.
- Winogrande: Measures coreference resolution ability by testing context-dependent word choice.
- Multiple choice: A generalized multiple-choice evaluation format.
- KL divergence: Measures how much a quantized model's output distribution diverges from a reference (typically FP16) model's distribution.
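The two distribution-based metrics in the list above can be sketched numerically. This is a minimal illustration of the standard definitions (perplexity as the exponential of the mean negative log-likelihood, and KL divergence between two softmax distributions), not the tool's actual implementation; the function names are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the predicted tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) at a single token position."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A model that assigns probability 0.25 to every correct token
# has a perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))

# Identical logits give zero divergence; perturbed logits give a positive value.
print(kl_divergence([2.0, 1.0, 0.0], [2.0, 1.0, 0.0]))
print(kl_divergence([2.0, 1.0, 0.0], [1.5, 1.2, 0.1]))
```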
Context Size (--ctx-size):
The context window size directly affects perplexity measurements. A larger context window gives the model more prior tokens to condition on, generally resulting in lower (better) perplexity. The perplexity tool defaults to 512 tokens, which matches common benchmarking practice; results are only comparable between runs that use the same context size.
Batch Size (--batch-size):
The batch size controls how many tokens are processed per llama_decode() call. Larger batch sizes improve GPU utilization but require more memory. For perplexity evaluation, the batch size also determines the number of sequences evaluated in parallel: n_seq = max(1, n_batch / n_ctx), using integer division, so a batch smaller than the context still runs one sequence.
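The relationship between batch size and parallel sequences can be checked with a few values. The function name below is hypothetical, and the sketch assumes the division is an integer division.

```python
def parallel_sequences(n_batch, n_ctx):
    """n_seq = max(1, n_batch / n_ctx), with integer division assumed."""
    return max(1, n_batch // n_ctx)

print(parallel_sequences(2048, 512))  # -> 4
print(parallel_sequences(512, 512))   # -> 1
print(parallel_sequences(256, 512))   # -> 1 (never drops below one sequence)
```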
Chunk Count (--chunks):
For perplexity evaluation, the dataset is divided into fixed-size chunks of n_ctx tokens. The --chunks parameter limits how many chunks are processed. Using -1 (default) processes all available chunks. Limiting chunks is useful for quick preliminary evaluations.
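A small sketch of how the chunk limit might interact with the dataset size. The function name is hypothetical, and clamping the request to the number of available chunks is an assumption of this sketch rather than something stated above.

```python
def chunks_to_process(n_tokens, n_ctx, chunks=-1):
    """Number of n_ctx-sized chunks evaluated for a tokenized dataset."""
    available = n_tokens // n_ctx      # only whole chunks are counted
    if chunks < 0:                     # -1 (default): process everything
        return available
    return min(chunks, available)      # assumption: cannot exceed what exists

print(chunks_to_process(300_000, 512))             # -> 585
print(chunks_to_process(300_000, 512, chunks=20))  # -> 20
```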
Stride (--ppl-stride):
The stride parameter selects an alternative perplexity computation: instead of splitting the text into non-overlapping chunks, the context window slides forward by a fixed number of tokens between evaluations. This yields more granular perplexity estimates but is slower, because overlapping tokens are re-evaluated. When a stride is set, the context size is automatically increased by stride/2.
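The context adjustment and the sliding behaviour can be illustrated as follows. Both function names are hypothetical. The sketch assumes that a stride of 0 means "not set" and that stride/2 is an integer division; `window_starts` is a generic sliding-window illustration, not the tool's exact loop.

```python
def adjusted_context(n_ctx, stride):
    """Context size after the automatic stride/2 increase (0 = stride unset)."""
    return n_ctx + stride // 2 if stride > 0 else n_ctx

def window_starts(n_tokens, window, stride):
    """Start offsets of a window that advances by `stride` tokens each step."""
    return list(range(0, n_tokens - window + 1, stride))

print(adjusted_context(512, 32))  # -> 528
print(adjusted_context(512, 0))   # -> 512
print(window_starts(10, 4, 2))    # -> [0, 2, 4, 6]
```

With a stride smaller than the window, consecutive windows overlap, which is where the extra cost relative to non-overlapping chunks comes from.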
Task Count (--hellaswag-tasks, --winogrande-tasks):
For benchmark evaluations, the task count controls how many evaluation examples are processed. Using fewer tasks gives faster but noisier estimates. The full HellaSwag validation set contains 10,042 tasks; the debiased Winogrande evaluation set contains 1,267 entries.
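To get a feel for how much noise a reduced task count introduces, the usual binomial standard error of an accuracy estimate can be used. This is a standard statistical approximation offered as a sketch, not a description of what the tool reports; the function name and the example accuracy of 0.75 are hypothetical.

```python
import math

def accuracy_std_error(accuracy, n_tasks):
    """Binomial standard error of an accuracy measured on n_tasks examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)

# Same underlying accuracy, different task counts.
for n_tasks in (400, 1267, 10042):
    se = accuracy_std_error(0.75, n_tasks)
    print(f"{n_tasks:>6} tasks: +/- {100 * se:.2f} percentage points")
```

Because the error shrinks with the square root of the task count, quadrupling the number of tasks only halves the uncertainty.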
Output Format (--ppl-output-type):
Controls the format of per-chunk perplexity output:
- Type 0 (default): Compact format showing chunk index and cumulative perplexity
- Type 1: Verbose format showing chunk index, perplexity, mean NLL, and standard deviation
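The quantities named in the two output types can be computed from per-token negative log-likelihoods as sketched below. The function name and data layout are hypothetical, and the sketch uses the population standard deviation of the per-token NLL as the spread statistic, which is an assumption; the tool's exact statistic and print format are not specified here.

```python
import math
import statistics

def running_report(chunk_nlls):
    """chunk_nlls: one list of per-token negative log-likelihoods per chunk."""
    seen = []
    rows = []
    for index, nlls in enumerate(chunk_nlls, start=1):
        seen.extend(nlls)                        # statistics are cumulative
        mean_nll = statistics.fmean(seen)
        rows.append({
            "chunk": index,
            "ppl": math.exp(mean_nll),           # cumulative perplexity
            "mean_nll": mean_nll,
            "std": statistics.pstdev(seen),      # assumption: population std
        })
    return rows

for row in running_report([[1.0, 1.0], [3.0, 3.0]]):
    print(row)
```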
KL Divergence Parameters:
For KL divergence analysis, the --kl-divergence-base option (also accepted as --save-all-logits) names the file that holds reference logits from a higher-precision model. The perplexity tool can both generate this reference file (when run in standard perplexity mode with --save-all-logits) and consume it (when run in KL divergence mode with --kl-divergence).
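The generate-then-consume workflow can be sketched as two invocations. The helper `kl_workflow` and all file names are hypothetical, and `-m` / `-f` are assumed to be the usual model and input-file options; only the logits-related flags come from the text.

```python
def kl_workflow(reference_model, quantized_model, text_file, logits_file):
    """Return the two argv lists for a reference run and a comparison run."""
    # Step 1: standard perplexity run on the reference model, saving its logits.
    save_reference = [
        "llama-perplexity", "-m", reference_model, "-f", text_file,
        "--save-all-logits", logits_file,
    ]
    # Step 2: KL divergence run on the quantized model against the saved logits.
    compare = [
        "llama-perplexity", "-m", quantized_model,
        "--kl-divergence-base", logits_file, "--kl-divergence",
    ]
    return save_reference, compare

step1, step2 = kl_workflow("model-f16.gguf", "model-q4.gguf",
                           "wiki.test.raw", "reference.logits")
print(" ".join(step1))
print(" ".join(step2))
```

Both steps must point at the same logits file; each list could be passed to `subprocess.run` once the paths refer to real files.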