Principle: mit-han-lab/llm-awq WikiText Perplexity Evaluation
Overview
Standardized language modeling evaluation that measures the perplexity of a quantized model on the WikiText-2 test set using cross-entropy loss over fixed-length, non-overlapping evaluation windows.
Description
Perplexity (PPL) is the primary metric for evaluating language model quality after quantization. The evaluation procedure works as follows:
- WikiText-2 raw text is loaded and concatenated into a single string
- The text is tokenized into a flat sequence of token IDs
- The sequence is split into non-overlapping windows of 2048 tokens
- For each window, the model computes next-token logits via a forward pass
- Cross-entropy loss is computed between the shifted logits and shifted labels, so the logit at position t is scored against the token at position t+1
- Final PPL = exp(average_loss)
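The steps above can be sketched as a minimal PyTorch loop. This is an illustrative reimplementation, not the repo's exact code; `logits_fn` stands in for a causal LM forward pass, and the averaging here is over the seqlen-1 predicted tokens per window (implementations differ slightly in this normalization).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def wikitext_ppl(logits_fn, token_ids: torch.Tensor, seqlen: int = 2048) -> float:
    """Perplexity over non-overlapping windows of `seqlen` tokens.

    logits_fn: callable mapping (1, seqlen) input IDs -> (1, seqlen, vocab) logits.
    token_ids: flat 1-D tensor of token IDs for the whole test set.
    """
    n_windows = token_ids.numel() // seqlen  # drop the trailing partial window
    total_nll = 0.0
    for i in range(n_windows):
        window = token_ids[i * seqlen : (i + 1) * seqlen].unsqueeze(0)
        logits = logits_fn(window)
        # Shift: the logit at position t predicts the token at position t+1.
        shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
        shift_labels = window[:, 1:].reshape(-1)
        # Sum (not mean) so windows can be combined by total token count.
        total_nll += F.cross_entropy(shift_logits, shift_labels,
                                     reduction="sum").item()
    avg_nll = total_nll / (n_windows * (seqlen - 1))
    return float(torch.exp(torch.tensor(avg_nll)))
```

A quick sanity check: a model that outputs all-zero logits assigns a uniform distribution over the vocabulary, so its perplexity equals the vocabulary size.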
Lower PPL indicates better language modeling quality, meaning the model assigns higher probability to the correct next tokens. This is the standard evaluation reported in the AWQ, GPTQ, and RTN quantization literature, making it the primary metric for comparing quantization methods.
Theoretical Basis
PPL = exp(-(1/N) * sum_i log P(w_i | w_{<i}))
The evaluation uses fixed-length, non-overlapping windows with seqlen=2048. Each window is processed independently, with no KV-cache carryover between windows. The summed loss is accumulated across all windows and divided by the total number of predicted tokens before exponentiation.
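The accumulate-then-average step can be made concrete with a small helper. This is a sketch of the bookkeeping only; the function name and the toy numbers in the check below are illustrative.

```python
import math

def ppl_from_window_losses(window_sum_nll, window_token_counts):
    """Combine per-window summed NLLs into one perplexity.

    window_sum_nll: per-window *summed* negative log-likelihood (nats).
    window_token_counts: number of predicted tokens in each window.
    The average is taken over the total token count, then exponentiated.
    """
    total_nll = sum(window_sum_nll)
    total_tokens = sum(window_token_counts)
    return math.exp(total_nll / total_tokens)
```

For example, if every token in two windows costs log(4) nats, the combined perplexity is exactly 4, regardless of how tokens are split between windows.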
Usage
As the primary quality metric when evaluating quantized models (triggered by --tasks wikitext):
- Load the quantized model onto GPU
- Run the WikiText-2 evaluation loop
- Compare the resulting PPL against baseline (FP16) and other quantization methods
- Typical results: 4-bit AWQ achieves PPL within 0.1-0.5 of the FP16 baseline
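The comparison step above can be expressed as a simple acceptance check. The function name and threshold default are illustrative (the 0.5 upper bound mirrors the typical gap cited above), and the PPL values in the check are made-up examples, not measured results.

```python
def within_typical_awq_gap(fp16_ppl: float, quant_ppl: float,
                           max_gap: float = 0.5) -> bool:
    """True if the quantized model's PPL degradation relative to the
    FP16 baseline stays within the typical 4-bit AWQ gap (0.1-0.5)."""
    return (quant_ppl - fp16_ppl) <= max_gap
```

This kind of check is useful in regression tests: a quantization run whose PPL gap exceeds the expected band usually indicates a calibration or packing bug rather than an inherent limit of the method.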
Related Pages
Knowledge Sources
- Paper|AWQ|https://arxiv.org/abs/2306.00978
Domains
- NLP
- Evaluation