Principle: mit-han-lab/llm-awq WikiText Perplexity Evaluation
Overview
Standardized language modeling evaluation that measures the perplexity of a quantized model on the WikiText-2 test set using cross-entropy loss over fixed-length, non-overlapping evaluation windows.
Description
Perplexity (PPL) is the primary metric for evaluating language model quality after quantization. The evaluation procedure works as follows:
- WikiText-2 raw text is loaded and concatenated into a single string
- The text is tokenized into a flat sequence of token IDs
- The sequence is split into non-overlapping windows of 2048 tokens
- For each window, the model computes next-token logits via a forward pass
- Cross-entropy loss is computed between the shifted logits and shifted labels, so the logit at position t is scored against the token at position t+1
- Final PPL = exp(average_loss)
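The steps above can be sketched as a minimal PyTorch loop. This is an illustrative reimplementation, not the repo's exact code; `logits_fn` stands in for a causal LM forward pass, and the averaging here is over the seqlen-1 predicted tokens per window (implementations differ slightly in this normalization).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def wikitext_ppl(logits_fn, token_ids: torch.Tensor, seqlen: int = 2048) -> float:
    """Perplexity over non-overlapping windows of `seqlen` tokens.

    logits_fn: callable mapping (1, seqlen) input IDs -> (1, seqlen, vocab) logits.
    token_ids: flat 1-D tensor of token IDs for the whole test set.
    """
    n_windows = token_ids.numel() // seqlen  # drop the trailing partial window
    total_nll = 0.0
    for i in range(n_windows):
        window = token_ids[i * seqlen : (i + 1) * seqlen].unsqueeze(0)
        logits = logits_fn(window)
        # Shift: the logit at position t predicts the token at position t+1.
        shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
        shift_labels = window[:, 1:].reshape(-1)
        # Sum (not mean) so windows can be combined by total token count.
        total_nll += F.cross_entropy(shift_logits, shift_labels,
                                     reduction="sum").item()
    avg_nll = total_nll / (n_windows * (seqlen - 1))
    return float(torch.exp(torch.tensor(avg_nll)))
```

A quick sanity check: a model that outputs all-zero logits assigns a uniform distribution over the vocabulary, so its perplexity equals the vocabulary size.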
Lower PPL indicates better language modeling quality, meaning the model assigns higher probability to the correct next tokens. This is the standard evaluation reported in the AWQ, GPTQ, and RTN quantization literature, making it the primary metric for comparing quantization methods.
Theoretical Basis
PPL = exp(-(1/N) * sum_i log P(w_i | w_{<i}))
The evaluation uses fixed-length, non-overlapping windows with seqlen=2048. Each window is processed independently, with no KV-cache carryover between windows. The summed loss is accumulated across all windows and divided by the total number of predicted tokens before exponentiation.
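The accumulate-then-average step can be made concrete with a small helper. This is a sketch of the bookkeeping only; the function name and the toy numbers in the check below are illustrative.

```python
import math

def ppl_from_window_losses(window_sum_nll, window_token_counts):
    """Combine per-window summed NLLs into one perplexity.

    window_sum_nll: per-window *summed* negative log-likelihood (nats).
    window_token_counts: number of predicted tokens in each window.
    The average is taken over the total token count, then exponentiated.
    """
    total_nll = sum(window_sum_nll)
    total_tokens = sum(window_token_counts)
    return math.exp(total_nll / total_tokens)
```

For example, if every token in two windows costs log(4) nats, the combined perplexity is exactly 4, regardless of how tokens are split between windows.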
Usage
As the primary quality metric when evaluating quantized models (triggered by --tasks wikitext):
- Load the quantized model onto GPU
- Run the WikiText-2 evaluation loop
- Compare the resulting PPL against baseline (FP16) and other quantization methods
- Typical results: 4-bit AWQ achieves PPL within 0.1-0.5 of the FP16 baseline
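The comparison step above can be expressed as a simple acceptance check. The function name and threshold default are illustrative (the 0.5 upper bound mirrors the typical gap cited above), and the PPL values in the check are made-up examples, not measured results.

```python
def within_typical_awq_gap(fp16_ppl: float, quant_ppl: float,
                           max_gap: float = 0.5) -> bool:
    """True if the quantized model's PPL degradation relative to the
    FP16 baseline stays within the typical 4-bit AWQ gap (0.1-0.5)."""
    return (quant_ppl - fp16_ppl) <= max_gap
```

This kind of check is useful in regression tests: a quantization run whose PPL gap exceeds the expected band usually indicates a calibration or packing bug rather than an inherent limit of the method.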
Related Pages
Knowledge Sources
- Paper|AWQ|https://arxiv.org/abs/2306.00978
Domains
- NLP
- Evaluation