Principle: VainF Torch Pruning LLM Perplexity Evaluation
| Property | Value |
|---|---|
| Domains | NLP, Evaluation, Pruning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Evaluating the quality degradation of pruned language models by measuring perplexity on standard benchmarks.
Description
Perplexity (PPL) is the standard metric for evaluating language model quality. It measures how well the model predicts the next token: the lower the perplexity, the better the model's predictions. For pruned LLMs, comparing perplexity before and after pruning quantifies the accuracy loss from compression.
The evaluation pipeline works as follows:
- Load the benchmark dataset -- WikiText-2 is the standard benchmark. The test split is tokenized into a single long sequence of token IDs.
- Segment into windows -- The tokenized sequence is divided into non-overlapping windows of `seqlen` tokens (typically 2048 or 4096, matching the model's maximum sequence length).
- Compute loss per window -- For each window, the model produces logits. The cross-entropy loss is computed between the shifted logits (positions 0..N-2) and the shifted labels (positions 1..N-1), following the standard next-token prediction setup.
- Aggregate into perplexity -- The negative log-likelihoods from all windows are summed and normalized by total tokens, then exponentiated to produce the final perplexity score.
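The pipeline above can be sketched as a minimal, self-contained function. The names (`evaluate_ppl`, the `Uniform` toy model) are illustrative, not the library's actual API; a real run would pass a Hugging Face causal LM and WikiText-2 token IDs instead of the dummy inputs used here:

```python
import torch
import torch.nn.functional as F

def evaluate_ppl(model, input_ids: torch.Tensor, seqlen: int) -> float:
    """Windowed perplexity: split the token stream into non-overlapping
    windows of `seqlen`, accumulate per-window NLL, normalize by total
    tokens, and exponentiate."""
    nsamples = input_ids.numel() // seqlen
    nlls = []
    with torch.no_grad():
        for i in range(nsamples):
            window = input_ids[:, i * seqlen : (i + 1) * seqlen]
            logits = model(window)                 # (batch, seqlen, vocab)
            shift_logits = logits[:, :-1, :]       # positions 0..N-2
            shift_labels = window[:, 1:]           # positions 1..N-1
            loss = F.cross_entropy(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                shift_labels.reshape(-1),
            )
            # `loss` is a per-token mean; scale back to a window total.
            nlls.append(loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()

# Sanity check: a "model" that emits uniform logits over V classes
# should score a perplexity of exactly V.
class Uniform(torch.nn.Module):
    vocab = 10
    def forward(self, ids):
        return torch.zeros(ids.size(0), ids.size(1), self.vocab)

ids = torch.randint(0, 10, (1, 64))
print(round(evaluate_ppl(Uniform(), ids, seqlen=16), 4))  # → 10.0
```

The uniform-logits check is a useful smoke test for any perplexity harness: it verifies the normalization without needing a trained model.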
Key considerations:
- The model must have a `.seqlen` attribute that defines the window size for evaluation. This is typically set to `model.config.max_position_embeddings` or capped at a practical limit (e.g., 4096).
- Evaluation is performed under `torch.no_grad()` to avoid gradient computation overhead.
- The CUDA cache is cleared after evaluation to free GPU memory.
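A short sketch of this setup; the `config` object below is a stand-in for a real Hugging Face model config, which carries the same field:

```python
import types
import torch

# Stand-in for model.config; real configs expose the same attribute.
config = types.SimpleNamespace(max_position_embeddings=131072)

# Cap the evaluation window at a practical limit.
seqlen = min(config.max_position_embeddings, 4096)
print(seqlen)  # → 4096

# Evaluation runs without gradients; afterwards the CUDA cache is
# cleared to release memory held by intermediate tensors.
with torch.no_grad():
    pass  # windowed perplexity loop goes here
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```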
Usage
Use after structural pruning of LLMs (Llama, Phi, Qwen, etc.) to measure quality degradation. Perplexity is the primary quality metric for LLM pruning workflows. A typical workflow is:
- Measure baseline perplexity of the unpruned model.
- Apply structural pruning with the desired ratio.
- Measure post-pruning perplexity.
- (Optional) Fine-tune the pruned model and re-evaluate.
The perplexity delta between steps 1 and 3 quantifies the cost of pruning. Acceptable degradation depends on the application, but a perplexity increase of less than 1-2 points is generally considered good for moderate pruning ratios (20-30%).
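As an illustration of the delta check between steps 1 and 3 (the helper name and the example numbers are invented; the 2-point budget is the rule of thumb stated above and should be tuned per application):

```python
def ppl_regression(baseline_ppl: float, pruned_ppl: float,
                   max_delta: float = 2.0) -> tuple[float, bool]:
    """Return the perplexity increase from pruning and whether it
    stays within the (application-dependent) acceptable budget."""
    delta = pruned_ppl - baseline_ppl
    return delta, delta <= max_delta

# Illustrative numbers: 5.47 PPL before pruning, 6.90 after.
delta, ok = ppl_regression(5.47, 6.90)
print(f"delta={delta:.2f} ok={ok}")  # → delta=1.43 ok=True
```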
Theoretical Basis
Perplexity is defined as:
PPL = exp( -1/N * sum_{i=1}^{N} log P(w_i | w_1, ..., w_{i-1}) )
where N is the total number of tokens and P(w_i | w_1, ..., w_{i-1}) is the model's predicted probability for token w_i given all preceding tokens.
In practice, this is computed as:
PPL = exp( mean(CrossEntropyLoss) )
The cross-entropy loss at each position is:
CE(i) = -log P(w_i | w_1, ..., w_{i-1})
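A tiny worked example of the definition (the three token probabilities are invented for illustration):

```python
import math

# Suppose the model assigns these probabilities to the 3 observed tokens.
probs = [0.5, 0.25, 0.25]

# CE(i) = -log P(w_i | ...); PPL = exp(mean CE)
nlls = [-math.log(p) for p in probs]
ppl = math.exp(sum(nlls) / len(nlls))

# Equivalent closed form: geometric mean of the inverse probabilities.
assert abs(ppl - (0.5 * 0.25 * 0.25) ** (-1 / 3)) < 1e-12
print(round(ppl, 4))  # → 3.1748
```

This also shows why lower is better: a model that assigned probability 1.0 to every observed token would reach the floor of PPL = 1.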
The implementation in Torch-Pruning computes this over non-overlapping windows of seqlen tokens from the WikiText-2 test set. For each window, the total NLL is loss * seqlen * batch_size, and the final perplexity is:
# For each window i of seqlen tokens:
nll_i = cross_entropy_loss * seqlen * batch_size
# Final perplexity:
ppl = torch.exp(sum(nlls) / (nsamples * seqlen))
This non-overlapping-window approach avoids the quadratic attention cost of processing the entire test set as a single sequence while still providing a reliable estimate of model quality.