Principle: VainF Torch Pruning LLM Perplexity Evaluation
| Property | Value |
|---|---|
| Domains | NLP, Evaluation, Pruning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Evaluating the quality degradation of pruned language models by measuring perplexity on standard benchmarks.
Description
Perplexity (PPL) is the standard metric for evaluating language model quality. It measures how well the model predicts the next token: the lower the perplexity, the better the model's predictions. For pruned LLMs, comparing perplexity before and after pruning quantifies the accuracy loss from compression.
The evaluation pipeline works as follows:
- Load the benchmark dataset -- WikiText-2 is the standard benchmark. The test split is tokenized into a single long sequence of token IDs.
- Segment into windows -- The tokenized sequence is divided into non-overlapping windows of `seqlen` tokens (typically 2048 or 4096, matching the model's maximum sequence length).
- Compute loss per window -- For each window, the model produces logits. The cross-entropy loss is computed between the shifted logits (positions 0..N-2) and the shifted labels (positions 1..N-1), following the standard next-token prediction setup.
- Aggregate into perplexity -- The negative log-likelihoods from all windows are summed and normalized by total tokens, then exponentiated to produce the final perplexity score.
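The pipeline above can be sketched as a minimal, self-contained function. The names (`evaluate_ppl`, the `Uniform` toy model) are illustrative, not the library's actual API; a real run would pass a Hugging Face causal LM and WikiText-2 token IDs instead of the dummy inputs used here:

```python
import torch
import torch.nn.functional as F

def evaluate_ppl(model, input_ids: torch.Tensor, seqlen: int) -> float:
    """Windowed perplexity: split the token stream into non-overlapping
    windows of `seqlen`, accumulate per-window NLL, normalize by total
    tokens, and exponentiate."""
    nsamples = input_ids.numel() // seqlen
    nlls = []
    with torch.no_grad():
        for i in range(nsamples):
            window = input_ids[:, i * seqlen : (i + 1) * seqlen]
            logits = model(window)                 # (batch, seqlen, vocab)
            shift_logits = logits[:, :-1, :]       # positions 0..N-2
            shift_labels = window[:, 1:]           # positions 1..N-1
            loss = F.cross_entropy(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                shift_labels.reshape(-1),
            )
            # `loss` is a per-token mean; scale back to a window total.
            nlls.append(loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()

# Sanity check: a "model" that emits uniform logits over V classes
# should score a perplexity of exactly V.
class Uniform(torch.nn.Module):
    vocab = 10
    def forward(self, ids):
        return torch.zeros(ids.size(0), ids.size(1), self.vocab)

ids = torch.randint(0, 10, (1, 64))
print(round(evaluate_ppl(Uniform(), ids, seqlen=16), 4))  # → 10.0
```

The uniform-logits check is a useful smoke test for any perplexity harness: it verifies the normalization without needing a trained model.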
Key considerations:
- The model must have a `.seqlen` attribute that defines the window size for evaluation. This is typically set to `model.config.max_position_embeddings` or capped at a practical limit (e.g., 4096).
- Evaluation is performed under `torch.no_grad()` to avoid gradient computation overhead.
- The CUDA cache is cleared after evaluation to free GPU memory.
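A short sketch of this setup; the `config` object below is a stand-in for a real Hugging Face model config, which carries the same field:

```python
import types
import torch

# Stand-in for model.config; real configs expose the same attribute.
config = types.SimpleNamespace(max_position_embeddings=131072)

# Cap the evaluation window at a practical limit.
seqlen = min(config.max_position_embeddings, 4096)
print(seqlen)  # → 4096

# Evaluation runs without gradients; afterwards the CUDA cache is
# cleared to release memory held by intermediate tensors.
with torch.no_grad():
    pass  # windowed perplexity loop goes here
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```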
Usage
Use after structural pruning of LLMs (Llama, Phi, Qwen, etc.) to measure quality degradation. Perplexity is the primary quality metric for LLM pruning workflows. A typical workflow is:
- Measure baseline perplexity of the unpruned model.
- Apply structural pruning with the desired ratio.
- Measure post-pruning perplexity.
- (Optional) Fine-tune the pruned model and re-evaluate.
The perplexity delta between steps 1 and 3 quantifies the cost of pruning. Acceptable degradation depends on the application, but a perplexity increase of less than 1-2 points is generally considered good for moderate pruning ratios (20-30%).
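As an illustration of the delta check between steps 1 and 3 (the helper name and the example numbers are invented; the 2-point budget is the rule of thumb stated above and should be tuned per application):

```python
def ppl_regression(baseline_ppl: float, pruned_ppl: float,
                   max_delta: float = 2.0) -> tuple[float, bool]:
    """Return the perplexity increase from pruning and whether it
    stays within the (application-dependent) acceptable budget."""
    delta = pruned_ppl - baseline_ppl
    return delta, delta <= max_delta

# Illustrative numbers: 5.47 PPL before pruning, 6.90 after.
delta, ok = ppl_regression(5.47, 6.90)
print(f"delta={delta:.2f} ok={ok}")  # → delta=1.43 ok=True
```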
Theoretical Basis
Perplexity is defined as:
PPL = exp( -1/N * sum_{i=1}^{N} log P(w_i | w_1, ..., w_{i-1}) )
where N is the total number of tokens and P(w_i | w_1, ..., w_{i-1}) is the model's predicted probability for token w_i given all preceding tokens.
In practice, this is computed as:
PPL = exp( mean(CrossEntropyLoss) )
The cross-entropy loss at each position is:
CE(i) = -log P(w_i | w_1, ..., w_{i-1})
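A tiny worked example of the definition (the three token probabilities are invented for illustration):

```python
import math

# Suppose the model assigns these probabilities to the 3 observed tokens.
probs = [0.5, 0.25, 0.25]

# CE(i) = -log P(w_i | ...); PPL = exp(mean CE)
nlls = [-math.log(p) for p in probs]
ppl = math.exp(sum(nlls) / len(nlls))

# Equivalent closed form: geometric mean of the inverse probabilities.
assert abs(ppl - (0.5 * 0.25 * 0.25) ** (-1 / 3)) < 1e-12
print(round(ppl, 4))  # → 3.1748
```

This also shows why lower is better: a model that assigned probability 1.0 to every observed token would reach the floor of PPL = 1.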
The implementation in Torch-Pruning computes this over non-overlapping windows of seqlen tokens from the WikiText-2 test set. For each window, the total NLL is loss * seqlen * batch_size, and the final perplexity is:
# For each window i of seqlen tokens:
nll_i = cross_entropy_loss * seqlen * batch_size
# Final perplexity:
ppl = torch.exp(sum(nlls) / (nsamples * seqlen))
This non-overlapping-window approach avoids the quadratic attention cost of processing the entire test set as a single sequence while still providing a reliable estimate of model quality.