Workflow:Mit han lab Llm awq AWQ Model Evaluation

Knowledge Sources	llm-awq, AWQ: Activation-aware Weight Quantization, lm-eval-harness
Domains	LLMs, Model_Evaluation, Quantization
Last Updated	2025-04-01 00:00 GMT

Overview

End-to-end process for evaluating AWQ-quantized language models on standard benchmarks to measure accuracy degradation from quantization.

Description

This workflow measures how well AWQ-quantized models perform on language modeling benchmarks. It supports two evaluation modes: built-in WikiText-2 perplexity evaluation and multi-task evaluation via the lm-eval-harness framework. Models can be evaluated at different stages of the quantization pipeline: pseudo-quantized (weights rounded to the low-bit grid but still stored as FP16), real-quantized (packed INT4), or loaded from an existing quantized checkpoint. This lets users verify that quantization quality meets their requirements before deploying to production.

Usage

Execute this workflow after running the AWQ search (or loading pre-computed search results) to verify that quantization has not degraded model quality beyond acceptable thresholds. It is particularly useful for comparing different quantization configurations (e.g., 3-bit vs 4-bit, different group sizes) or validating that a new model family quantizes well with AWQ.

Execution Steps

Step 1: Load Model and AWQ Results

Load the base model from its HuggingFace path and, if evaluating a pseudo-quantized model, load the pre-computed AWQ search results (scales and clips). Apply the AWQ transforms to the model weights to prepare them for quantization simulation. Alternatively, load a previously saved real-quantized checkpoint directly.

Key considerations:

Two loading paths exist: apply AWQ + pseudo-quantize for simulation, or load a pre-quantized .pt file for real evaluation
When multiple GPUs are available, the accelerate library configures multi-GPU parallelism automatically
Model is dispatched across available devices using inferred device maps
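
The pseudo-quantization loading path can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the repository's exact code: the function names `parse_max_memory` and `load_pseudo_quant_candidate` are hypothetical, the import path for `apply_awq` is assumed from the llm-awq package layout, and the search results are assumed to be a file readable with `torch.load`. Heavy imports are kept inside the loader so the small helper can be used on its own.

```python
def parse_max_memory(specs):
    """Turn ["0:20GiB", "cpu:60GiB"] into {0: "20GiB", "cpu": "60GiB"}.

    accelerate expects integer keys for GPU indices and the string "cpu"
    for host memory. The "device:size" input format is a hypothetical choice.
    """
    budget = {}
    for spec in specs:
        device, _, size = spec.partition(":")
        budget[int(device) if device.isdigit() else device] = size
    return budget


def load_pseudo_quant_candidate(model_path, awq_results_path, max_memory=None):
    """Load an FP16 model, apply saved AWQ scales/clips, and dispatch it."""
    import torch
    from accelerate import dispatch_model, infer_auto_device_map
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: llm-awq exposes apply_awq at this import path.
    from awq.quantize.pre_quant import apply_awq

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
    )

    # Assumption: the search results were saved with torch.save.
    awq_results = torch.load(awq_results_path, map_location="cpu")
    apply_awq(model, awq_results)

    # Let accelerate infer a placement across the available devices.
    # Real use would usually also pass no_split_module_classes so that a
    # decoder block is never split across two devices.
    device_map = infer_auto_device_map(model, max_memory=max_memory or None)
    model = dispatch_model(model, device_map=device_map)
    return model, tokenizer
```

Passing `max_memory=None` leaves the per-device budget to accelerate; an explicit budget is mainly useful when GPUs are shared with other jobs.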

Step 2: Apply Quantization Mode

Apply the quantization mode that matches the evaluation target. Pseudo (fake) quantization rounds weights to the target low-bit grid (e.g., INT4) but keeps them as FP16 tensors, so the model still runs through standard PyTorch operators. Real quantization packs the weights into WQLinear modules so that the model is evaluated under actual deployment conditions.

Key considerations:

Pseudo quantization is faster to evaluate and does not require CUDA kernels
Real quantization evaluation reflects actual deployment accuracy more faithfully
Both modes share the same AWQ search results
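
To make "rounds to a low-bit grid but keeps floating-point values" concrete, here is a dependency-free sketch of group-wise asymmetric (zero-point) round-to-nearest quantization followed by immediate dequantization. It illustrates the general technique on plain Python lists. The function names, the default group size of 128, and the use of min/max scaling are assumptions for illustration, not the library's implementation.

```python
def pseudo_quantize_group(weights, n_bit=4):
    """Quantize one group to an n_bit grid, then map back to floats.

    Returns the dequantized values and the step size (scale) of the grid.
    """
    max_int = 2 ** n_bit - 1
    w_min, w_max = min(weights), max(weights)
    # Step size of the grid; the floor avoids division by zero for flat groups.
    scale = max(w_max - w_min, 1e-5) / max_int
    # Integer zero point so that w_min maps (approximately) to level 0.
    zero = min(max(-round(w_min / scale), 0), max_int)
    dequantized = []
    for w in weights:
        q = min(max(round(w / scale) + zero, 0), max_int)  # integer level
        dequantized.append((q - zero) * scale)             # back to float
    return dequantized, scale


def pseudo_quantize(weights, n_bit=4, group_size=128):
    """Apply pseudo_quantize_group independently to consecutive groups."""
    out = []
    for start in range(0, len(weights), group_size):
        group, _ = pseudo_quantize_group(weights[start:start + group_size], n_bit)
        out.extend(group)
    return out
```

The output has the same length and type as the input, which is why a pseudo-quantized model can be evaluated with ordinary floating-point kernels: only the set of representable values per group has shrunk to at most `2 ** n_bit` levels.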

Step 3: Run WikiText Perplexity Evaluation

Evaluate the model on the WikiText-2 test set by computing its perplexity (PPL). The full test set is tokenized and split into fixed-length sequences of 2048 tokens, and each sequence is scored with the model's log-likelihood. Perplexity is the exponential of the average per-token negative log-likelihood over all sequences.

What happens:

Load WikiText-2 raw test split from HuggingFace datasets
Tokenize and split into 2048-token windows
Compute cross-entropy loss per window in a no-gradient forward pass
Report the overall perplexity value
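
The windowing and aggregation arithmetic behind these steps can be sketched without any model. The sketch below assumes that windows do not overlap, that a trailing partial window is dropped, and that the forward pass yields one mean cross-entropy value (in nats) per window. The function names are hypothetical.

```python
import math


def split_into_windows(token_ids, seqlen=2048):
    """Cut a token sequence into consecutive windows of exactly seqlen tokens.

    Assumption: tokens left over after the last full window are discarded.
    """
    n_windows = len(token_ids) // seqlen
    return [token_ids[i * seqlen:(i + 1) * seqlen] for i in range(n_windows)]


def perplexity(window_mean_nll, seqlen=2048):
    """Combine per-window mean cross-entropy losses into one perplexity.

    Each window contributes loss * seqlen to the total negative
    log-likelihood; the total is then averaged per token and exponentiated.
    """
    if not window_mean_nll:
        raise ValueError("need at least one window loss")
    total_nll = sum(loss * seqlen for loss in window_mean_nll)
    return math.exp(total_nll / (len(window_mean_nll) * seqlen))
```

Because every window has the same length, this reduces to the exponential of the plain mean of the window losses; the explicit weighting is kept so the formula stays valid if window lengths ever differ.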

Step 4: Run Benchmark Tasks (Optional)

For broader evaluation beyond perplexity, use the lm-eval-harness integration to run the model on standard NLP benchmarks (e.g., HellaSwag, ARC, WinoGrande). The model is wrapped in an LMEvalAdaptor that implements the BaseLM interface expected by the evaluation harness, then tasks are executed and results are tabulated.

Key considerations:

Any task supported by lm-eval-harness can be specified as a comma-separated list
Few-shot evaluation is supported via the num_fewshot parameter
Results can be saved to a JSON file for comparison across runs
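
A rough sketch of how a comma-separated task list and a few-shot setting might be passed to the harness is shown below. It assumes an installed lm-eval-harness version whose `evaluator.simple_evaluate` accepts `model`, `tasks`, `num_fewshot`, and `batch_size`, and it takes an already-wrapped model object (for example, an LMEvalAdaptor instance) as input. `parse_tasks` and `run_benchmarks` are hypothetical helper names.

```python
def parse_tasks(spec):
    """Split "hellaswag, arc_easy,winogrande" into a clean list of names."""
    return [name.strip() for name in spec.split(",") if name.strip()]


def run_benchmarks(lm, task_spec, num_fewshot=0, batch_size=1):
    """Run the listed tasks on a harness-compatible model wrapper.

    Assumption: `lm` already implements the interface the harness expects.
    """
    from lm_eval import evaluator  # imported lazily; requires lm-eval-harness

    task_names = parse_tasks(task_spec)
    if not task_names:
        raise ValueError("no tasks given")
    return evaluator.simple_evaluate(
        model=lm,
        tasks=task_names,
        num_fewshot=num_fewshot,
        batch_size=batch_size,
    )
```

Valid task names depend on the installed harness version, so the names in the docstring are only examples.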

Step 5: Report and Save Results

Display evaluation results (perplexity score or benchmark table) and optionally save them as a JSON file for later comparison. Results include the model path and configuration used, enabling reproducible comparisons across quantization settings.
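
One way to store results together with the model path and quantization settings, so that runs can be compared later, is sketched below using only the standard library. The record layout and the helper names (`save_results`, `load_results`, `metric_delta`) are illustrative assumptions, not the format the workflow itself writes.

```python
import json
from pathlib import Path


def save_results(results, output_path, model_path, quant_config):
    """Write results plus the settings that produced them to a JSON file."""
    record = {
        "model": model_path,
        "quant_config": quant_config,  # e.g. bit width and group size
        "results": results,
    }
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record


def load_results(path):
    """Read back a record written by save_results."""
    return json.loads(Path(path).read_text())


def metric_delta(baseline, candidate, key):
    """Difference in one metric between two saved records (candidate - baseline)."""
    return candidate["results"][key] - baseline["results"][key]
```

Keeping the configuration inside the same file as the metrics means a result can always be traced back to the settings that produced it, even if files are renamed.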
