Workflow:Mit han lab Llm awq AWQ Model Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Evaluation, Quantization |
| Last Updated | 2025-04-01 00:00 GMT |
Overview
End-to-end process for evaluating AWQ-quantized language models on standard benchmarks to measure accuracy degradation from quantization.
Description
This workflow evaluates the quality of AWQ-quantized models by measuring their performance on language modeling benchmarks. It supports two evaluation modes: WikiText-2 perplexity evaluation (built-in) and multi-task evaluation via the lm-eval-harness framework. The workflow can evaluate models at different stages of the quantization pipeline: pseudo-quantized (weights rounded to the low-bit grid but stored as FP16), real-quantized (packed INT4), or loaded from pre-existing quantized checkpoints. This allows users to verify that quantization quality meets their requirements before deploying to production.
Usage
Execute this workflow after running the AWQ search (or loading pre-computed search results) to verify that quantization has not degraded model quality beyond acceptable thresholds. It is particularly useful for comparing different quantization configurations (e.g., 3-bit vs 4-bit, different group sizes) or validating that a new model family quantizes well with AWQ.
Execution Steps
Step 1: Load Model and AWQ Results
Load the base model from its HuggingFace path and, if evaluating a pseudo-quantized model, load the pre-computed AWQ search results (scales and clips). Apply the AWQ transforms to the model weights to prepare them for quantization simulation. Alternatively, load a previously saved real-quantized checkpoint directly.
Key considerations:
- Two loading paths exist: apply AWQ + pseudo-quantize for simulation, or load a pre-quantized .pt file for real evaluation
- When multiple GPUs are available, multi-GPU parallelism is configured automatically via the accelerate library
- Model is dispatched across available devices using inferred device maps
Step 2: Apply Quantization Mode
Depending on the evaluation target, apply the appropriate quantization mode. Pseudo (fake) quantization rounds weights to the target low-bit precision (e.g., INT4) but keeps them as FP16 tensors, enabling standard PyTorch evaluation. Real quantization packs weights into WQLinear modules for evaluation under actual deployment conditions.
Key considerations:
- Pseudo quantization is faster to evaluate and does not require the custom low-bit CUDA kernels
- Real quantization evaluation reflects actual deployment accuracy more faithfully
- Both modes share the same AWQ search results
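The rounding performed by pseudo quantization can be sketched as follows. This is a minimal illustration, assuming asymmetric min/max quantization applied per group of weights; the function name `pseudo_quantize`, the use of plain Python lists, and the example values are hypothetical and are not the repository's implementation, which operates on tensors.

```python
def pseudo_quantize(weights, n_bit=4, group_size=128):
    """Round each group of weights to an n_bit grid, then map back to floats.

    Illustrative sketch only: asymmetric min/max quantization per group is
    an assumption, and the names here are hypothetical.
    """
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a positive multiple of group_size")
    max_int = 2 ** n_bit - 1  # largest integer code, e.g. 15 for 4 bits
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        w_min, w_max = min(group), max(group)
        # One scale and one zero point per group.
        scale = max(w_max - w_min, 1e-5) / max_int
        zero = min(max(-round(w_min / scale), 0), max_int)
        for w in group:
            # Integer code the weight would be stored as under real quantization.
            code = min(max(round(w / scale) + zero, 0), max_int)
            # Pseudo quantization keeps the dequantized float instead of the code.
            out.append((code - zero) * scale)
    return out


# Example: one row of 32 weights treated as a single group.
row = [0.1 * i - 1.6 for i in range(32)]
print(pseudo_quantize(row, n_bit=4, group_size=32)[:4])
```

Changing `n_bit` or `group_size` in such a sketch mirrors the configuration comparisons mentioned under Usage: fewer bits or larger groups leave fewer distinct values per group, which is the source of the accuracy loss this workflow measures.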
Step 3: Run WikiText Perplexity Evaluation
Evaluate the model on the WikiText-2 test set by computing the perplexity (PPL) score. The full test set is tokenized and split into fixed-length sequences (2048 tokens), and each sequence is scored using the model's log-likelihood. The exponential of the average per-token negative log-likelihood over all sequences gives the perplexity score; lower is better.
What happens:
- Load WikiText-2 raw test split from HuggingFace datasets
- Tokenize and split into 2048-token windows
- Compute cross-entropy loss per window in a no-gradient forward pass
- Report the overall perplexity value
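The arithmetic behind these steps can be sketched without a model. The snippet below is a minimal illustration, assuming non-overlapping windows, a dropped incomplete tail, and per-window mean cross-entropy losses in nats; the helper names `split_into_windows` and `perplexity` are hypothetical, and the actual forward pass that produces the losses is omitted.

```python
import math


def split_into_windows(token_ids, seqlen=2048):
    """Cut a token stream into non-overlapping windows of seqlen tokens.

    Assumption: the incomplete tail shorter than seqlen is dropped.
    """
    n_windows = len(token_ids) // seqlen
    return [token_ids[i * seqlen:(i + 1) * seqlen] for i in range(n_windows)]


def perplexity(window_losses, seqlen=2048):
    """Combine per-window mean cross-entropy losses (nats) into one PPL value."""
    if not window_losses:
        raise ValueError("need at least one window loss")
    # Convert each mean loss back to a summed negative log-likelihood.
    total_nll = sum(loss * seqlen for loss in window_losses)
    # Average over all scored tokens, then exponentiate.
    return math.exp(total_nll / (len(window_losses) * seqlen))
```

A useful sanity check for such a sketch: a model that assigns uniform probability over a vocabulary of size V has a loss of ln(V) in every window, so its perplexity equals V.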
Step 4: Run Benchmark Tasks (Optional)
For broader evaluation beyond perplexity, use the lm-eval-harness integration to run the model on standard NLP benchmarks (e.g., HellaSwag, ARC, WinoGrande). The model is wrapped in an LMEvalAdaptor that implements the BaseLM interface expected by the evaluation harness, then tasks are executed and results are tabulated.
Key considerations:
- Any task supported by lm-eval-harness can be specified as a comma-separated list
- Few-shot evaluation is supported via the num_fewshot parameter
- Results can be saved to a JSON file for comparison across runs
Step 5: Report and Save Results
Display evaluation results (perplexity score or benchmark table) and optionally save them as a JSON file for later comparison. Results include the model path and configuration used, enabling reproducible comparisons across quantization settings.
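One way to store results so that runs remain comparable is sketched below. The record layout (`model_path`, `config`, `results`) and the helper names are assumptions made for illustration; the source does not specify the JSON schema the workflow writes.

```python
import json


def save_results(path, model_path, config, results):
    """Write one evaluation record to a JSON file and return the record.

    Hypothetical layout: the keys used here are illustrative, not the
    workflow's actual output format.
    """
    record = {"model_path": model_path, "config": config, "results": results}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, sort_keys=True)
    return record


def load_results(path):
    """Read a record previously written by save_results."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def metric_delta(baseline_path, candidate_path, metric):
    """Return candidate minus baseline for one metric stored under 'results'."""
    baseline = load_results(baseline_path)["results"][metric]
    candidate = load_results(candidate_path)["results"][metric]
    return candidate - baseline
```

Keeping the quantization settings (for example bit width and group size) inside `config` next to the metrics means a later comparison does not depend on file names to recover which settings produced which score.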