Workflow:Mit han lab Llm awq AWQ Model Evaluation

From Leeroopedia
Knowledge Sources
Domains LLMs, Model_Evaluation, Quantization
Last Updated 2025-04-01 00:00 GMT

Overview

End-to-end process for evaluating AWQ-quantized language models on standard benchmarks to measure accuracy degradation from quantization.

Description

This workflow evaluates the quality of AWQ-quantized models by measuring their performance on language modeling benchmarks. It supports two evaluation modes: WikiText-2 perplexity evaluation (built-in) and multi-task evaluation via the lm-eval-harness framework. The workflow can evaluate models at different stages of the quantization pipeline: pseudo-quantized (weights rounded to the low-bit grid but still stored as FP16), real-quantized (packed INT4), or loaded from a pre-existing quantized checkpoint. This allows users to verify that quantization quality meets their requirements before deploying to production.

Usage

Execute this workflow after running the AWQ search (or loading pre-computed search results) to verify that quantization has not degraded model quality beyond acceptable thresholds. It is particularly useful for comparing different quantization configurations (e.g., 3-bit vs 4-bit, different group sizes) or validating that a new model family quantizes well with AWQ.

Execution Steps

Step 1: Load Model and AWQ Results

Load the base model from its HuggingFace path and, if evaluating a pseudo-quantized model, load the pre-computed AWQ search results (scales and clips). Apply the AWQ transforms to the model weights to prepare them for quantization simulation. Alternatively, load a previously saved real-quantized checkpoint directly.

Key considerations:

  • Two loading paths exist: apply AWQ + pseudo-quantize for simulation, or load a pre-quantized .pt file for real evaluation
  • When multiple GPUs are available, multi-GPU parallelism is configured automatically via the accelerate library
  • Model is dispatched across available devices using inferred device maps
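
As a rough illustration of what applying the search results means, the sketch below scales each input channel of a weight matrix and divides the matching activation by the same factor, which leaves the layer output unchanged, and then clamps weights to a threshold. It is a minimal pure-Python sketch, not the repository's code: the helper names (linear, apply_scale, apply_clip) are hypothetical, a single scalar clip threshold is assumed for simplicity, and in a real model the inverse scale would be folded into a preceding operation rather than applied to the input directly.

```python
def linear(weight, x):
    """Compute y = W @ x for a row-major weight matrix and an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def apply_scale(weight, scales):
    """Multiply each input channel (column) of W by its scale factor."""
    return [[w * s for w, s in zip(row, scales)] for row in weight]


def apply_clip(weight, max_val):
    """Clamp every weight to [-max_val, max_val] (scalar threshold assumed)."""
    return [[max(-max_val, min(max_val, w)) for w in row] for row in weight]


# Toy layer: 2 output channels, 3 input channels (illustrative values only).
weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
x = [4.0, 2.0, -1.0]
scales = [2.0, 0.5, 4.0]  # one hypothetical scale per input channel

# Scale the weights up and the activations down by the same factors.
scaled_weight = apply_scale(weight, scales)
scaled_x = [xi / s for xi, s in zip(x, scales)]

y_ref = linear(weight, x)
y_scaled = linear(scaled_weight, scaled_x)

# Clipping narrows the weight range before rounding.
clipped_weight = apply_clip(weight, 1.0)
```

Because the scaling cancels exactly, y_ref and y_scaled agree up to floating-point error; the benefit only appears once the scaled weights are rounded to the low-bit grid.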

Step 2: Apply Quantization Mode

Depending on the evaluation target, apply the appropriate quantization mode. Pseudo (fake) quantization rounds weights to INT4 precision but keeps them as FP16 tensors, enabling standard PyTorch evaluation. Real quantization packs weights into WQLinear modules for evaluation under actual deployment conditions.

Key considerations:

  • Pseudo quantization is faster to evaluate and does not require CUDA kernels
  • Real quantization evaluation reflects actual deployment accuracy more faithfully
  • Both modes share the same AWQ search results
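
The idea behind pseudo quantization can be sketched as a round trip: map each group of weights onto an integer grid, then map it straight back to floating point. The example below is a dependency-free illustration under stated assumptions, not the repository's implementation: it assumes asymmetric min/max quantization with a zero point, operates on a flat list of floats, and uses hypothetical function names and a small group size chosen for readability.

```python
def pseudo_quantize_group(group, n_bits=4):
    """Round one group of weights to an n_bits grid and return floats again."""
    max_int = 2 ** n_bits - 1
    lo, hi = min(group), max(group)
    # Step size of the integer grid; the floor avoids division by zero.
    scale = max(hi - lo, 1e-5) / max_int
    # Integer offset so that `lo` maps to (approximately) grid point 0.
    zero = min(max(-round(lo / scale), 0), max_int)
    out = []
    for w in group:
        q = min(max(round(w / scale) + zero, 0), max_int)  # quantize
        out.append((q - zero) * scale)                     # dequantize
    return out


def pseudo_quantize(weights, n_bits=4, group_size=4):
    """Apply group-wise fake quantization to a flat list of weights."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    result = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        result.extend(pseudo_quantize_group(group, n_bits))
    return result


# Illustrative values only: two groups of four weights each.
weights = [0.0, 0.1, 0.2, 1.5, -1.0, -0.5, 0.5, 1.0]
fake = pseudo_quantize(weights, n_bits=4, group_size=4)
```

The output has the same length and type as the input, which is why a pseudo-quantized model can be evaluated with ordinary floating-point layers while still carrying the rounding error of the chosen bit width and group size.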

Step 3: Run WikiText Perplexity Evaluation

Evaluate the model on the WikiText-2 test set by computing the perplexity (PPL) score. The full test set is tokenized and split into fixed-length sequences (2048 tokens), and each sequence is scored using the model's log-likelihood. Perplexity is the exponential of the average per-token negative log-likelihood across all sequences; lower values indicate a better language model.

What happens:

  • Load WikiText-2 raw test split from HuggingFace datasets
  • Tokenize and split into 2048-token windows
  • Compute cross-entropy loss per window in a no-gradient forward pass
  • Report the overall perplexity value
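
The arithmetic behind these steps can be sketched without a model. The example below assumes each window yields a mean per-token cross-entropy loss (in nats) and that tokens left over after the last full window are dropped; both function names are hypothetical, and the losses would come from the model's forward pass in practice.

```python
import math


def split_into_windows(token_ids, seq_len):
    """Split a token stream into non-overlapping windows of seq_len tokens.

    Assumption: a trailing remainder shorter than seq_len is discarded.
    """
    n_windows = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_windows)]


def perplexity_from_window_losses(window_losses, seq_len):
    """Combine mean per-token losses (one per window) into a perplexity."""
    if not window_losses:
        raise ValueError("need at least one window loss")
    # Undo the per-window mean to get the total negative log-likelihood.
    total_nll = sum(loss * seq_len for loss in window_losses)
    n_tokens = len(window_losses) * seq_len
    return math.exp(total_nll / n_tokens)


# Toy token stream: 10 tokens with 4-token windows leaves 2 full windows.
windows = split_into_windows(list(range(10)), seq_len=4)

# Placeholder losses, not measured results.
ppl = perplexity_from_window_losses([math.log(2.0)] * 3, seq_len=2048)
```

A model that assigns probability 1/2 to every correct token has a loss of ln 2 per token, so its perplexity is approximately 2; this is a useful sanity check when reading reported values.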

Step 4: Run Benchmark Tasks (Optional)

For broader evaluation beyond perplexity, use the lm-eval-harness integration to run the model on standard NLP benchmarks (e.g., HellaSwag, ARC, WinoGrande). The model is wrapped in an LMEvalAdaptor that implements the BaseLM interface expected by the evaluation harness, then tasks are executed and results are tabulated.

Key considerations:

  • Any task supported by lm-eval-harness can be specified as a comma-separated list
  • Few-shot evaluation is supported via the num_fewshot parameter
  • Results can be saved to a JSON file for comparison across runs

Step 5: Report and Save Results

Display evaluation results (perplexity score or benchmark table) and optionally save them as a JSON file for later comparison. Results include the model path and configuration used, enabling reproducible comparisons across quantization settings.
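
One way to keep results comparable across runs is to store the metrics together with the settings that produced them. The sketch below uses only the standard library; the record layout, the field names (model_path, w_bit, q_group_size), and both helper functions are assumptions made for illustration, not the format the workflow writes.

```python
import json
from pathlib import Path


def save_results(path, model_path, w_bit, q_group_size, metrics):
    """Write metrics plus the quantization settings to a JSON file."""
    record = {
        "model_path": model_path,
        "config": {"w_bit": w_bit, "q_group_size": q_group_size},
        "results": metrics,
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
    return record


def load_results(path):
    """Read a record written by save_results."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)


def same_config(a, b):
    """True when two records were produced with identical settings."""
    return a["model_path"] == b["model_path"] and a["config"] == b["config"]
```

Calling save_results once per configuration (for example, one file per bit width and group size) and comparing the loaded records makes it straightforward to line up scores from different quantization settings.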

GitHub URL

Workflow Repository