Principle: mit-han-lab/llm-awq LM Evaluation Harness Adaptation
Overview
Adapter pattern that wraps quantized language models to conform to the lm-evaluation-harness interface for standardized benchmark evaluation.
Description
The EleutherAI lm-evaluation-harness defines a BaseLM interface that standardizes how models are queried for benchmarks (PIQA, HellaSwag, WinoGrande, ARC, etc.). Adapting a custom quantized model requires implementing methods for:
- Tokenization - tok_encode and tok_decode for converting between text and token IDs
- Forward pass - _model_call for computing logits over a batch of token IDs
- Generation - _model_generate for autoregressive text generation
- Properties - eot_token_id, max_length, max_gen_toks, batch_size, device
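The interface above can be sketched as a minimal adapter class. This is a hedged illustration: the method and property names mirror the older `lm_eval.base.BaseLM` interface listed above, but `ToyTokenizer` and the wrapped model are hypothetical stand-ins, not llm-awq code (a real adapter would subclass `BaseLM` and wrap a HuggingFace tokenizer and a quantized model).

```python
class ToyTokenizer:
    """Hypothetical stand-in for a HuggingFace tokenizer."""

    def __init__(self):
        self.vocab, self.inv = {}, {}
        self.eos_token_id = 0

    def encode(self, text):
        ids = []
        for word in text.split():
            if word not in self.vocab:
                idx = len(self.vocab) + 1
                self.vocab[word] = idx
                self.inv[idx] = word
            ids.append(self.vocab[word])
        return ids

    def decode(self, ids):
        return " ".join(self.inv.get(i, "<unk>") for i in ids)


class QuantizedModelAdapter:
    """Sketch of a BaseLM-style wrapper (would subclass lm_eval.base.BaseLM)."""

    def __init__(self, model, tokenizer, device="cpu", max_length=2048):
        self.model = model
        self.tokenizer = tokenizer
        self._device = device
        self._max_length = max_length

    # --- Tokenization ---
    def tok_encode(self, string):
        return self.tokenizer.encode(string)

    def tok_decode(self, tokens):
        return self.tokenizer.decode(tokens)

    # --- Properties the harness queries ---
    @property
    def eot_token_id(self):
        return self.tokenizer.eos_token_id

    @property
    def max_length(self):
        return self._max_length

    @property
    def max_gen_toks(self):
        return 256

    @property
    def batch_size(self):
        return 1

    @property
    def device(self):
        return self._device

    # --- Forward pass and generation ---
    def _model_call(self, inps):
        # inps: batch of token IDs; must return logits over the vocabulary.
        return self.model(inps)

    def _model_generate(self, context, max_length, eos_token_id):
        return self.model.generate(context, max_length, eos_token_id)
```

With these methods in place, the harness can drive the quantized model exactly as it would a stock HuggingFace model.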
This adapter pattern enables fair comparison with published quantization results on standard benchmarks. Without it, each quantization method would need its own evaluation scripts, making reproducible comparison difficult.
The key challenge is that quantized models may have different forward pass signatures, device placement strategies, or tokenizer configurations compared to standard HuggingFace models. The adapter normalizes these differences behind the BaseLM interface.
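One way this normalization can look in practice: a `_model_call` override that moves inputs to the device the quantized kernels expect and strips extra return values so the harness sees only logits. This is a hedged sketch; the tuple-returning `forward` shown here is a hypothetical example of a non-standard signature, not llm-awq's actual one.

```python
class QuantForwardNormalizer:
    """Illustrative shim around a quantized model whose forward
    returns (logits, aux_stats) instead of bare logits."""

    def __init__(self, quant_model, device="cpu"):
        self.quant_model = quant_model
        self._device = device

    def _model_call(self, inps):
        # Normalize device placement, then unwrap the logits so the
        # harness receives the same shape a standard HF model produces.
        inps = self._to_device(inps)
        out = self.quant_model(inps)
        return out[0] if isinstance(out, tuple) else out

    def _to_device(self, inps):
        # With torch tensors this would be `inps.to(self._device)`;
        # kept as a no-op here to stay self-contained.
        return inps
```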
Usage
When evaluating quantized model quality on standard NLP benchmarks:
- Wrap the quantized model and tokenizer in the adapter
- Pass the adapter to the lm-evaluation-harness evaluation pipeline
- Run standard benchmark tasks (e.g., PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge)
- Compare results against published baselines
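The steps above can be sketched as a small helper that assembles the evaluation call. Hedged sketch: the task names and the `evaluator.simple_evaluate` invocation follow the v0.3-era lm-evaluation-harness API and may differ in newer versions; the helper function itself is hypothetical.

```python
# Standard benchmark tasks from the usage steps above (v0.3-era task names).
STANDARD_TASKS = ["piqa", "hellaswag", "winogrande", "arc_easy", "arc_challenge"]


def build_eval_kwargs(adapter, tasks=None, num_fewshot=0):
    """Assemble keyword arguments for lm_eval.evaluator.simple_evaluate.

    `adapter` is a BaseLM-conforming wrapper around the quantized model.
    """
    return {
        "model": adapter,
        "tasks": tasks or STANDARD_TASKS,
        "num_fewshot": num_fewshot,
    }


# Actual invocation (requires lm-evaluation-harness installed):
#   from lm_eval import evaluator
#   results = evaluator.simple_evaluate(**build_eval_kwargs(adapter))
# The returned dict contains per-task metrics that can then be compared
# against published baselines.
```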
Related Pages
Knowledge Sources
- Repo|llm-awq|https://github.com/mit-han-lab/llm-awq
- Repo|lm-eval-harness|https://github.com/EleutherAI/lm-evaluation-harness
Domains
- NLP
- Evaluation