Principle:Turboderp org Exllamav2 Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Benchmark evaluation measures how well a quantized or otherwise optimized model preserves the capabilities of its full-precision counterpart by running both on the same standardized tasks and comparing their scores.
Description
When models are quantized (e.g., to EXL2 format) or loaded with different precision settings, benchmark evaluation provides objective metrics to assess quality degradation. ExLlamaV2 includes CLI tools for two major benchmarks:
- HumanEval: A code generation benchmark consisting of 164 Python programming problems. The model generates a function implementation from a signature and docstring, and each generated solution is executed against the problem's unit tests. The pass@k metric estimates the probability that at least one of k sampled solutions passes all tests.
- MMLU (Massive Multitask Language Understanding): A multiple-choice benchmark covering 57 subjects across STEM, humanities, social sciences, and more. The model selects the correct answer from four choices. Accuracy is computed per-subject and averaged.
These benchmarks are particularly important for quantization workflows, where users need to verify that lower-bit representations maintain acceptable accuracy.
Usage
Benchmark evaluation is used in these contexts:
- Quantization validation: Compare EXL2 quantized model accuracy against the original FP16 model
- Bits-per-weight selection: Evaluate different quantization levels (e.g., 3.0, 4.0, 5.0 bpw) to find the best accuracy/size trade-off
- Model comparison: Compare different model architectures or fine-tunes on standardized tasks
- Regression testing: Verify that engine updates do not degrade generation quality
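The bits-per-weight selection described above can be sketched as a simple threshold rule: keep the smallest quantization level whose benchmark score stays within a tolerated drop from the full-precision baseline. This is only an illustration. The function name `smallest_acceptable_bpw`, its parameters, and the tolerance rule are assumptions made for this sketch and are not part of ExLlamaV2.

```python
from typing import Mapping, Optional


def smallest_acceptable_bpw(
    scores: Mapping[float, float],
    baseline: float,
    max_drop: float,
) -> Optional[float]:
    """Return the lowest bpw whose score is within max_drop of the baseline.

    scores   -- hypothetical mapping of bits-per-weight to a benchmark score
                (for example MMLU accuracy or pass@1), all on the same scale
    baseline -- score of the original full-precision model
    max_drop -- largest tolerated absolute loss versus the baseline
    """
    acceptable = [
        bpw for bpw, score in scores.items() if baseline - score <= max_drop
    ]
    # No quantization level met the tolerance: signal that with None.
    return min(acceptable) if acceptable else None
```

The scores fed into such a rule would come from running the same benchmark, with the same settings, on each quantized model and on the baseline. An absolute tolerance is one possible choice; a relative tolerance or a per-subject check may suit some workflows better.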
Theoretical Basis
Pass@k Metric (HumanEval)
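# For each problem, generate n code samples
# Count c = number of correct samples (passing all unit tests)
# pass@k = 1 - C(n-c, k) / C(n, k)
# Where C(a, b) is the binomial coefficient; C(n-c, k) = 0 when n - c < k, so pass@k = 1
# Requires k <= n; k = 1 gives single-attempt accuracy (pass@1 = c / n)
# The reported score is the mean of pass@k over all problems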
MMLU Scoring
# For each question with choices A, B, C, D:
# Compare log-probabilities of single tokens: "A", "B", "C", "D"
# predicted = argmax(log_prob("A"), log_prob("B"), log_prob("C"), log_prob("D"))
# accuracy = correct_predictions / total_questions
# Per-subject and overall averages are reported
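The scoring steps above can be sketched in Python as follows. This assumes the four answer-letter log-probabilities have already been extracted from the model for each question; the record layout and the names `predict` and `score` are hypothetical and chosen only for this example. The text does not say whether the overall figure pools all questions or averages the per-subject accuracies, so the sketch returns both.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

CHOICES = ("A", "B", "C", "D")

# Hypothetical record layout: (subject, {letter: log-probability}, correct letter)
Record = Tuple[str, Mapping[str, float], str]


def predict(logprobs: Mapping[str, float]) -> str:
    """Return the answer letter with the highest log-probability."""
    return max(CHOICES, key=lambda letter: logprobs[letter])


def score(records: Iterable[Record]) -> Tuple[Dict[str, float], float, float]:
    """Return (per-subject accuracy, pooled accuracy, mean of subject accuracies)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for subject, logprobs, answer in records:
        total[subject] += 1
        if predict(logprobs) == answer:
            correct[subject] += 1
    if not total:
        raise ValueError("records must not be empty")
    per_subject = {s: correct[s] / total[s] for s in total}
    # Pooled: correct_predictions / total_questions across all subjects.
    pooled = sum(correct.values()) / sum(total.values())
    # Subject-averaged: every subject counts equally, regardless of size.
    subject_mean = sum(per_subject.values()) / len(per_subject)
    return per_subject, pooled, subject_mean
```

The two overall figures differ whenever subjects have different numbers of questions, so it is worth noting which one is being compared between a quantized model and its baseline.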
Related Pages
Implemented By
- Implementation:Turboderp_org_Exllamav2_HumanEval_Benchmark
- Implementation:Turboderp_org_Exllamav2_MMLU_Benchmark