Implementation:Turboderp org Exllamav2 MMLU Benchmark

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Evaluation, Benchmarking
Last Updated	2026-02-15 00:00 GMT

Overview

CLI evaluation script that runs the Massive Multitask Language Understanding (MMLU) benchmark against an ExLlamaV2-quantized model, reporting per-subject accuracy and average confidence.

Description

mmlu.py is a command-line tool that loads an EXL2 model through model_init, creates an ExLlamaV2DynamicGenerator with max_batch_size=1024, and evaluates multiple-choice question answering across MMLU subjects.

Key components:

Answer token mapping -- Maps answer choices A through D to their token IDs via tokenizer.single_id() with a leading space (e.g., " A"). A reverse map (token_rmap) is built for looking up results. The sampler is constrained to only these four tokens using gen_settings.allow_tokens().
Few-shot learning -- The -fs/--fewshot_examples argument (default 5, maximum 5) controls how many solved examples from the dev split are prepended to each question as a prompt prefix. Preprompts are pre-tokenized and cached per subject.
format_question() function -- Formats a question string with its four labeled choices (A, B, C, D) and an optional answer line, returning the formatted text block.
Subject filtering -- The -sub/--subjects argument accepts a comma-separated list of MMLU subject names or all (default). Only matching subjects from the cais/mmlu dataset are tested.
Choice shuffling -- When -shf/--shuffle is passed, the four answer choices are randomly permuted per question and the correct answer index is updated accordingly, which reduces the effect of positional bias on the result.
Job creation -- Each question is enqueued as an ExLlamaV2DynamicJob with max_new_tokens=1 and return_top_tokens=4, returning token probabilities rather than full completions.
Result aggregation -- For each question, the probability assigned to the correct answer is read from the top-K returned tokens. The final output reports the number of correct answers, the total number of questions, the accuracy percentage, and the mean confidence.
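The choice-shuffling step described above can be sketched as follows. This is a minimal illustration, not the code from mmlu.py: the helper name shuffle_choices and its rng parameter are hypothetical. It only shows how the correct answer index can be remapped after the permutation.

```python
import random

def shuffle_choices(choices, answer, rng=random):
    """Permute the choices and return (shuffled_choices, new_answer_index).

    Hypothetical helper: `answer` is the index of the correct choice
    before shuffling; the returned index points at the same choice
    after shuffling.
    """
    order = list(range(len(choices)))
    rng.shuffle(order)                      # order[new_pos] == old_pos
    shuffled = [choices[old] for old in order]
    new_answer = order.index(answer)        # where the correct choice landed
    return shuffled, new_answer

# Example: the correct choice "4" keeps its label-independent identity
rng = random.Random(0)
choices = ["2", "3", "4", "5"]
shuffled, new_answer = shuffle_choices(choices, 2, rng)
print(shuffled, "-> correct:", shuffled[new_answer])
```

Passing a seeded random.Random instance, as in the example, makes a shuffled run reproducible. Whether the script itself seeds its generator is not stated here.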

Usage

Use this script to measure the MMLU accuracy of an EXL2-quantized model. It supports few-shot prompting, subject filtering, and choice shuffling across the 57 MMLU subjects.

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: eval/mmlu.py
Lines: L1-197

Signature

# CLI entry point -- no importable class; executed directly
parser = argparse.ArgumentParser(description="Run MMLU evaluation on EXL2 model")
parser.add_argument("-cs", "--cache_size", type=int, default=None)
parser.add_argument("-cq4", "--cache_q4", action="store_true")
parser.add_argument("-cq6", "--cache_q6", action="store_true")
parser.add_argument("-cq8", "--cache_q8", action="store_true")
parser.add_argument("-sub", "--subjects", type=str, default="all")
parser.add_argument("-fs", "--fewshot_examples", type=int, default=5)
parser.add_argument("-shf", "--shuffle", action="store_true")
model_init.add_args(parser)  # adds the shared model options, including -m / --model_dir

def format_question(question: str, choices: list[str], answer: int | None) -> str:
    ...
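
The body of format_question() is elided above. The following is a plausible sketch that matches the documented behaviour (question text, four labeled choices, optional answer line), together with a hypothetical build_prompt() helper showing how solved dev-split examples could be joined into a few-shot prefix. The exact text layout used by eval/mmlu.py may differ, and the type hints use typing aliases only so the sketch also runs on older Python versions.

```python
from typing import Optional, Sequence, Tuple

LABELS = "ABCD"  # assumed label set, matching the A-D answer tokens

def format_question(question: str, choices: Sequence[str], answer: Optional[int]) -> str:
    """Format one question; append the answer letter only for solved examples."""
    text = question + "\n"
    for label, choice in zip(LABELS, choices):
        text += f"{label}. {choice}\n"
    # No trailing space: the answer tokens carry a leading space (" A", " B", ...)
    text += "Answer:"
    if answer is not None:
        text += f" {LABELS[answer]}\n\n"
    return text

def build_prompt(dev_examples: Sequence[Tuple[str, Sequence[str], int]],
                 question: str, choices: Sequence[str]) -> str:
    """Hypothetical helper: few-shot prefix followed by the unanswered question."""
    preprompt = "".join(format_question(q, c, a) for q, c, a in dev_examples)
    return preprompt + format_question(question, choices, None)

dev = [("What is 1 + 1?", ["1", "2", "3", "4"], 1)]
print(build_prompt(dev, "What is 2 + 2?", ["1", "2", "3", "4"]))
```

Because the prefix depends only on the subject's dev examples, it can be built once per subject and reused for every question in that subject, which is the caching described under Few-shot learning.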

Import

# Script executed directly via CLI
python eval/mmlu.py -m /path/to/model -sub all -fs 5

I/O Contract

Inputs

Name	Type	Required	Description
-m / --model_dir	str	Yes	Path to EXL2/HuggingFace model directory (via model_init)
-cs / --cache_size	int	No	Override KV cache sequence length
-cq4 / -cq6 / -cq8	flag	No	Use a Q4, Q6, or Q8 quantized KV cache, respectively
-sub / --subjects	str	No (default "all")	Comma-separated list of MMLU subjects or "all"
-fs / --fewshot_examples	int	No (default 5)	Number of few-shot examples to prepend (0-5)
-shf / --shuffle	flag	No	Randomly shuffle the four answer choices per question
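
As an illustration of how the -sub value might be interpreted, the sketch below splits the comma-separated string and keeps only the names present in the dataset. The function name select_subjects and the sample subject list are hypothetical, and the script's own handling of unknown names or whitespace is not documented here.

```python
def select_subjects(arg, available):
    """Return the subjects to test, preserving the dataset's order.

    `arg` is the raw -sub value ("all" or a comma-separated list);
    `available` is the list of subject names found in the dataset.
    """
    if arg == "all":
        return list(available)
    wanted = {name.strip() for name in arg.split(",") if name.strip()}
    return [name for name in available if name in wanted]

available = ["abstract_algebra", "anatomy", "college_mathematics"]
print(select_subjects("abstract_algebra,college_mathematics", available))
print(select_subjects("all", available))
```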

Outputs

Name	Type	Description
Accuracy summary	text	Printed to console: "Correct answers: X/Y = Z%" showing count and percentage
Confidence summary	text	Printed to console: "Confidence: Z%" showing mean probability assigned to correct answer
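
The two summary lines can be derived from per-question results roughly as sketched below. This is an assumption-laden illustration rather than the script's code: the token IDs are made up, the helper names are hypothetical, the top-K tokens are modelled as plain (token_id, probability) pairs, and the two-decimal formatting is a choice made for the example.

```python
def score_question(top_tokens, token_rmap, answer):
    """Return (is_correct, confidence) for one question.

    top_tokens: list of (token_id, probability) pairs for the single
                generated position (assumed shape).
    token_rmap: maps an answer token ID back to its choice index 0-3.
    answer:     index of the correct choice.
    """
    best_id, _ = max(top_tokens, key=lambda pair: pair[1])
    confidence = 0.0
    for token_id, prob in top_tokens:
        if token_rmap.get(token_id) == answer:
            confidence = prob          # probability given to the correct choice
    return token_rmap.get(best_id) == answer, confidence

def summarize(scored):
    """Build the two summary lines from a list of (is_correct, confidence)."""
    total = len(scored)
    correct = sum(1 for ok, _ in scored if ok)
    accuracy = correct / total if total else 0.0
    mean_conf = sum(conf for _, conf in scored) / total if total else 0.0
    return (f"Correct answers: {correct}/{total} = {accuracy:.2%}",
            f"Confidence: {mean_conf:.2%}")

# Made-up token IDs standing in for " A", " B", " C", " D"
token_rmap = {319: 0, 350: 1, 315: 2, 360: 3}
scored = [
    score_question([(350, 0.75), (319, 0.125), (315, 0.0625), (360, 0.0625)], token_rmap, 1),
    score_question([(319, 0.5), (360, 0.25), (350, 0.125), (315, 0.125)], token_rmap, 3),
]
for line in summarize(scored):
    print(line)
```

Note that confidence is averaged over all questions, including those answered incorrectly, so it can be read as the mean probability mass placed on the correct choice.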

Usage Examples

Full MMLU 5-Shot Evaluation

# Run full MMLU with 5-shot prompting
python eval/mmlu.py \
    -m /models/llama3-70b-exl2 \
    -sub all \
    -fs 5

Filtered Subjects with Shuffled Choices

# Evaluate only specific subjects with Q4 cache and shuffled choices
python eval/mmlu.py \
    -m /models/mistral-7b-exl2 \
    -sub abstract_algebra,college_mathematics \
    -fs 3 \
    -shf \
    -cq4

Zero-Shot Evaluation

# Zero-shot evaluation (no few-shot examples)
python eval/mmlu.py \
    -m /models/qwen2-72b-exl2 \
    -fs 0

Related Pages

Implements Principle

Principle:Turboderp_org_Exllamav2_Benchmark_Evaluation

Requires Environment

Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime

