Implementation:Turboderp org Exllamav2 MMLU Benchmark

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Benchmarking
Last Updated: 2026-02-15 00:00 GMT

Overview

CLI evaluation script that runs the Massive Multitask Language Understanding (MMLU) benchmark against an ExLlamaV2-quantized model, reporting per-subject accuracy and average confidence.

Description

mmlu.py is a command-line tool that loads an EXL2 model through model_init, creates an ExLlamaV2DynamicGenerator with max_batch_size=1024, and evaluates multiple-choice question answering across MMLU subjects.

Key components:

  • Answer token mapping -- Maps answer choices A through D to their token IDs via tokenizer.single_id() with a leading space (e.g., " A"). A reverse map (token_rmap) is built for looking up results. The sampler is constrained to only these four tokens using gen_settings.allow_tokens().
  • Few-shot prompting -- The -fs/--fewshot_examples argument (default 5, maximum 5) sets how many solved examples from the dev split are prepended to each question as a prompt prefix. Preprompts are pre-tokenized and cached per subject.
  • format_question() function -- Formats a question string with its four labeled choices (A, B, C, D) and an optional answer line, returning the formatted text block.
  • Subject filtering -- The -sub/--subjects argument accepts a comma-separated list of MMLU subject names or all (default). Only matching subjects from the cais/mmlu dataset are tested.
  • Choice shuffling -- When -shf/--shuffle is passed, the four answer choices are randomly permuted per question and the correct answer index is updated to match, which reduces the influence of positional bias on the results.
  • Job creation -- Each question is enqueued as an ExLlamaV2DynamicJob with max_new_tokens=1 and return_top_tokens=4, returning token probabilities rather than full completions.
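  • Result aggregation -- For each question, the probability assigned to the correct answer (its confidence) is read from the top-K tokens returned by the job, with token IDs mapped back to choices via token_rmap. The final output reports the number of correct answers, the total number of questions, the accuracy percentage, and the mean confidence.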
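The choice-shuffling step described in the list above can be illustrated with a minimal sketch. It assumes the permutation is drawn with Python's random module and that the correct answer is stored as an index into the choices list. The helper name shuffle_choices is hypothetical and is not taken from mmlu.py.

```python
import random

def shuffle_choices(choices, answer, rng=random):
    """Return (shuffled_choices, new_answer_index).

    Hypothetical helper: permutes the choices and tracks where the
    originally correct choice ends up.
    """
    order = list(range(len(choices)))   # order[new_pos] == old_pos
    rng.shuffle(order)
    shuffled = [choices[old] for old in order]
    new_answer = order.index(answer)    # new position of the correct choice
    return shuffled, new_answer

rng = random.Random(0)  # fixed seed only to make the example repeatable
choices = ["Paris", "Rome", "Madrid", "Berlin"]
shuffled, new_answer = shuffle_choices(choices, 0, rng)
print(shuffled, "-> correct label:", "ABCD"[new_answer])
```

Whatever the permutation, shuffled[new_answer] is still the originally correct choice, so scoring is unaffected by the reordering.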

Usage

Use this script to measure MMLU accuracy of any EXL2-quantized model. It supports few-shot prompting, subject filtering, and choice shuffling for robust evaluation of language understanding across 57 subjects.

Code Reference

Source Location

Signature

# CLI entry point -- no importable class; executed directly
import argparse

parser = argparse.ArgumentParser(description="Run MMLU evaluation on EXL2 model")
# KV cache options
parser.add_argument("-cs", "--cache_size", type=int, default=None)
parser.add_argument("-cq4", "--cache_q4", action="store_true")
parser.add_argument("-cq6", "--cache_q6", action="store_true")
parser.add_argument("-cq8", "--cache_q8", action="store_true")
# Benchmark options
parser.add_argument("-sub", "--subjects", type=str, default="all")
parser.add_argument("-fs", "--fewshot_examples", type=int, default=5)
parser.add_argument("-shf", "--shuffle", action="store_true")
# -m/--model_dir is not declared here; it is supplied via model_init (see Inputs)

def format_question(question: str, choices: list[str], answer: int | None) -> str:
    ...
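The body of format_question is elided in the signature. A minimal sketch consistent with the description (question text, four labeled choices, optional answer line) might look like the following. The exact separators, the "Answer:" wording, the build_preprompt helper, and the question/choices/answer field names are assumptions for illustration, not confirmed details of mmlu.py. Optional[int] is used in place of int | None so that the sketch also runs on older Python versions.

```python
from typing import List, Optional

C_OPTIONS = "ABCD"

def format_question(question: str, choices: List[str], answer: Optional[int]) -> str:
    """Format one question; append the answer letter only for solved examples."""
    text = question + "\n"
    for label, choice in zip(C_OPTIONS, choices):
        text += f"{label}. {choice}\n"
    if answer is not None:
        text += f"Answer: {C_OPTIONS[answer]}\n\n"   # solved few-shot example
    else:
        text += "Answer:"                            # left open for the model
    return text

def build_preprompt(examples: List[dict], n: int) -> str:
    """Hypothetical helper: join the first n solved dev examples into a prefix."""
    return "".join(
        format_question(e["question"], e["choices"], e["answer"])
        for e in examples[:n]
    )

dev = [{"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1}]
prompt = build_preprompt(dev, 1) + format_question(
    "What is 3 + 3?", ["5", "6", "7", "8"], None
)
print(prompt)
```

Ending the open question on "Answer:" with no trailing space fits the answer token mapping described earlier, where each choice token carries a leading space (e.g., " A").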

Import

# Script executed directly via CLI
python eval/mmlu.py -m /path/to/model -sub all -fs 5

I/O Contract

Inputs

Name | Type | Required | Description
-m / --model_dir | str | Yes | Path to EXL2/HuggingFace model directory (via model_init)
-cs / --cache_size | int | No | Override KV cache sequence length
-cq4 / -cq6 / -cq8 | flag | No | Use Q4, Q6, or Q8 quantized KV cache respectively
-sub / --subjects | str | No (default "all") | Comma-separated list of MMLU subjects or "all"
-fs / --fewshot_examples | int | No (default 5) | Number of few-shot examples to prepend (0-5)
-shf / --shuffle | flag | No | Randomly shuffle the four answer choices per question
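As an illustration of how the -sub/--subjects value could be interpreted, the sketch below splits the comma-separated string and keeps only names that exist in the dataset. The select_subjects helper and the whitespace handling are assumptions; the script's own parsing may differ.

```python
def select_subjects(arg, available):
    """Hypothetical helper: resolve a --subjects value against known subject names."""
    if arg.strip() == "all":
        return list(available)
    wanted = {name.strip() for name in arg.split(",") if name.strip()}
    # Preserve the dataset's ordering and silently skip unknown names
    return [name for name in available if name in wanted]

available = ["abstract_algebra", "anatomy", "college_mathematics"]
print(select_subjects("abstract_algebra,college_mathematics", available))
# → ['abstract_algebra', 'college_mathematics']
```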

Outputs

Name | Type | Description
Accuracy summary | text | Printed to console: "Correct answers: X/Y = Z%" showing count and percentage
Confidence summary | text | Printed to console: "Confidence: Z%" showing mean probability assigned to correct answer
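The two summary figures can be reproduced from per-question top-token probabilities with a short sketch. The token IDs below are made-up stand-ins for the values tokenizer.single_id(" A") through tokenizer.single_id(" D") would return. The score_question and summarize helpers, the use of the highest-probability choice as the prediction, and the two-decimal formatting are assumptions rather than confirmed details of mmlu.py.

```python
# Hypothetical token IDs standing in for tokenizer.single_id(" A") .. (" D")
token_map = [319, 350, 315, 360]
token_rmap = {token_id: index for index, token_id in enumerate(token_map)}

def score_question(top_tokens, answer):
    """top_tokens: (token_id, probability) pairs; answer: choice index 0-3.

    Returns (is_correct, probability assigned to the correct choice).
    """
    probs = {token_rmap[t]: p for t, p in top_tokens if t in token_rmap}
    predicted = max(probs, key=probs.get)   # most probable choice
    return predicted == answer, probs.get(answer, 0.0)

def summarize(scored):
    """Aggregate (is_correct, confidence) pairs into the reported totals."""
    total = len(scored)
    correct = sum(1 for is_correct, _ in scored if is_correct)
    mean_confidence = sum(conf for _, conf in scored) / total
    return correct, total, correct / total, mean_confidence

scored = [
    score_question([(350, 0.7), (319, 0.1), (315, 0.1), (360, 0.1)], answer=1),
    score_question([(319, 0.5), (350, 0.2), (315, 0.2), (360, 0.1)], answer=2),
]
correct, total, accuracy, confidence = summarize(scored)
print(f"Correct answers: {correct}/{total} = {accuracy * 100:.2f}%")
print(f"Confidence: {confidence * 100:.2f}%")
```

Accuracy counts only whether the top choice was right, while confidence also reflects how much probability mass the model placed on the correct choice when it answered wrongly.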

Usage Examples

Full MMLU 5-Shot Evaluation

# Run full MMLU with 5-shot prompting
python eval/mmlu.py \
    -m /models/llama3-70b-exl2 \
    -sub all \
    -fs 5

Filtered Subjects with Shuffled Choices

# Evaluate only specific subjects with Q4 cache and shuffled choices
python eval/mmlu.py \
    -m /models/mistral-7b-exl2 \
    -sub abstract_algebra,college_mathematics \
    -fs 3 \
    -shf \
    -cq4

Zero-Shot Evaluation

# Zero-shot evaluation (no few-shot examples)
python eval/mmlu.py \
    -m /models/qwen2-72b-exl2 \
    -fs 0

Related Pages

Implements Principle

Requires Environment

Depends On
