
Implementation:FlagOpen FlagEmbedding LLM Embedder Eval MMLU

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, Question Answering, Model Evaluation
Last Updated 2026-02-09 00:00 GMT

Overview

An evaluation framework for measuring language model performance on the MMLU (Massive Multitask Language Understanding) benchmark with optional retrieval augmentation.

Description

This module evaluates language models on MMLU, a comprehensive benchmark covering 57 subjects across STEM, Social Sciences, Humanities, and other domains. Each question is multiple-choice with 4 options (A, B, C, D).

The framework supports retrieval-augmented generation, where external knowledge is retrieved and prepended to questions. It implements perplexity-based evaluation: the model's likelihood of each option is computed, and the option with the highest likelihood (lowest perplexity) is selected as the answer. Results are aggregated by subject and grouped into 4 categories (STEM, Social Sciences, Humanities, Others) plus an overall average.

Key features include optional few-shot examples (0-5, drawn from the development set), retrieval integration (dense, BM25, or no retrieval), left-side truncation that preserves the question and options, and automatic batching with label masking so that perplexity is computed only over the answer tokens.
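
The perplexity-based selection with label masking can be sketched as follows. This is an illustrative, self-contained version, not the module's actual code: it assumes per-token log-probabilities are already available and uses the common convention of marking prompt tokens with a label of -100 so that only answer tokens are scored.

```python
def option_nll(token_logprobs, labels):
    """Mean negative log-likelihood over answer tokens only.

    token_logprobs: per-token log-probabilities for the full sequence.
    labels: same length; -100 marks prompt tokens to ignore, any other
    value marks answer tokens to score.
    """
    scored = [-lp for lp, lab in zip(token_logprobs, labels) if lab != -100]
    return sum(scored) / len(scored)

def pick_option(per_option_logprobs, per_option_labels,
                options=("A", "B", "C", "D")):
    """Select the option whose answer tokens have the lowest mean NLL
    (equivalently, the lowest perplexity / highest likelihood)."""
    nlls = [option_nll(lp, lab)
            for lp, lab in zip(per_option_logprobs, per_option_labels)]
    return options[min(range(len(nlls)), key=nlls.__getitem__)]
```

Masking with -100 mirrors the `ignore_index` convention used by common loss implementations, so the prompt contributes nothing to each option's score.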

Usage

Use this module to evaluate language models on diverse academic knowledge, to measure the impact of retrieval augmentation on factual question answering, or to benchmark model performance across different subject areas and difficulty levels.

Code Reference

Source Location

Signature

def main():
    """Main evaluation function for MMLU"""

def process_mmlu(
    tokenizer, context_max_length: int = 2048, key_num: int = 3,
    few_shot: int = 0, train_data: str = None, cache_dir: str = None,
    is_encoder_decoder: bool = False, add_llama_inst: bool = False
):
    """Create data processing function for MMLU"""

def evaluate_mmlu(eval_data: str, save_path: str, **kwds):
    """Compute MMLU metrics grouped by category"""

Import

from research.llm_embedder.evaluation.eval_mmlu import main, process_mmlu, evaluate_mmlu

I/O Contract

Inputs

Name Type Required Description
eval_data str Yes Test data JSON file
model_name_or_path str Yes LM to evaluate
retrieval_method str No dense/bm25/no (default: no)
few_shot int No Number of dev examples (default: 0)
key_num int No Top-k retrieved docs (default: 3)
context_max_length int No Max context length (default: 2048)

Outputs

Name Type Description
metrics dict Accuracy by category: STEM, Social Sciences, Humanities, Others, All
predictions list Per-sample predictions saved to output_dir

Data Format

Input Sample

{
    "query_id": "abstract_algebra_0",
    "subject": "abstract_algebra",
    "query": "Find the degree for the given field extension Q(sqrt(2)) over Q.",
    "choices": ["0", "2", "1", "Infinite"],
    "answer": 1,  # Index of correct answer (B)
    "key": ["Retrieved passage 1...", "Retrieved passage 2...", ...]  # Optional
}
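
Left-side truncation keeps the tail of the prompt (the question and options) intact when retrieved passages overflow the context budget. A minimal sketch on token-id lists, assuming the context sits to the left of the question (illustrative; the module applies this through the tokenizer):

```python
def truncate_left(context_ids, question_ids, max_length):
    """Drop tokens from the left of the retrieved context so that the
    question and options, which sit at the end of the prompt, always fit."""
    budget = max_length - len(question_ids)
    if budget < 0:
        raise ValueError("question alone exceeds max_length")
    # Keep only the rightmost `budget` context tokens.
    kept = context_ids[-budget:] if budget > 0 else []
    return kept + question_ids
```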

Formatted Prompt

"""
The following are multiple choice questions (with answers) about abstract algebra.

<Few-shot examples if few_shot > 0>

Knowledge:
<Retrieved passages if key_num > 0>

Find the degree for the given field extension Q(sqrt(2)) over Q.
A. 0
B. 2
C. 1
D. Infinite
Answer: B
"""

Subject Categories

SUBJECT_2_CATEGORY = {
    # STEM (26 subjects)
    "abstract_algebra": "STEM",
    "astronomy": "STEM",
    "college_biology": "STEM",
    "college_chemistry": "STEM",
    "college_computer_science": "STEM",
    "college_mathematics": "STEM",
    "college_physics": "STEM",
    "computer_security": "STEM",
    # ... 18 more STEM subjects

    # Social Sciences (13 subjects)
    "econometrics": "Social Sciences",
    "high_school_geography": "Social Sciences",
    "high_school_government_and_politics": "Social Sciences",
    # ... 10 more Social Sciences subjects

    # Humanities (12 subjects)
    "formal_logic": "Humanities",
    "high_school_european_history": "Humanities",
    "high_school_us_history": "Humanities",
    # ... 9 more Humanities subjects

    # Others (6 subjects)
    "anatomy": "others",
    "clinical_knowledge": "others",
    "medical_genetics": "others",
    # ... 3 more others subjects
}
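
Given a subject-to-category mapping like the one above, per-category accuracy plus the overall average can be aggregated as in this sketch (illustrative; `evaluate_mmlu` performs the equivalent grouping):

```python
from collections import defaultdict

def aggregate_by_category(records, subject_2_category):
    """records: iterable of dicts with 'subject', 'prediction', 'answer'.
    Returns accuracy per category plus the overall 'All' average."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        cat = subject_2_category[r["subject"]]
        hit = int(r["prediction"] == r["answer"])
        for key in (cat, "All"):
            total[key] += 1
            correct[key] += hit
    return {c: correct[c] / total[c] for c in total}
```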

Usage Examples

Zero-Shot Evaluation

python research/llm_embedder/evaluation/eval_mmlu.py \
    --eval_data llm-embedder:qa/mmlu/test.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method no \
    --few_shot 0 \
    --output_dir data/results/mmlu/zero_shot \
    --lm_batch_size 2

Few-Shot with Retrieval

python research/llm_embedder/evaluation/eval_mmlu.py \
    --eval_data llm-embedder:qa/mmlu/test.json \
    --train_data llm-embedder:qa/mmlu/dev.json \
    --corpus llm-embedder:qa/msmarco/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --query_encoder BAAI/llm-embedder \
    --retrieval_method dense \
    --few_shot 5 \
    --key_num 3 \
    --context_max_length 2048 \
    --output_dir data/results/mmlu/5shot_retrieval

Analyze Results by Category

import json

with open('data/results/mmlu/result.json') as f:
    results = json.load(f)

print(f"STEM: {results['STEM']:.1%}")
print(f"Social Sciences: {results['Social Sciences']:.1%}")
print(f"Humanities: {results['Humanities']:.1%}")
print(f"Others: {results['Others']:.1%}")
print(f"Overall: {results['All']:.1%}")

# Example output:
# STEM: 45.2%
# Social Sciences: 52.8%
# Humanities: 48.6%
# Others: 50.3%
# Overall: 48.9%
