Implementation: FlagOpen FlagEmbedding LLM Embedder Eval MMLU
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Question Answering, Model Evaluation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation framework for measuring language model performance on the MMLU (Massive Multitask Language Understanding) benchmark with optional retrieval augmentation.
Description
This module evaluates language models on MMLU, a comprehensive benchmark covering 57 subjects across STEM, Social Sciences, Humanities, and other domains. Each question is multiple-choice with 4 options (A, B, C, D).
The framework supports retrieval-augmented generation, where external knowledge can be retrieved and prepended to questions. It implements perplexity-based evaluation: the model's likelihood of each option is computed, and the option with the lowest perplexity (i.e., the highest likelihood) is selected as the answer. Results are aggregated by subject and grouped into four categories (STEM, Social Sciences, Humanities, Others) plus an overall average.
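The selection rule described above can be sketched as follows; `select_option` is a hypothetical helper for illustration, not part of the module's API:

```python
def select_option(option_nlls):
    """Pick the option whose answer tokens have the lowest summed
    negative log-likelihood, i.e. the highest model likelihood."""
    return min(option_nlls, key=option_nlls.get)

# Toy negative log-likelihoods: lower means the model prefers that option.
scores = {"A": 3.2, "B": 1.1, "C": 2.7, "D": 4.0}
print(select_option(scores))  # -> B
```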
Key features include optional few-shot examples (0-5) drawn from the development set, retrieval integration (dense, BM25, or no retrieval), left-truncation that preserves the question and options, and automatic batching with label masking so that perplexity is computed only on the answer tokens.
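The left-truncation and label-masking behavior can be sketched with hypothetical helpers (the real script operates on tokenizer output; the names and exact mechanics here are illustrative):

```python
def truncate_left(context_ids, tail_ids, max_length):
    """Drop tokens from the left of the retrieved context so that the
    question and options (tail_ids, which come last) always survive."""
    budget = max_length - len(tail_ids)
    kept = context_ids[-budget:] if budget > 0 else []
    return kept + tail_ids

def mask_labels(input_ids, answer_len):
    """Replace every non-answer position with -100 (the ignore index of
    PyTorch cross-entropy) so perplexity covers only the answer tokens."""
    return [-100] * (len(input_ids) - answer_len) + input_ids[-answer_len:]

seq = truncate_left([1, 2, 3, 4], [5, 6], max_length=5)  # -> [2, 3, 4, 5, 6]
labels = mask_labels(seq, answer_len=1)                  # -> [-100, -100, -100, -100, 6]
```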
Usage
Use this module to evaluate language models on diverse academic knowledge, to measure the impact of retrieval augmentation on factual question answering, or to benchmark model performance across different subject areas and difficulty levels.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/eval_mmlu.py
- Lines: 1-319
Signature
def main():
    """Main evaluation function for MMLU"""

def process_mmlu(
    tokenizer, context_max_length: int = 2048, key_num: int = 3,
    few_shot: int = 0, train_data: str = None, cache_dir: str = None,
    is_encoder_decoder: bool = False, add_llama_inst: bool = False
):
    """Create data processing function for MMLU"""

def evaluate_mmlu(eval_data: str, save_path: str, **kwds):
    """Compute MMLU metrics grouped by category"""
Import
from research.llm_embedder.evaluation.eval_mmlu import main, process_mmlu, evaluate_mmlu
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Test data JSON file |
| model_name_or_path | str | Yes | LM to evaluate |
| retrieval_method | str | No | dense/bm25/no (default: no) |
| few_shot | int | No | Number of dev examples (default: 0) |
| key_num | int | No | Top-k retrieved docs (default: 3) |
| context_max_length | int | No | Max context length (default: 2048) |
Outputs
| Name | Type | Description |
|---|---|---|
| metrics | dict | Accuracy by category: STEM, Social Sciences, Humanities, Others, All |
| predictions | list | Per-sample predictions saved to output_dir |
Data Format
Input Sample
{
  "query_id": "abstract_algebra_0",
  "subject": "abstract_algebra",
  "query": "Find the degree for the given field extension Q(sqrt(2)) over Q.",
  "choices": ["0", "2", "1", "Infinite"],
  "answer": 1,  # Index of correct answer (B)
  "key": ["Retrieved passage 1...", "Retrieved passage 2...", ...]  # Optional
}
Formatted Prompt
"""
The following are multiple choice questions (with answers) about abstract algebra.
<Few-shot examples if few_shot > 0>
Knowledge:
<Retrieved passages if key_num > 0>
Find the degree for the given field extension Q(sqrt(2)) over Q.
A. 0
B. 2
C. 1
D. Infinite
Answer: B
"""
Subject Categories
SUBJECT_2_CATEGORY = {
    # STEM (26 subjects)
    "abstract_algebra": "STEM",
    "astronomy": "STEM",
    "college_biology": "STEM",
    "college_chemistry": "STEM",
    "college_computer_science": "STEM",
    "college_mathematics": "STEM",
    "college_physics": "STEM",
    "computer_security": "STEM",
    # ... 18 more STEM subjects
    # Social Sciences (13 subjects)
    "econometrics": "Social Sciences",
    "high_school_geography": "Social Sciences",
    "high_school_government_and_politics": "Social Sciences",
    # ... 10 more Social Sciences subjects
    # Humanities (12 subjects)
    "formal_logic": "Humanities",
    "high_school_european_history": "Humanities",
    "high_school_us_history": "Humanities",
    # ... 9 more Humanities subjects
    # Others (6 subjects)
    "anatomy": "others",
    "clinical_knowledge": "others",
    "medical_genetics": "others",
    # ... 3 more others subjects
}
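Aggregation over this mapping can be sketched as follows; the mapping is abbreviated (the full dictionary covers all 57 subjects) and `accuracy_by_category` is an illustrative helper, not the module's function:

```python
from collections import defaultdict

# Abbreviated mapping for illustration; one subject per category.
SUBJECT_2_CATEGORY = {
    "abstract_algebra": "STEM",
    "econometrics": "Social Sciences",
    "formal_logic": "Humanities",
    "anatomy": "others",
}

def accuracy_by_category(records):
    """records: iterable of (subject, correct: bool) pairs. Returns
    per-category accuracy plus an overall 'All' average, mirroring
    the output schema described above."""
    hits, totals = defaultdict(int), defaultdict(int)
    for subject, correct in records:
        category = SUBJECT_2_CATEGORY[subject]
        for key in (category, "All"):
            hits[key] += bool(correct)
            totals[key] += 1
    return {key: hits[key] / totals[key] for key in totals}
```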
Usage Examples
Zero-Shot Evaluation
python research/llm_embedder/evaluation/eval_mmlu.py \
    --eval_data llm-embedder:qa/mmlu/test.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method no \
    --few_shot 0 \
    --output_dir data/results/mmlu/zero_shot \
    --lm_batch_size 2
Few-Shot with Retrieval
python research/llm_embedder/evaluation/eval_mmlu.py \
    --eval_data llm-embedder:qa/mmlu/test.json \
    --train_data llm-embedder:qa/mmlu/dev.json \
    --corpus llm-embedder:qa/msmarco/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --query_encoder BAAI/llm-embedder \
    --retrieval_method dense \
    --few_shot 5 \
    --key_num 3 \
    --context_max_length 2048 \
    --output_dir data/results/mmlu/5shot_retrieval
Analyze Results by Category
import json

with open('data/results/mmlu/result.json') as f:
    results = json.load(f)

print(f"STEM: {results['STEM']:.1%}")
print(f"Social Sciences: {results['Social Sciences']:.1%}")
print(f"Humanities: {results['Humanities']:.1%}")
print(f"Others: {results['Others']:.1%}")
print(f"Overall: {results['All']:.1%}")
# Example output:
# STEM: 45.2%
# Social Sciences: 52.8%
# Humanities: 48.6%
# Others: 50.3%
# Overall: 48.9%