Implementation: FlagOpen FlagEmbedding LLM Embedder Eval MMLU
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Question Answering, Model Evaluation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation framework for measuring language model performance on the MMLU (Massive Multitask Language Understanding) benchmark with optional retrieval augmentation.
Description
This module evaluates language models on MMLU, a comprehensive benchmark covering 57 subjects across STEM, Social Sciences, Humanities, and other domains. Each question is multiple-choice with 4 options (A, B, C, D).
The framework supports retrieval-augmented generation, where external knowledge can be retrieved and prepended to questions. It implements perplexity-based evaluation: the model's likelihood of each option is computed, and the option with the lowest perplexity (i.e., the highest likelihood) is selected as the answer. Results are aggregated by subject and grouped into four categories (STEM, Social Sciences, Humanities, Others) plus an overall average.
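The selection rule described above can be sketched as follows; `select_option` is a hypothetical helper for illustration, not part of the module's API:

```python
def select_option(option_nlls):
    """Pick the option whose answer tokens have the lowest summed
    negative log-likelihood, i.e. the highest model likelihood."""
    return min(option_nlls, key=option_nlls.get)

# Toy negative log-likelihoods: lower means the model prefers that option.
scores = {"A": 3.2, "B": 1.1, "C": 2.7, "D": 4.0}
print(select_option(scores))  # -> B
```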
Key features include optional few-shot examples (0-5) drawn from the development set, retrieval integration (dense, BM25, or no retrieval), left-truncation that preserves the question and options, and automatic batching with label masking so that perplexity is computed only on the answer tokens.
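The left-truncation and label-masking behavior can be sketched with hypothetical helpers (the real script operates on tokenizer output; the names and exact mechanics here are illustrative):

```python
def truncate_left(context_ids, tail_ids, max_length):
    """Drop tokens from the left of the retrieved context so that the
    question and options (tail_ids, which come last) always survive."""
    budget = max_length - len(tail_ids)
    kept = context_ids[-budget:] if budget > 0 else []
    return kept + tail_ids

def mask_labels(input_ids, answer_len):
    """Replace every non-answer position with -100 (the ignore index of
    PyTorch cross-entropy) so perplexity covers only the answer tokens."""
    return [-100] * (len(input_ids) - answer_len) + input_ids[-answer_len:]

seq = truncate_left([1, 2, 3, 4], [5, 6], max_length=5)  # -> [2, 3, 4, 5, 6]
labels = mask_labels(seq, answer_len=1)                  # -> [-100, -100, -100, -100, 6]
```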
Usage
Use this module to evaluate language models on diverse academic knowledge, to measure the impact of retrieval augmentation on factual question answering, or to benchmark model performance across different subject areas and difficulty levels.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/eval_mmlu.py
- Lines: 1-319
Signature
def main():
    """Main evaluation function for MMLU"""

def process_mmlu(
    tokenizer, context_max_length: int = 2048, key_num: int = 3,
    few_shot: int = 0, train_data: str = None, cache_dir: str = None,
    is_encoder_decoder: bool = False, add_llama_inst: bool = False
):
    """Create data processing function for MMLU"""

def evaluate_mmlu(eval_data: str, save_path: str, **kwds):
    """Compute MMLU metrics grouped by category"""
Import
from research.llm_embedder.evaluation.eval_mmlu import main, process_mmlu, evaluate_mmlu
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Test data JSON file |
| model_name_or_path | str | Yes | LM to evaluate |
| retrieval_method | str | No | dense/bm25/no (default: no) |
| few_shot | int | No | Number of dev examples (default: 0) |
| key_num | int | No | Top-k retrieved docs (default: 3) |
| context_max_length | int | No | Max context length (default: 2048) |
Outputs
| Name | Type | Description |
|---|---|---|
| metrics | dict | Accuracy by category: STEM, Social Sciences, Humanities, Others, All |
| predictions | list | Per-sample predictions saved to output_dir |
Data Format
Input Sample
{
  "query_id": "abstract_algebra_0",
  "subject": "abstract_algebra",
  "query": "Find the degree for the given field extension Q(sqrt(2)) over Q.",
  "choices": ["0", "2", "1", "Infinite"],
  "answer": 1,  # Index of correct answer (B)
  "key": ["Retrieved passage 1...", "Retrieved passage 2...", ...]  # Optional
}
Formatted Prompt
"""
The following are multiple choice questions (with answers) about abstract algebra.
<Few-shot examples if few_shot > 0>
Knowledge:
<Retrieved passages if key_num > 0>
Find the degree for the given field extension Q(sqrt(2)) over Q.
A. 0
B. 2
C. 1
D. Infinite
Answer: B
"""
Subject Categories
SUBJECT_2_CATEGORY = {
    # STEM (26 subjects)
    "abstract_algebra": "STEM",
    "astronomy": "STEM",
    "college_biology": "STEM",
    "college_chemistry": "STEM",
    "college_computer_science": "STEM",
    "college_mathematics": "STEM",
    "college_physics": "STEM",
    "computer_security": "STEM",
    # ... 18 more STEM subjects
    # Social Sciences (13 subjects)
    "econometrics": "Social Sciences",
    "high_school_geography": "Social Sciences",
    "high_school_government_and_politics": "Social Sciences",
    # ... 10 more Social Sciences subjects
    # Humanities (12 subjects)
    "formal_logic": "Humanities",
    "high_school_european_history": "Humanities",
    "high_school_us_history": "Humanities",
    # ... 9 more Humanities subjects
    # Others (6 subjects)
    "anatomy": "others",
    "clinical_knowledge": "others",
    "medical_genetics": "others",
    # ... 3 more others subjects
}
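Aggregation over this mapping can be sketched as follows; the mapping is abbreviated (the full dictionary covers all 57 subjects) and `accuracy_by_category` is an illustrative helper, not the module's function:

```python
from collections import defaultdict

# Abbreviated mapping for illustration; one subject per category.
SUBJECT_2_CATEGORY = {
    "abstract_algebra": "STEM",
    "econometrics": "Social Sciences",
    "formal_logic": "Humanities",
    "anatomy": "others",
}

def accuracy_by_category(records):
    """records: iterable of (subject, correct: bool) pairs. Returns
    per-category accuracy plus an overall 'All' average, mirroring
    the output schema described above."""
    hits, totals = defaultdict(int), defaultdict(int)
    for subject, correct in records:
        category = SUBJECT_2_CATEGORY[subject]
        for key in (category, "All"):
            hits[key] += bool(correct)
            totals[key] += 1
    return {key: hits[key] / totals[key] for key in totals}
```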
Usage Examples
Zero-Shot Evaluation
python research/llm_embedder/evaluation/eval_mmlu.py \
    --eval_data llm-embedder:qa/mmlu/test.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method no \
    --few_shot 0 \
    --output_dir data/results/mmlu/zero_shot \
    --lm_batch_size 2
Few-Shot with Retrieval
python research/llm_embedder/evaluation/eval_mmlu.py \
    --eval_data llm-embedder:qa/mmlu/test.json \
    --train_data llm-embedder:qa/mmlu/dev.json \
    --corpus llm-embedder:qa/msmarco/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --query_encoder BAAI/llm-embedder \
    --retrieval_method dense \
    --few_shot 5 \
    --key_num 3 \
    --context_max_length 2048 \
    --output_dir data/results/mmlu/5shot_retrieval
Analyze Results by Category
import json

with open('data/results/mmlu/result.json') as f:
    results = json.load(f)

print(f"STEM: {results['STEM']:.1%}")
print(f"Social Sciences: {results['Social Sciences']:.1%}")
print(f"Humanities: {results['Humanities']:.1%}")
print(f"Others: {results['Others']:.1%}")
print(f"Overall: {results['All']:.1%}")
# Example output:
# STEM: 45.2%
# Social Sciences: 52.8%
# Humanities: 48.6%
# Others: 50.3%
# Overall: 48.9%