Implementation: FlagOpen FlagEmbedding LLM Embedder Eval ICL
| Knowledge Sources | Details |
|---|---|
| Domains | Natural Language Processing, In-Context Learning, Model Evaluation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation framework for measuring language model performance on in-context learning tasks across 30 diverse datasets spanning 9 task categories.
Description
This module evaluates language models on 30 in-context learning (ICL) tasks organized into 9 categories: Closed-book QA (CQA), Commonsense Reasoning, Coreference Resolution, Paraphrase Detection, Natural Language Inference (NLI), Reading Comprehension, Sentiment Analysis, Data-to-Text Generation, and Summarization.
The evaluation supports two modes: perplexity-based (for classification tasks) and generation-based (for open-ended tasks). It integrates with retrieval systems to provide few-shot examples (0-8 shots) retrieved via dense retrieval, BM25, or random selection. Tasks include ARC, HellaSwag, COPA, WinoGrande, MRPC, SNLI, SQuAD, MultiRC, SST-2, CommonGen, and Gigaword.
The framework supports multiple retrieval methods (dense, BM25, random, same-task-random, or no retrieval), computes task-specific metrics (accuracy, F1, exact match, ROUGE-L), and aggregates results by category and overall average.
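To make the perplexity-based mode concrete, here is a minimal sketch of how an answer is selected for a classification task: the LM scores each candidate continuation, and the one with the lowest perplexity (highest likelihood) wins. The log-probabilities below are toy values, and `pick_answer` is an illustrative helper, not part of the module's API.

```python
import math
from typing import Dict, List

def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the answer tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_answer(candidates: Dict[str, List[float]]) -> str:
    """Choose the candidate the LM finds most likely (lowest perplexity),
    as in perplexity-based classification."""
    return min(candidates, key=lambda c: perplexity(candidates[c]))

# Toy per-token log-probs for two answer options of a COPA-style item.
scores = {
    "because it was raining": [-0.2, -0.1, -0.3, -0.2],
    "because it was sunny":   [-1.5, -2.0, -1.8, -1.2],
}
print(pick_answer(scores))  # -> "because it was raining"
```

Generation-based tasks skip this ranking step entirely: the model generates free-form text, which is then scored with exact match or ROUGE-L.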
Usage
Use this module to evaluate how well language models leverage in-context examples for diverse NLP tasks, to compare different retrieval strategies for ICL example selection, or to benchmark model performance across multiple reasoning and generation capabilities.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/eval_icl.py
- Lines: 1-355
Signature
def main():
    """Main evaluation function for ICL tasks."""

def load_test_data(
    knn_inxs, test_data, corpus_data,
    filter_diff_task: bool = False, example_num: int = 8,
    same_task_random: bool = False
) -> Dict[str, List]:
    """Load test data with retrieved few-shot examples."""
Import
from research.llm_embedder.evaluation.eval_icl import main, load_test_data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Test data JSON file path |
| corpus | str | Yes | Corpus for retrieval |
| model_name_or_path | str | Yes | LM to evaluate |
| retrieval_method | str | No | dense/bm25/random/same-task-random/no (default: dense) |
| few_shot | int | No | Number of examples (default: 8) |
| context_max_length | int | No | Max context length (default: 1024) |
Outputs
| Name | Type | Description |
|---|---|---|
| metrics | dict | Per-category and overall scores |
| results | dict | Per-task detailed results saved to output_dir |
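As an illustration of the aggregation described above, the sketch below averages hypothetical per-task scores into per-category metrics and an overall average. The exact output schema is an assumption; the real script writes per-task results to output_dir.

```python
from statistics import mean

# Hypothetical per-task scores keyed by (category, task).
task_scores = {
    ("NLI", "rte"): 0.71, ("NLI", "snli"): 0.55,
    ("Sentiment", "sst2"): 0.93, ("Sentiment", "yelp"): 0.90,
}

# Group scores by category, then average within and across categories.
by_category = {}
for (category, _task), score in task_scores.items():
    by_category.setdefault(category, []).append(score)

metrics = {cat: round(mean(v), 4) for cat, v in by_category.items()}
metrics["average"] = round(mean(metrics.values()), 4)
print(metrics)  # per-category scores plus the overall average
```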
Task Categories
Supported Tasks
# 9 Categories, 30 Tasks:
CQA = {
"arc_c": {'method': 'perplexity', 'metric': 'acc'},
"arc_e": {'method': 'perplexity', 'metric': 'acc'},
"natural_questions": {'method': 'generation', 'metric': 'em'}
}
Commonsense = {
"copa": {'method': 'perplexity', 'metric': 'acc'},
"hellaswag": {'method': 'perplexity', 'metric': 'acc'},
"piqa": {'method': 'perplexity', 'metric': 'acc'}
}
Coreference = {
"winogrande": {'method': 'perplexity', 'metric': 'acc'},
"wsc": {'method': 'perplexity', 'metric': 'acc'},
"wsc273": {'method': 'perplexity', 'metric': 'acc'}
}
Paraphrase = {
"mrpc": {'method': 'perplexity', 'metric': 'acc'},
"paws": {'method': 'perplexity', 'metric': 'acc'},
"qqp": {'method': 'perplexity', 'metric': 'acc'}
}
NLI = {
"rte": {'method': 'perplexity', 'metric': 'acc'},
"snli": {'method': 'perplexity', 'metric': 'acc'},
"mnli_m": {'method': 'perplexity', 'metric': 'acc'},
"mnli_mm": {'method': 'perplexity', 'metric': 'acc'},
"qnli": {'method': 'perplexity', 'metric': 'acc'}
}
ReadingComp = {
"multirc": {'method': 'perplexity', 'metric': 'f1'},
"openbookqa": {'method': 'perplexity', 'metric': 'acc'},
"boolq": {'method': 'perplexity', 'metric': 'acc'},
"squad_v1": {'method': 'generation', 'metric': 'em'}
}
Sentiment = {
"sentiment140": {'method': 'perplexity', 'metric': 'acc'},
"sst2": {'method': 'perplexity', 'metric': 'acc'},
"yelp": {'method': 'perplexity', 'metric': 'acc'}
}
Data2Text = {
"common_gen": {'method': 'generation', 'metric': 'rl'},
"e2e_nlg": {'method': 'generation', 'metric': 'rl'},
"dart": {'method': 'generation', 'metric': 'rl'}
}
Summarize = {
"aeslc": {'method': 'generation', 'metric': 'rl'},
"ag_news": {'method': 'perplexity', 'metric': 'acc'},
"gigaword": {'method': 'generation', 'metric': 'rl'}
}
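The registry entries above drive how each task is evaluated: the `method` field selects perplexity- or generation-based scoring, and `metric` selects the score to report. A minimal dispatch sketch over a three-task excerpt (the `evaluate` helper is illustrative, not the module's real API):

```python
# Excerpt of the task registry shown above.
TASKS = {
    "rte":      {"method": "perplexity", "metric": "acc"},
    "squad_v1": {"method": "generation", "metric": "em"},
    "gigaword": {"method": "generation", "metric": "rl"},
}

def evaluate(task_name: str) -> str:
    """Route a task to the scoring mode its registry entry specifies."""
    cfg = TASKS[task_name]
    if cfg["method"] == "perplexity":
        return f"{task_name}: rank options by perplexity, report {cfg['metric']}"
    return f"{task_name}: free-form generation, report {cfg['metric']}"

for name in TASKS:
    print(evaluate(name))
```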
Usage Examples
Evaluate with Dense Retrieval
# Run from command line
python research/llm_embedder/evaluation/eval_icl.py \
--eval_data llm-embedder:icl/icl/test.json \
--corpus llm-embedder:icl/icl/corpus.json \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--query_encoder BAAI/llm-embedder \
--retrieval_method dense \
--few_shot 8 \
--context_max_length 1024 \
--output_dir data/results/icl \
--lm_batch_size 4
Evaluate Specific Tasks
from research.llm_embedder.evaluation.eval_icl import main
import sys
# Only evaluate NLI tasks, zero-shot; --corpus is omitted since no retrieval is used
sys.argv = [
    'eval_icl.py',
    '--task_names', 'rte', 'snli', 'mnli_m',
    '--eval_data', 'data/icl/test.json',
    '--model_name_or_path', 'meta-llama/Llama-2-7b-hf',
    '--retrieval_method', 'no',
    '--few_shot', '0',  # zero-shot
]
main()
Compare Retrieval Methods
import sys
from research.llm_embedder.evaluation.eval_icl import main
# Evaluate every retrieval method with the same model and data
methods = ['dense', 'bm25', 'random', 'same-task-random', 'no']
for method in methods:
    sys.argv = [
        'eval_icl.py',
        '--eval_data', 'data/icl/test.json',
        '--corpus', 'data/icl/corpus.json',
        '--model_name_or_path', 'meta-llama/Llama-2-7b-hf',
        '--retrieval_method', method,
        '--few_shot', '8',
        '--output_dir', f'results/{method}',
    ]
    main()
    # Results saved to results/{method}/
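Once all runs finish, the per-method results can be gathered for side-by-side comparison. The metrics file name ("metrics.json") is an assumption; check what the script actually writes into output_dir.

```python
import json
import pathlib

def collect(methods, root="results"):
    """Read each method's metrics file from its output directory,
    skipping runs that produced no file."""
    comparison = {}
    for method in methods:
        path = pathlib.Path(root) / method / "metrics.json"
        if path.exists():
            comparison[method] = json.loads(path.read_text())
    return comparison

print(collect(["dense", "bm25", "random", "same-task-random", "no"]))
```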