Implementation: FlagOpen FlagEmbedding LLM Embedder Eval ICL
| Knowledge Sources | Details |
|---|---|
| Domains | Natural Language Processing, In-Context Learning, Model Evaluation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation framework for measuring language model performance on in-context learning tasks across 30 diverse datasets spanning 9 task categories.
Description
This module evaluates language models on 30 in-context learning (ICL) tasks organized into 9 categories: Closed-book QA (CQA), Commonsense Reasoning, Coreference Resolution, Paraphrase Detection, Natural Language Inference (NLI), Reading Comprehension, Sentiment Analysis, Data-to-Text Generation, and Summarization.
The evaluation supports two modes: perplexity-based (for classification tasks) and generation-based (for open-ended tasks). It integrates with retrieval systems to provide few-shot examples (0-8 shots) retrieved via dense retrieval, BM25, or random selection. Tasks include ARC, HellaSwag, COPA, WinoGrande, MRPC, SNLI, SQuAD, MultiRC, SST-2, CommonGen, and Gigaword.
The framework supports multiple retrieval methods (dense, BM25, random, same-task-random, or no retrieval), computes task-specific metrics (accuracy, F1, exact match, ROUGE-L), and aggregates results by category and overall average.
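To make the perplexity-based mode concrete, here is a minimal sketch of how an answer is selected for a classification task: the LM scores each candidate continuation, and the one with the lowest perplexity (highest likelihood) wins. The log-probabilities below are toy values, and `pick_answer` is an illustrative helper, not part of the module's API.

```python
import math
from typing import Dict, List

def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the answer tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_answer(candidates: Dict[str, List[float]]) -> str:
    """Choose the candidate the LM finds most likely (lowest perplexity),
    as in perplexity-based classification."""
    return min(candidates, key=lambda c: perplexity(candidates[c]))

# Toy per-token log-probs for two answer options of a COPA-style item.
scores = {
    "because it was raining": [-0.2, -0.1, -0.3, -0.2],
    "because it was sunny":   [-1.5, -2.0, -1.8, -1.2],
}
print(pick_answer(scores))  # -> "because it was raining"
```

Generation-based tasks skip this ranking step entirely: the model generates free-form text, which is then scored with exact match or ROUGE-L.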
Usage
Use this module to evaluate how well language models leverage in-context examples for diverse NLP tasks, to compare different retrieval strategies for ICL example selection, or to benchmark model performance across multiple reasoning and generation capabilities.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/eval_icl.py
- Lines: 1-355
Signature
def main():
    """Main evaluation function for ICL tasks."""

def load_test_data(
    knn_inxs, test_data, corpus_data,
    filter_diff_task: bool = False, example_num: int = 8,
    same_task_random: bool = False
) -> Dict[str, List]:
    """Load test data with retrieved few-shot examples."""
Import
from research.llm_embedder.evaluation.eval_icl import main, load_test_data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Test data JSON file path |
| corpus | str | Yes | Corpus for retrieval |
| model_name_or_path | str | Yes | LM to evaluate |
| retrieval_method | str | No | dense/bm25/random/same-task-random/no (default: dense) |
| few_shot | int | No | Number of examples (default: 8) |
| context_max_length | int | No | Max context length (default: 1024) |
Outputs
| Name | Type | Description |
|---|---|---|
| metrics | dict | Per-category and overall scores |
| results | dict | Per-task detailed results saved to output_dir |
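As an illustration of the aggregation described above, the sketch below averages hypothetical per-task scores into per-category metrics and an overall average. The exact output schema is an assumption; the real script writes per-task results to output_dir.

```python
from statistics import mean

# Hypothetical per-task scores keyed by (category, task).
task_scores = {
    ("NLI", "rte"): 0.71, ("NLI", "snli"): 0.55,
    ("Sentiment", "sst2"): 0.93, ("Sentiment", "yelp"): 0.90,
}

# Group scores by category, then average within and across categories.
by_category = {}
for (category, _task), score in task_scores.items():
    by_category.setdefault(category, []).append(score)

metrics = {cat: round(mean(v), 4) for cat, v in by_category.items()}
metrics["average"] = round(mean(metrics.values()), 4)
print(metrics)  # per-category scores plus the overall average
```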
Task Categories
Supported Tasks
# 9 Categories, 30 Tasks:
CQA = {
"arc_c": {'method': 'perplexity', 'metric': 'acc'},
"arc_e": {'method': 'perplexity', 'metric': 'acc'},
"natural_questions": {'method': 'generation', 'metric': 'em'}
}
Commonsense = {
"copa": {'method': 'perplexity', 'metric': 'acc'},
"hellaswag": {'method': 'perplexity', 'metric': 'acc'},
"piqa": {'method': 'perplexity', 'metric': 'acc'}
}
Coreference = {
"winogrande": {'method': 'perplexity', 'metric': 'acc'},
"wsc": {'method': 'perplexity', 'metric': 'acc'},
"wsc273": {'method': 'perplexity', 'metric': 'acc'}
}
Paraphrase = {
"mrpc": {'method': 'perplexity', 'metric': 'acc'},
"paws": {'method': 'perplexity', 'metric': 'acc'},
"qqp": {'method': 'perplexity', 'metric': 'acc'}
}
NLI = {
"rte": {'method': 'perplexity', 'metric': 'acc'},
"snli": {'method': 'perplexity', 'metric': 'acc'},
"mnli_m": {'method': 'perplexity', 'metric': 'acc'},
"mnli_mm": {'method': 'perplexity', 'metric': 'acc'},
"qnli": {'method': 'perplexity', 'metric': 'acc'}
}
ReadingComp = {
"multirc": {'method': 'perplexity', 'metric': 'f1'},
"openbookqa": {'method': 'perplexity', 'metric': 'acc'},
"boolq": {'method': 'perplexity', 'metric': 'acc'},
"squad_v1": {'method': 'generation', 'metric': 'em'}
}
Sentiment = {
"sentiment140": {'method': 'perplexity', 'metric': 'acc'},
"sst2": {'method': 'perplexity', 'metric': 'acc'},
"yelp": {'method': 'perplexity', 'metric': 'acc'}
}
Data2Text = {
"common_gen": {'method': 'generation', 'metric': 'rl'},
"e2e_nlg": {'method': 'generation', 'metric': 'rl'},
"dart": {'method': 'generation', 'metric': 'rl'}
}
Summarize = {
"aeslc": {'method': 'generation', 'metric': 'rl'},
"ag_news": {'method': 'perplexity', 'metric': 'acc'},
"gigaword": {'method': 'generation', 'metric': 'rl'}
}
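The registry entries above drive how each task is evaluated: the `method` field selects perplexity- or generation-based scoring, and `metric` selects the score to report. A minimal dispatch sketch over a three-task excerpt (the `evaluate` helper is illustrative, not the module's real API):

```python
# Excerpt of the task registry shown above.
TASKS = {
    "rte":      {"method": "perplexity", "metric": "acc"},
    "squad_v1": {"method": "generation", "metric": "em"},
    "gigaword": {"method": "generation", "metric": "rl"},
}

def evaluate(task_name: str) -> str:
    """Route a task to the scoring mode its registry entry specifies."""
    cfg = TASKS[task_name]
    if cfg["method"] == "perplexity":
        return f"{task_name}: rank options by perplexity, report {cfg['metric']}"
    return f"{task_name}: free-form generation, report {cfg['metric']}"

for name in TASKS:
    print(evaluate(name))
```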
Usage Examples
Evaluate with Dense Retrieval
# Run from command line
python research/llm_embedder/evaluation/eval_icl.py \
--eval_data llm-embedder:icl/icl/test.json \
--corpus llm-embedder:icl/icl/corpus.json \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--query_encoder BAAI/llm-embedder \
--retrieval_method dense \
--few_shot 8 \
--context_max_length 1024 \
--output_dir data/results/icl \
--lm_batch_size 4
Evaluate Specific Tasks
from research.llm_embedder.evaluation.eval_icl import main
import sys
# Only evaluate NLI tasks, zero-shot; --corpus is omitted since no retrieval is used
sys.argv = [
    'eval_icl.py',
    '--task_names', 'rte', 'snli', 'mnli_m',
    '--eval_data', 'data/icl/test.json',
    '--model_name_or_path', 'meta-llama/Llama-2-7b-hf',
    '--retrieval_method', 'no',
    '--few_shot', '0',  # zero-shot
]
main()
Compare Retrieval Methods
import sys
from research.llm_embedder.evaluation.eval_icl import main
# Evaluate every retrieval method with the same model and data
methods = ['dense', 'bm25', 'random', 'same-task-random', 'no']
for method in methods:
    sys.argv = [
        'eval_icl.py',
        '--eval_data', 'data/icl/test.json',
        '--corpus', 'data/icl/corpus.json',
        '--model_name_or_path', 'meta-llama/Llama-2-7b-hf',
        '--retrieval_method', method,
        '--few_shot', '8',
        '--output_dir', f'results/{method}',
    ]
    main()
    # Results saved to results/{method}/
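Once all runs finish, the per-method results can be gathered for side-by-side comparison. The metrics file name ("metrics.json") is an assumption; check what the script actually writes into output_dir.

```python
import json
import pathlib

def collect(methods, root="results"):
    """Read each method's metrics file from its output directory,
    skipping runs that produced no file."""
    comparison = {}
    for method in methods:
        path = pathlib.Path(root) / method / "metrics.json"
        if path.exists():
            comparison[method] = json.loads(path.read_text())
    return comparison

print(collect(["dense", "bm25", "random", "same-task-random", "no"]))
```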