
Implementation:FlagOpen FlagEmbedding LLM Embedder Eval ICL

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, In-Context Learning, Model Evaluation
Last Updated 2026-02-09 00:00 GMT

Overview

An evaluation framework for measuring language model performance on in-context learning tasks across 30 diverse datasets spanning 9 task categories.

Description

This module evaluates language models on 30 in-context learning (ICL) tasks organized into 9 categories: Closed-book QA (CQA), Commonsense reasoning, Coreference resolution, Paraphrase detection, Natural Language Inference (NLI), Reading Comprehension, Sentiment Analysis, Data-to-Text generation, and Summarization.

The evaluation supports two modes: perplexity-based (for classification tasks) and generation-based (for open-ended tasks). It integrates with retrieval systems to provide few-shot examples (0-8 shots) retrieved via dense retrieval, BM25, or random selection. Tasks include ARC, HellaSwag, COPA, WinoGrande, MRPC, SNLI, SQuAD, MultiRC, SST-2, CommonGen, and Gigaword.
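In perplexity mode, the model scores every candidate answer and the lowest-loss option wins. A minimal sketch of that selection rule, with a toy stand-in for the language model's per-token loss (the real module derives this from the LM's logits):

```python
import math

def pick_by_perplexity(candidates, nll_per_token):
    """Choose the candidate completion with the lowest mean
    per-token negative log-likelihood (i.e., lowest perplexity).

    `nll_per_token` is any callable mapping a string to its mean
    per-token NLL under a language model (a toy one here)."""
    scores = {c: nll_per_token(c) for c in candidates}
    return min(scores, key=scores.get)

# Toy stand-in for a real LM: longer strings get higher loss.
toy_nll = lambda s: math.log(1 + len(s))
print(pick_by_perplexity(["yes", "no, definitely not"], toy_nll))  # -> yes
```

Generation mode instead decodes free text and compares it to the reference with EM or ROUGE-L.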

The framework supports multiple retrieval methods (dense, BM25, random, same-task-random, or no retrieval), computes task-specific metrics (accuracy, F1, exact match, ROUGE-L), and aggregates results by category and overall average.
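Aggregation can be sketched as a macro average: per-task scores are averaged within each category, and the category means are then averaged into the overall number (whether the module macro- or micro-averages across categories is an assumption here):

```python
from collections import defaultdict
from statistics import mean

def aggregate(task_scores, task_to_category):
    """Macro-average: mean score per category, then the mean of
    category means as the overall score."""
    by_cat = defaultdict(list)
    for task, score in task_scores.items():
        by_cat[task_to_category[task]].append(score)
    summary = {cat: mean(scores) for cat, scores in by_cat.items()}
    summary["overall"] = mean(summary.values())
    return summary

scores = {"rte": 0.6, "snli": 0.8, "copa": 0.7}
cats = {"rte": "NLI", "snli": "NLI", "copa": "Commonsense"}
print(aggregate(scores, cats))
```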

Usage

Use this module to evaluate how well language models leverage in-context examples for diverse NLP tasks, to compare different retrieval strategies for ICL example selection, or to benchmark model performance across multiple reasoning and generation capabilities.

Code Reference

Source Location

Signature

def main():
    """Main evaluation function for ICL tasks"""

def load_test_data(
    knn_inxs, test_data, corpus_data,
    filter_diff_task: bool = False, example_num: int = 8,
    same_task_random: bool = False
) -> Dict[str, List]:
    """Load test data with retrieved few-shot examples"""

Import

from research.llm_embedder.evaluation.eval_icl import main, load_test_data

I/O Contract

Inputs

Name Type Required Description
eval_data str Yes Test data JSON file path
corpus str Yes Corpus for retrieval
model_name_or_path str Yes LM to evaluate
retrieval_method str No dense/bm25/random/same-task-random/no (default: dense)
few_shot int No Number of examples (default: 8)
context_max_length int No Max context length (default: 1024)

Outputs

Name Type Description
metrics dict Per-category and overall scores
results dict Per-task detailed results saved to output_dir

Task Categories

Supported Tasks

# 9 Categories, 30 Tasks:

CQA = {
    "arc_c": {'method': 'perplexity', 'metric': 'acc'},
    "arc_e": {'method': 'perplexity', 'metric': 'acc'},
    "natural_questions": {'method': 'generation', 'metric': 'em'}
}

Commonsense = {
    "copa": {'method': 'perplexity', 'metric': 'acc'},
    "hellaswag": {'method': 'perplexity', 'metric': 'acc'},
    "piqa": {'method': 'perplexity', 'metric': 'acc'}
}

Coreference = {
    "winogrande": {'method': 'perplexity', 'metric': 'acc'},
    "wsc": {'method': 'perplexity', 'metric': 'acc'},
    "wsc273": {'method': 'perplexity', 'metric': 'acc'}
}

Paraphrase = {
    "mrpc": {'method': 'perplexity', 'metric': 'acc'},
    "paws": {'method': 'perplexity', 'metric': 'acc'},
    "qqp": {'method': 'perplexity', 'metric': 'acc'}
}

NLI = {
    "rte": {'method': 'perplexity', 'metric': 'acc'},
    "snli": {'method': 'perplexity', 'metric': 'acc'},
    "mnli_m": {'method': 'perplexity', 'metric': 'acc'},
    "mnli_mm": {'method': 'perplexity', 'metric': 'acc'},
    "qnli": {'method': 'perplexity', 'metric': 'acc'}
}

ReadingComp = {
    "multirc": {'method': 'perplexity', 'metric': 'f1'},
    "openbookqa": {'method': 'perplexity', 'metric': 'acc'},
    "boolq": {'method': 'perplexity', 'metric': 'acc'},
    "squad_v1": {'method': 'generation', 'metric': 'em'}
}

Sentiment = {
    "sentiment140": {'method': 'perplexity', 'metric': 'acc'},
    "sst2": {'method': 'perplexity', 'metric': 'acc'},
    "yelp": {'method': 'perplexity', 'metric': 'acc'}
}

Data2Text = {
    "common_gen": {'method': 'generation', 'metric': 'rl'},
    "e2e_nlg": {'method': 'generation', 'metric': 'rl'},
    "dart": {'method': 'generation', 'metric': 'rl'}
}

Summarize = {
    "aeslc": {'method': 'generation', 'metric': 'rl'},
    "ag_news": {'method': 'perplexity', 'metric': 'acc'},
    "gigaword": {'method': 'generation', 'metric': 'rl'}
}
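A small helper (a sketch; the module's own dispatch may differ) resolves a task name to its evaluation method and metric across the category dicts above:

```python
def task_config(groups, task_name):
    """Return the {'method', 'metric'} entry for `task_name`,
    searching each category dict in turn."""
    for group in groups:
        if task_name in group:
            return group[task_name]
    raise KeyError(f"unknown task: {task_name}")

# Two of the category dicts above, repeated here for a runnable example.
NLI = {"rte": {"method": "perplexity", "metric": "acc"}}
Summarize = {"gigaword": {"method": "generation", "metric": "rl"}}
print(task_config([NLI, Summarize], "gigaword"))
```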

Usage Examples

Evaluate with Dense Retrieval

# Run from command line
python research/llm_embedder/evaluation/eval_icl.py \
    --eval_data llm-embedder:icl/icl/test.json \
    --corpus llm-embedder:icl/icl/corpus.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --query_encoder BAAI/llm-embedder \
    --retrieval_method dense \
    --few_shot 8 \
    --context_max_length 1024 \
    --output_dir data/results/icl \
    --lm_batch_size 4

Evaluate Specific Tasks

from research.llm_embedder.evaluation.eval_icl import main
import sys

# Only evaluate NLI tasks
sys.argv = [
    'eval_icl.py',
    '--task_names', 'rte', 'snli', 'mnli_m',
    '--eval_data', 'data/icl/test.json',
    '--model_name_or_path', 'meta-llama/Llama-2-7b-hf',
    '--retrieval_method', 'no',
    '--few_shot', '0'  # Zero-shot
]

main()

Compare Retrieval Methods

# Evaluate multiple retrieval methods
import sys
from research.llm_embedder.evaluation.eval_icl import main

methods = ['dense', 'bm25', 'random', 'same-task-random', 'no']

for method in methods:
    sys.argv = [
        'eval_icl.py',
        '--eval_data', 'data/icl/test.json',
        '--corpus', 'data/icl/corpus.json',
        '--model_name_or_path', 'meta-llama/Llama-2-7b-hf',
        '--retrieval_method', method,
        '--few_shot', '8',
        '--output_dir', f'results/{method}'
    ]
    main()
    # Results saved to results/{method}/
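After the loop, the per-method output directories can be compared by reading the saved scores back. The metrics filename and the `overall` key below are assumptions about the on-disk format, not guaranteed by eval_icl:

```python
import json
import pathlib

def collect_overall(result_root, methods, metrics_file="metrics.json"):
    """Read each method's overall score from <result_root>/<method>/<metrics_file>.
    Missing files are skipped, so partial runs still compare cleanly."""
    scores = {}
    for method in methods:
        path = pathlib.Path(result_root) / method / metrics_file
        if path.exists():
            scores[method] = json.loads(path.read_text()).get("overall")
    return scores
```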
