Implementation:FlagOpen FlagEmbedding LLM Embedder ICL Utils
| Knowledge Sources | |
|---|---|
| Domains | In_Context_Learning, Evaluation_Metrics, Natural_Language_Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Utility functions for in-context learning evaluation including text normalization, QA metrics (EM/F1), and ROUGE scoring.
Description
This module provides evaluation utilities for in-context learning (ICL) tasks:
- Text normalization: The _normalize_answer() function applies SQuAD-style normalization (lowercasing, article removal, punctuation handling, whitespace normalization) so that predictions and ground truths are compared fairly.
- QA metrics: Exact match and token-level F1, with support for multiple ground truths per question. The qa_metrics() function scores each prediction against every valid answer and keeps the maximum.
- ROUGE metrics: A wrapper around the rouge library that computes ROUGE-1, ROUGE-2, and ROUGE-L scores for assessing generation quality.
- Additional metrics: Simple accuracy for classification and macro F1 for binary classification tasks.
- ICL-specific functions: flat_options() and perplexity_to_choice() handle multiple-choice evaluation by converting between option lists and perplexity-based selection. _llm_generation_func() and _llm_perplexity_func() prepare ICL prompts, selecting few-shot examples dynamically to fit a token budget.
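The normalization and QA scoring described above can be sketched in plain Python. This is a minimal illustration of the standard SQuAD-style recipe, not the module's actual code: the names normalize_answer, exact_match, token_f1, and qa_scores are hypothetical, and the sketch normalizes inside each metric, which the real qa_metrics() may or may not do.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(target: str, prediction: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(target) == normalize_answer(prediction))


def token_f1(target: str, prediction: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    target_tokens = normalize_answer(target).split()
    pred_tokens = normalize_answer(prediction).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(target_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_scores(targets, predictions):
    """Average, over questions, the best score against any accepted answer."""
    if len(targets) != len(predictions):
        raise ValueError("Number of targets and predictions must match.")
    em = [max(exact_match(t, p) for t in ts) for ts, p in zip(targets, predictions)]
    f1 = [max(token_f1(t, p) for t in ts) for ts, p in zip(targets, predictions)]
    return sum(em) / len(em), sum(f1) / len(f1)
```

Taking the maximum over accepted answers means a prediction is not penalized for matching only one of several valid phrasings.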
Usage
Use these utilities for evaluating in-context learning performance on QA, generation, and classification tasks with standard metrics.
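For classification-style tasks scored by perplexity, and for prompts limited by a token budget, the general idea can be sketched as follows. All three function names and their signatures are hypothetical stand-ins, not the signatures of flat_options(), perplexity_to_choice(), or the prompt-building helpers. The sketch assumes the option with the lowest perplexity is chosen and uses whitespace splitting in place of a real tokenizer.

```python
from typing import List, Sequence, Tuple


def flatten_options(options: Sequence[Sequence[str]]) -> Tuple[List[str], List[int]]:
    """Flatten per-question option lists, remembering how many each question had."""
    flat = [opt for opts in options for opt in opts]
    counts = [len(opts) for opts in options]
    return flat, counts


def choose_by_perplexity(perplexities: Sequence[float], counts: Sequence[int]) -> List[int]:
    """For each question, return the index of its lowest-perplexity option."""
    choices, start = [], 0
    for n in counts:
        group = list(perplexities[start:start + n])
        choices.append(group.index(min(group)))
        start += n
    return choices


def select_few_shot(examples: Sequence[str], query: str, max_tokens: int) -> List[str]:
    """Keep examples, in order, while query plus examples fit within max_tokens."""
    def count(text: str) -> int:
        # Assumption: whitespace tokens approximate the model's token count.
        return len(text.split())

    used = count(query)
    kept = []
    for example in examples:
        cost = count(example)
        if used + cost > max_tokens:
            break
        kept.append(example)
        used += cost
    return kept
```

Flattening lets every option be scored in a single batch; the stored counts are what allow the flat perplexity list to be regrouped per question afterwards.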
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/icl_utils.py
- Lines: 1-296
Signature
```python
def normalize_squad(answer)
def qa_metrics(targets, predictions, return_list=False)
def rouge(preds, labels, return_list=False)
def compute_metrics(metric, labels, preds)
def compute_scores(metric, preds, labels)
```
Import
```python
from research.llm_embedder.evaluation.icl_utils import compute_metrics, qa_metrics, rouge
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| targets | List[List[str]] | Yes | List of answer lists (multiple answers per question) |
| predictions | List[str] | Yes | Model predictions |
| metric | str | Yes | Metric name: "em", "f1", "acc", "rl" (ROUGE-L) |
| labels | List | Yes | Ground truth labels |
| preds | List | Yes | Predictions to evaluate |
Outputs
| Name | Type | Description |
|---|---|---|
| em | float | Exact match score (0-1) |
| f1 | float | Token-level F1 score (0-1) |
| rouge scores | Dict | Dictionary with r1, r2, rl scores |
| metrics | Dict | Dictionary with requested metric scores |
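To make the rl score concrete, here is a minimal sketch of a sentence-level ROUGE-L F-measure based on the longest common subsequence (LCS). It is a simplified illustration: the helper names are hypothetical, tokenization is plain lowercased whitespace splitting, and the rouge library may differ in tokenization, sentence handling, and weighting, so its numbers need not match.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """F-measure of LCS-based precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike ROUGE-1 and ROUGE-2, which count matching n-grams, the LCS rewards tokens that appear in the same order without requiring them to be adjacent.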
Usage Examples
```python
from research.llm_embedder.evaluation.icl_utils import (
    compute_metrics, normalize_squad, qa_metrics, rouge
)

# Exact match and F1 for QA; each question may have several accepted answers.
targets = [["Paris", "paris"], ["London"]]
predictions = ["paris", "london"]
# Normalize both sides first so the comparison does not depend on casing.
targets = [[normalize_squad(t) for t in ts] for ts in targets]
predictions = [normalize_squad(p) for p in predictions]
em, f1 = qa_metrics(targets, predictions)
print(f"EM: {em:.3f}, F1: {f1:.3f}")  # every prediction matches an accepted answer

# General metrics computation: returns a dictionary keyed by metric name
metrics = compute_metrics("em", labels=targets, preds=predictions)

# ROUGE scores for generation
preds = ["The cat sat on the mat.", "Dogs are loyal pets."]
labels = ["A cat was sitting on a mat.", "Dogs are very loyal animals."]
r1, r2, rl = rouge(preds, labels)
print(f"ROUGE-L: {rl:.3f}")  # exact value depends on the rouge library's tokenization

# Simple accuracy: 2 of the 3 predictions below match their labels
labels = ["yes", "no", "yes"]
preds = ["yes", "yes", "yes"]
metrics = compute_metrics("acc", labels=labels, preds=preds)
```
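For reference, simple accuracy and macro F1 can be sketched in a few lines of plain Python. The function names below are hypothetical and the module's own implementation may differ, for example in how it scales scores or which classes it averages over. This sketch averages F1 over every class seen in either the labels or the predictions.

```python
def simple_accuracy(preds, labels):
    """Fraction of predictions that equal their label."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


def macro_f1(preds, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(p == cls and l == cls for p, l in zip(preds, labels))
        fp = sum(p == cls and l != cls for p, l in zip(preds, labels))
        fn = sum(p != cls and l == cls for p, l in zip(preds, labels))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["yes", "no", "yes"]
preds = ["yes", "yes", "yes"]
print(f"accuracy: {simple_accuracy(preds, labels):.3f}")  # accuracy: 0.667
print(f"macro F1: {macro_f1(preds, labels):.3f}")         # macro F1: 0.400
```

The gap between the two numbers shows why macro F1 is useful on imbalanced data: always predicting "yes" keeps accuracy fairly high but scores zero on the "no" class.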