Implementation:FlagOpen FlagEmbedding LLM Embedder ICL Utils

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	In_Context_Learning, Evaluation_Metrics, Natural_Language_Processing
Last Updated	2026-02-09 00:00 GMT

Overview

Utility functions for in-context learning (ICL) evaluation, including text normalization, QA metrics (EM/F1), and ROUGE scoring.

Description

This module provides evaluation utilities for in-context learning tasks, grouped as follows:

Text normalization: The _normalize_answer() function handles SQuAD-style normalization (lowercase, article removal, punctuation handling, whitespace normalization) to enable fair comparison of predictions and ground truth.

QA metrics: Implements exact match and token-level F1 score computation with support for multiple ground truths per question. The qa_metrics() function computes maximum scores across all valid answers.

ROUGE metrics: Wrapper around the rouge library for computing ROUGE-1, ROUGE-2, and ROUGE-L scores for generation quality assessment.

Additional metrics: Simple accuracy for classification and macro F1 for binary classification tasks.

ICL-specific functions: flat_options() and perplexity_to_choice() handle multiple-choice evaluation by converting between option lists and perplexity-based selection. _llm_generation_func() and _llm_perplexity_func() prepare ICL prompts with dynamic few-shot example selection based on token budget.
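
The multiple-choice handling described above can be sketched as follows. This is a minimal illustration, not the module's code: the names flatten_options_sketch and choose_by_perplexity_sketch are hypothetical, and the sketch assumes that the option with the lowest perplexity is taken as the prediction. The real flat_options() and perplexity_to_choice() may use different signatures and return types.

```python
from itertools import chain
from typing import List, Sequence


def flatten_options_sketch(options: Sequence[Sequence[str]]) -> List[str]:
    """Flatten per-question option lists into one list for batched scoring."""
    return list(chain.from_iterable(options))


def choose_by_perplexity_sketch(
    options: Sequence[Sequence[str]],
    flat_perplexities: Sequence[float],
) -> List[int]:
    """Regroup flat perplexities per question and pick the lowest one.

    Assumption: a lower perplexity means the model prefers that option.
    """
    total = sum(len(opts) for opts in options)
    if total != len(flat_perplexities):
        raise ValueError("expected one perplexity per flattened option")

    choices = []
    offset = 0
    for opts in options:
        n = len(opts)
        group = flat_perplexities[offset:offset + n]
        # Index of the option with the smallest perplexity in this group.
        choices.append(min(range(n), key=group.__getitem__))
        offset += n
    return choices


options = [["yes", "no"], ["A", "B", "C"]]
perplexities = [3.2, 1.5, 9.0, 4.0, 4.5]  # made-up scores, one per flattened option
print(flatten_options_sketch(options))
print(choose_by_perplexity_sketch(options, perplexities))
```

Flattening first lets every option be scored in a single batch; the per-question lengths are enough to recover the grouping afterwards.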

Usage

Use these utilities for evaluating in-context learning performance on QA, generation, and classification tasks with standard metrics.
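
The description mentions few-shot example selection under a token budget. One plausible way to do this is sketched below, purely as an illustration. The function build_prompt_sketch, the whitespace token counter, and the stop-at-first-overflow policy are all assumptions; the module's _llm_generation_func() and _llm_perplexity_func() presumably count tokens with the model's tokenizer and may order or truncate examples differently.

```python
from typing import Callable, List


def count_tokens_whitespace(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated tokens."""
    return len(text.split())


def build_prompt_sketch(
    examples: List[str],
    query: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
    sep: str = "\n\n",
) -> str:
    """Prepend few-shot examples to the query until the budget would be exceeded.

    The query is always kept; examples are added in the given order and
    selection stops at the first example that does not fit.
    """
    used = count_tokens(query)
    kept: List[str] = []
    for example in examples:
        cost = count_tokens(example)
        if used + cost > max_tokens:
            break
        kept.append(example)
        used += cost
    return sep.join(kept + [query])


examples = ["Q: a b\nA: c", "Q: d e f g\nA: h", "Q: i\nA: j"]
query = "Q: k l\nA:"
# With a budget of 10 whitespace tokens, only the first example fits.
print(build_prompt_sketch(examples, query, max_tokens=10))
```

The separator costs nothing under the whitespace counter; with a real tokenizer it would need to be included in the budget.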

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/llm_embedder/evaluation/icl_utils.py
Lines: 1-296

Signature

def normalize_squad(answer)
def qa_metrics(targets, predictions, return_list=False)
def rouge(preds, labels, return_list=False)
def compute_metrics(metric, labels, preds)
def compute_scores(metric, preds, labels)

Import

from research.llm_embedder.evaluation.icl_utils import compute_metrics, qa_metrics, rouge

I/O Contract

Inputs

Name	Type	Required	Description
targets	List[List[str]]	Yes	List of answer lists (multiple answers per question)
predictions	List[str]	Yes	Model predictions
metric	str	Yes	Metric name: "em", "f1", "acc", "rl" (ROUGE-L)
labels	List	Yes	Ground truth labels
preds	List	Yes	Predictions to evaluate

Outputs

Name	Type	Description
em	float	Exact match score (0-1)
f1	float	Token-level F1 score (0-1)
rouge scores	Dict	Dictionary with r1, r2, rl scores
metrics	Dict	Dictionary with requested metric scores
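
To make the em and f1 outputs concrete, the following sketch shows how SQuAD-style exact match and token-level F1 are commonly computed, taking the maximum over all ground truths for each question. The function names ending in _sketch are hypothetical and the code is not taken from icl_utils.py; details such as score scaling or the return type of the real qa_metrics() may differ.

```python
import re
import string
from collections import Counter
from typing import List, Sequence, Tuple

_PUNCTUATION = set(string.punctuation)


def normalize_sketch(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_sketch(target: str, prediction: str) -> float:
    return float(normalize_sketch(target) == normalize_sketch(prediction))


def token_f1_sketch(target: str, prediction: str) -> float:
    target_tokens = normalize_sketch(target).split()
    pred_tokens = normalize_sketch(prediction).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(target_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_metrics_sketch(
    targets: Sequence[Sequence[str]], predictions: Sequence[str]
) -> Tuple[float, float]:
    """Average, over questions, the best EM and best F1 across ground truths."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    em_scores: List[float] = []
    f1_scores: List[float] = []
    for answers, prediction in zip(targets, predictions):
        em_scores.append(max(exact_match_sketch(a, prediction) for a in answers))
        f1_scores.append(max(token_f1_sketch(a, prediction) for a in answers))
    n = len(predictions)
    return sum(em_scores) / n, sum(f1_scores) / n


em, f1 = qa_metrics_sketch(
    [["Paris", "paris"], ["the city of London"]],
    ["paris.", "London"],
)
print(f"{em:.2f} {f1:.2f}")  # 0.50 0.75
```

In the second question the prediction "London" shares one of three normalized target tokens, giving precision 1, recall 1/3, and F1 0.5, while exact match is 0.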

Usage Examples

from research.llm_embedder.evaluation.icl_utils import compute_metrics, qa_metrics

# Exact match and F1 for QA
targets = [["Paris", "paris"], ["London"]]
predictions = ["paris", "london"]
em, f1 = qa_metrics(targets, predictions)
print(f"EM: {em:.3f}, F1: {f1:.3f}")  # EM: 1.000, F1: 1.000

# ROUGE scores for generation
from research.llm_embedder.evaluation.icl_utils import rouge
preds = ["The cat sat on the mat.", "Dogs are loyal pets."]
labels = ["A cat was sitting on a mat.", "Dogs are very loyal animals."]
r1, r2, rl = rouge(preds, labels)
print(f"ROUGE-L: {rl:.3f}")  # exact value depends on the rouge package's tokenization

# General metrics computation
metrics = compute_metrics("em", labels=targets, preds=predictions)
# {"em": 1.0}

# Simple accuracy
labels = ["yes", "no", "yes"]
preds = ["yes", "yes", "yes"]
metrics = compute_metrics("acc", labels=labels, preds=preds)
# {"acc": 0.667}  (2 of 3 predictions match; value shown rounded)
