Implementation:FlagOpen FlagEmbedding LLM Embedder ICL Utils

From Leeroopedia
Revision as of 14:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/FlagOpen_FlagEmbedding_LLM_Embedder_ICL_Utils.md)


Knowledge Sources
Domains In_Context_Learning, Evaluation_Metrics, Natural_Language_Processing
Last Updated 2026-02-09 00:00 GMT

Overview

Utility functions for in-context learning evaluation including text normalization, QA metrics (EM/F1), and ROUGE scoring.

Description

This module provides evaluation utilities for in-context learning (ICL) tasks:

Text normalization: The _normalize_answer() function handles SQuAD-style normalization (lowercase, article removal, punctuation handling, whitespace normalization) to enable fair comparison of predictions and ground truth.

QA metrics: Implements exact match (EM) and token-level F1 with support for multiple ground truths per question. For each question, qa_metrics() takes the maximum score over all valid answers.

ROUGE metrics: Wrapper around the rouge library for computing ROUGE-1, ROUGE-2, and ROUGE-L scores for generation quality assessment.

Additional metrics: Simple accuracy for classification and macro F1 for binary classification tasks.

ICL-specific functions: flat_options() and perplexity_to_choice() handle multiple-choice evaluation by converting between option lists and perplexity-based selection. _llm_generation_func() and _llm_perplexity_func() prepare ICL prompts with dynamic few-shot example selection based on token budget.
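
The multiple-choice and few-shot handling described above can be illustrated with a minimal sketch. It assumes that options are flattened into one list for scoring, that the option with the lowest perplexity is chosen, and that few-shot examples are kept greedily until a token budget is exhausted. The function names, signatures, and the whitespace token counter are hypothetical stand-ins, not the module's actual API.

```python
from typing import Callable, List, Tuple


def flatten_options_sketch(options: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Flatten per-question option lists and remember each question's option count."""
    flat = [opt for opts in options for opt in opts]
    counts = [len(opts) for opts in options]
    return flat, counts


def choose_by_perplexity_sketch(perplexities: List[float], counts: List[int]) -> List[int]:
    """Regroup flat perplexities per question and pick the lowest-perplexity index."""
    choices, start = [], 0
    for n in counts:
        group = perplexities[start:start + n]
        choices.append(min(range(n), key=group.__getitem__))
        start += n
    return choices


def select_few_shots_sketch(
    examples: List[str],
    query: str,
    budget: int,
    # Assumption: whitespace splitting stands in for a real tokenizer.
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Keep examples in order while query plus examples fit within the budget."""
    used = count_tokens(query)
    kept = []
    for example in examples:
        cost = count_tokens(example)
        if used + cost > budget:
            break
        kept.append(example)
        used += cost
    return kept


flat, counts = flatten_options_sketch([["A", "B"], ["C", "D", "E"]])
print(choose_by_perplexity_sketch([3.2, 1.5, 9.0, 4.0, 2.0], counts))  # [1, 2]
print(select_few_shots_sketch(["a b c", "d e", "f g h i"], "q r", budget=8))  # ['a b c', 'd e']
```

Whether the real functions stop at the first example that overflows the budget or skip it and continue is not stated in this page; the sketch stops.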

Usage

Use these utilities for evaluating in-context learning performance on QA, generation, and classification tasks with standard metrics.
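
For the classification case, simple accuracy and macro F1 can be sketched in plain Python. This is an illustrative sketch only: accuracy_sketch and macro_f1_sketch are hypothetical names, and the module's own implementation may rely on a library instead.

```python
from typing import Hashable, Sequence


def accuracy_sketch(labels: Sequence[Hashable], preds: Sequence[Hashable]) -> float:
    """Fraction of predictions that equal their label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def macro_f1_sketch(labels: Sequence[Hashable], preds: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every class seen in labels or preds."""
    classes = set(labels) | set(preds)
    f1_scores = []
    for cls in classes:
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = sum(l == cls and p != cls for l, p in zip(labels, preds))
        denom = 2 * tp + fp + fn
        # A class with no true positives contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


labels = ["yes", "no", "yes"]
preds = ["yes", "yes", "yes"]
print(round(accuracy_sketch(labels, preds), 3))  # 0.667
print(round(macro_f1_sketch(labels, preds), 3))  # 0.4
```

In this example the "yes" class scores an F1 of 0.8 and the never-predicted "no" class scores 0, so macro F1 (0.4) is much lower than accuracy (0.667).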

Code Reference

Source Location

Signature

def normalize_squad(answer)
def qa_metrics(targets, predictions, return_list=False)
def rouge(preds, labels, return_list=False)
def compute_metrics(metric, labels, preds)
def compute_scores(metric, preds, labels)
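
As an illustration of the SQuAD-style normalization that normalize_squad is described as performing, here is a minimal sketch. It assumes punctuation is removed outright and that only the English articles "a", "an", and "the" are dropped; squad_normalize_sketch is a hypothetical name, not the module's code.

```python
import re
import string


def squad_normalize_sketch(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, and collapse whitespace."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Drop the articles "a", "an", and "the" when they appear as whole words.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace into single spaces and trim both ends.
    return " ".join(text.split())


print(squad_normalize_sketch("The  Eiffel Tower, in Paris!"))  # eiffel tower in paris
```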

Import

from research.llm_embedder.evaluation.icl_utils import compute_metrics, qa_metrics, rouge

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| targets | List[List[str]] | Yes | List of answer lists (multiple answers per question) |
| predictions | List[str] | Yes | Model predictions |
| metric | str | Yes | Metric name: "em", "f1", "acc", "rl" (ROUGE-L) |
| labels | List | Yes | Ground truth labels |
| preds | List | Yes | Predictions to evaluate |
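
To show how a targets value of type List[List[str]] interacts with a flat predictions list, the following sketch computes exact match and token-level F1, taking the best score over each question's acceptable answers and averaging over questions. It assumes the strings are already normalized and that the lists are non-empty; qa_scores_sketch and its helpers are hypothetical names, and the averaging step is an assumption rather than something this page specifies.

```python
from collections import Counter
from typing import List, Tuple


def exact_match(pred: str, truth: str) -> float:
    return float(pred == truth)


def token_f1(pred: str, truth: str) -> float:
    pred_tokens, truth_tokens = pred.split(), truth.split()
    # Multiset intersection: a shared token counts min(count_pred, count_truth) times.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_scores_sketch(targets: List[List[str]], predictions: List[str]) -> Tuple[float, float]:
    """Average, over questions, of the best EM and best F1 among valid answers."""
    em_total = f1_total = 0.0
    for truths, pred in zip(targets, predictions):
        em_total += max(exact_match(pred, t) for t in truths)
        f1_total += max(token_f1(pred, t) for t in truths)
    n = len(predictions)
    return em_total / n, f1_total / n


targets = [["new york city", "nyc"], ["london"]]
predictions = ["new york", "london"]
em, f1 = qa_scores_sketch(targets, predictions)
print(f"EM: {em:.2f}, F1: {f1:.2f}")  # EM: 0.50, F1: 0.90
```

The first prediction matches neither answer exactly, but shares two of three tokens with "new york city", giving an F1 of 0.8 for that question.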

Outputs

| Name | Type | Description |
|------|------|-------------|
| em | float | Exact match score (0-1) |
| f1 | float | Token-level F1 score (0-1) |
| rouge scores | Dict | Dictionary with r1, r2, rl scores |
| metrics | Dict | Dictionary with requested metric scores |
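
The contract of a metric name going in and a dictionary keyed by that name coming out can be pictured as a small dispatcher. This is a sketch under stated assumptions: the registry, the helper functions, and the error handling are hypothetical, only "acc" and "em" are wired up, and the real compute_metrics may be organized differently.

```python
from typing import Callable, Dict, List


def _accuracy(labels: List[str], preds: List[str]) -> float:
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def _exact_match(labels: List[List[str]], preds: List[str]) -> float:
    # Simplified comparison: case-insensitive match against any acceptable answer.
    hits = sum(
        any(pred.strip().lower() == truth.strip().lower() for truth in truths)
        for truths, pred in zip(labels, preds)
    )
    return hits / len(preds)


# Hypothetical registry mapping metric names to scoring functions.
METRIC_REGISTRY_SKETCH: Dict[str, Callable[..., float]] = {
    "acc": _accuracy,
    "em": _exact_match,
}


def compute_metrics_sketch(metric: str, labels, preds) -> Dict[str, float]:
    """Return {metric: score}, or raise for a metric name that is not registered."""
    if metric not in METRIC_REGISTRY_SKETCH:
        raise ValueError(f"Unsupported metric: {metric!r}")
    return {metric: METRIC_REGISTRY_SKETCH[metric](labels, preds)}


print(compute_metrics_sketch("em", [["Paris", "paris"], ["London"]], ["paris", "london"]))
# {'em': 1.0}
```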

Usage Examples

from research.llm_embedder.evaluation.icl_utils import compute_metrics, qa_metrics, rouge

# Exact match and F1 for QA; each question may list several acceptable answers
targets = [["Paris", "paris"], ["London"]]
predictions = ["paris", "london"]
em, f1 = qa_metrics(targets, predictions)
print(f"EM: {em:.3f}, F1: {f1:.3f}")  # EM: 1.000, F1: 1.000

# ROUGE scores for generation
preds = ["The cat sat on the mat.", "Dogs are loyal pets."]
labels = ["A cat was sitting on a mat.", "Dogs are very loyal animals."]
r1, r2, rl = rouge(preds, labels)
print(f"ROUGE-L: {rl:.3f}")  # ROUGE-L: 0.756

# General metrics computation
metrics = compute_metrics("em", labels=targets, preds=predictions)
# {"em": 1.0}

# Simple accuracy: 2 of the 3 predictions match their labels
labels = ["yes", "no", "yes"]
preds = ["yes", "yes", "yes"]
metrics = compute_metrics("acc", labels=labels, preds=preds)
# {"acc": 0.667} (rounded to three decimals)
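
For intuition about what a ROUGE-L score measures, the sketch below computes a sentence-level ROUGE-L F-measure from the longest common subsequence (LCS) of whitespace tokens. It is not the rouge library's implementation: tokenization, sentence splitting, and the exact F-measure weighting there may differ, and rouge_l_f1_sketch is a hypothetical name.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1_sketch(pred: str, label: str) -> float:
    """Harmonic mean of LCS-based precision and recall over whitespace tokens."""
    pred_tokens, label_tokens = pred.split(), label.split()
    lcs = lcs_length(pred_tokens, label_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(label_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1_sketch("the cat sat on the mat", "the cat was on the mat")
print(f"{score:.3f}")  # 0.833
```

Here the LCS is "the cat on the mat" (5 tokens) against two 6-token sentences, so precision, recall, and F1 are all 5/6.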
