

Implementation:Microsoft DeepSpeedExamples Squad Evaluation

From Leeroopedia
Revision as of 15:42, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_DeepSpeedExamples_Squad_Evaluation.md)


Knowledge Sources
Domains: Natural Language Processing, Evaluation
Last Updated: 2026-02-07 12:00 GMT

Overview

Official evaluation script for SQuAD v1.1 that computes F1 score and exact match metrics between model predictions and ground-truth answers.

Description

This module implements the standard SQuAD v1.1 evaluation protocol. The core evaluate function loads a SQuAD dataset JSON file and a predictions JSON file, then iterates over every question-answer pair to compute two metrics: exact_match (whether the normalized prediction exactly equals any ground-truth answer) and F1 (token-level overlap between prediction and best-matching ground-truth answer). Both scores are averaged across all questions and reported as percentages.
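The two metrics described above can be written out explicitly. The notation here is introduced only for illustration: P and G are the multisets of normalized tokens in the prediction and in one ground-truth answer, p_i is the prediction for question i, A_i is its set of ground-truth answers, and N is the number of questions.

```latex
\[
\mathrm{precision} = \frac{|P \cap G|}{|P|}, \qquad
\mathrm{recall} = \frac{|P \cap G|}{|G|}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\quad (F_1 = 0 \text{ when } |P \cap G| = 0)
\]
\[
\mathrm{F1}_{\text{dataset}} = \frac{100}{N} \sum_{i=1}^{N} \max_{g \in A_i} F_1(p_i, g),
\qquad
\mathrm{EM}_{\text{dataset}} = \frac{100}{N} \sum_{i=1}^{N} \max_{g \in A_i} \mathbf{1}\left[\mathrm{norm}(p_i) = \mathrm{norm}(g)\right]
\]
% Worked example: P = {cat, sat}, G = {cat, sat, down}
% precision = 2/2 = 1, recall = 2/3
% F_1 = 2 * 1 * (2/3) / (1 + 2/3) = (4/3) / (5/3) = 0.8
```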

Answer normalization is handled by the normalize_answer function, which applies a four-step pipeline in order: lowercasing, punctuation removal, article removal (a, an, the), and whitespace normalization. The f1_score function computes token-level precision and recall from the intersection of two collections.Counter objects (a multiset overlap, so repeated tokens are counted), while exact_match_score checks string equality after normalization. The metric_max_over_ground_truths helper scores each prediction against every valid ground-truth answer and keeps the highest score.
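The helpers described above can be sketched as follows. This is a minimal reconstruction from the description on this page, not a copy of the module's source; the function names match the signatures listed under Code Reference, while the bodies are an assumption about how such a pipeline is typically written.

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    s = s.lower()
    punctuation = set(string.punctuation)
    s = "".join(ch for ch in s if ch not in punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def f1_score(prediction, ground_truth):
    """Token-level F1 between a prediction and one ground-truth answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    # Counter intersection keeps the minimum count of each shared token.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_score(prediction, ground_truth):
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    """Best score of the prediction over all acceptable answers."""
    return max(metric_fn(prediction, gt) for gt in ground_truths)
```

Taking the maximum over ground truths means a prediction is not penalized when annotators gave several differently worded answers to the same question.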

The module checks the dataset's version field against the expected version and logs a warning for each question that has no prediction; such questions receive a score of zero but still count toward the total. This is the same evaluation logic used by the official SQuAD leaderboard.
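The version check and the handling of unanswered questions can be illustrated with an in-memory variant of the loop. Everything named here is hypothetical: evaluate_in_memory, its em_fn and f1_fn parameters, the toy data, and the exact warning text are assumptions made to keep the sketch short, and printing a warning on a version mismatch (rather than raising) is also an assumption.

```python
import sys


def evaluate_in_memory(dataset_json, predictions, expected_version, em_fn, f1_fn):
    """Hypothetical in-memory sketch of the evaluation loop."""
    if dataset_json["version"] != expected_version:
        # Assumption: a mismatch is reported, not treated as fatal.
        print(f"Expected v-{expected_version}, got v-{dataset_json['version']}",
              file=sys.stderr)
    exact_match = f1 = 0.0
    total = 0
    for article in dataset_json["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                total += 1  # unanswered questions still count
                if qa["id"] not in predictions:
                    print(f"Unanswered question {qa['id']} will receive score 0.",
                          file=sys.stderr)
                    continue
                golds = [answer["text"] for answer in qa["answers"]]
                prediction = predictions[qa["id"]]
                exact_match += max(em_fn(prediction, g) for g in golds)
                f1 += max(f1_fn(prediction, g) for g in golds)
    return {"exact_match": 100.0 * exact_match / total,
            "f1": 100.0 * f1 / total}


def case_insensitive_equal(prediction, ground_truth):
    """Stand-in scoring function used only for this demonstration."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())


toy_dataset = {
    "version": "1.1",
    "data": [{"paragraphs": [{"qas": [
        {"id": "q1", "answers": [{"text": "Paris"}]},
        {"id": "q2", "answers": [{"text": "1889"}]},
    ]}]}],
}
toy_predictions = {"q1": "paris"}  # q2 is deliberately left unanswered

print(evaluate_in_memory(toy_dataset, toy_predictions, "1.1",
                         case_insensitive_equal, case_insensitive_equal))
```

Because the missing question is included in the denominator, one correct answer out of two questions yields 50 percent on both metrics in this toy run.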

Usage

Use this module to evaluate SQuAD question-answering model predictions. Call the evaluate function with the expected version string, the path to the SQuAD dataset JSON, and the path to the model predictions JSON to obtain exact match and F1 scores.

Code Reference

Source Location

Signature

def normalize_answer(s) -> str:
def f1_score(prediction, ground_truth) -> float:
def exact_match_score(prediction, ground_truth) -> bool:
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths) -> float:
def evaluate(expected_version, ds_file, pred_file) -> dict:

Import

from evaluate import evaluate, f1_score, exact_match_score, normalize_answer

I/O Contract

Inputs

Name | Type | Required | Description
expected_version | str | Yes | Expected SQuAD dataset version string (e.g., '1.1')
ds_file | str | Yes | Path to the SQuAD dataset JSON file
pred_file | str | Yes | Path to the predictions JSON file mapping question IDs to answer strings
prediction | str | Yes (for score functions) | Predicted answer text
ground_truth | str | Yes (for score functions) | Ground-truth answer text
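To make the two file inputs concrete, the sketch below writes a one-question dataset in the SQuAD v1.1 layout and a matching predictions file to a temporary directory. The file names, the sample passage, and the question ID q1 are made up for illustration.

```python
import json
import os
import tempfile

context = "The Eiffel Tower is in Paris."

# SQuAD v1.1 layout: version -> data -> paragraphs -> qas -> answers
dataset = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "q1",
                "question": "Where is the Eiffel Tower?",
                "answers": [{"text": "Paris",
                             "answer_start": context.index("Paris")}],
            }],
        }],
    }],
}

# Predictions file: a flat mapping from question ID to answer string.
predictions = {"q1": "Paris"}

workdir = tempfile.mkdtemp()
ds_file = os.path.join(workdir, "toy-dev-v1.1.json")      # hypothetical name
pred_file = os.path.join(workdir, "toy-predictions.json")  # hypothetical name

with open(ds_file, "w") as f:
    json.dump(dataset, f)
with open(pred_file, "w") as f:
    json.dump(predictions, f)

# These two paths are what would be passed as ds_file and pred_file.
print(ds_file, pred_file)
```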

Outputs

Name | Type | Description
result | dict | Dictionary with keys 'exact_match' and 'f1', each a float percentage (0-100)

Usage Examples

from evaluate import evaluate

# Evaluate model predictions against SQuAD v1.1 dataset
results = evaluate(
    expected_version='1.1',
    ds_file='data/dev-v1.1.json',
    pred_file='output/predictions.json'
)
print(f"Exact Match: {results['exact_match']:.2f}%")
print(f"F1 Score: {results['f1']:.2f}%")
