Implementation:Microsoft DeepSpeedExamples Squad Evaluation

Knowledge Sources	Microsoft_DeepSpeedExamples
Domains	Natural Language Processing, Evaluation
Last Updated	2026-02-07 12:00 GMT

Overview

Official evaluation script for SQuAD v1.1 that computes F1 score and exact match metrics between model predictions and ground-truth answers.

Description

This module implements the standard SQuAD v1.1 evaluation protocol. The core evaluate function loads a SQuAD dataset JSON file and a predictions JSON file, then iterates over every question in the dataset to compute two metrics: exact_match (whether the normalized prediction equals at least one normalized ground-truth answer) and F1 (token-level overlap between the prediction and its best-matching ground-truth answer). Both scores are averaged over all questions and reported as percentages.
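The two metrics can be written out as follows. The notation is introduced here for illustration and is not taken from the source file: P and G are the token multisets of the normalized prediction and of one normalized ground-truth answer, c_P(t) and c_G(t) are the counts of token t in each, A_q is the set of accepted answers for question q, p_q is the prediction for q, and N is the total number of questions in the dataset.

```latex
\[
n = \sum_{t} \min\bigl(c_P(t),\, c_G(t)\bigr), \qquad
\text{precision} = \frac{n}{|P|}, \qquad
\text{recall} = \frac{n}{|G|}
\]
\[
F_1(P, G) =
\begin{cases}
0 & \text{if } n = 0,\\[4pt]
\dfrac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} & \text{otherwise}
\end{cases}
\]
\[
\text{score}_m = \frac{100}{N} \sum_{q=1}^{N} \max_{g \in A_q} m(p_q, g),
\qquad m \in \{\text{exact\_match},\, F_1\}
\]
```

A question without a prediction contributes 0 to the sum but still counts toward N.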

Answer normalization is handled by the normalize_answer function, which applies a four-step pipeline: lowercasing, punctuation removal, article removal (a, an, the), and whitespace normalization. The f1_score function computes token-level precision and recall using Python Counter intersection, while exact_match_score checks string equality after normalization. The metric_max_over_ground_truths helper ensures each prediction is scored against the best of multiple valid ground-truth answers.
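The helpers described above can be sketched as follows. This is a minimal sketch modeled on the widely circulated SQuAD v1.1 evaluation script; the copy in the repository may differ in details such as the return type when there is no overlap.

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = s.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction, ground_truth):
    """Token-level F1 between a prediction and one ground-truth answer."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Counter intersection keeps the minimum count of each shared token.
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_score(prediction, ground_truth):
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    """Score the prediction against every accepted answer and keep the best."""
    return max(metric_fn(prediction, truth) for truth in ground_truths)


print(normalize_answer("The  Eiffel Tower!"))
print(exact_match_score("The Cat.", "cat"))
print(metric_max_over_ground_truths(f1_score, "cat sat", ["dog", "cat sat"]))
```

Because both sides pass through normalize_answer, differences in case, punctuation, articles, and spacing do not affect either metric.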

The module compares the dataset's version field against the expected version and logs a warning for every question that has no prediction; such questions receive a score of zero and still count toward the average. The logic follows the official SQuAD v1.1 evaluation script.
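The version check and the handling of unanswered questions can be illustrated with an in-memory sketch. The name evaluate_sketch, the warning wording, the empty-dataset guard, and the choice to take the scoring functions as parameters are all assumptions made to keep the example self-contained. The real evaluate reads both inputs from files and uses the module's own scoring functions.

```python
import sys


def evaluate_sketch(dataset_json, predictions, expected_version, em_fn, f1_fn):
    """Hypothetical in-memory variant of the evaluation loop."""
    if dataset_json.get("version") != expected_version:
        print(f"Expected dataset version {expected_version}, "
              f"got {dataset_json.get('version')}", file=sys.stderr)

    total = 0
    em_sum = 0.0
    f1_sum = 0.0
    for article in dataset_json["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                total += 1  # counted even when no prediction exists
                if qa["id"] not in predictions:
                    print(f"Unanswered question {qa['id']} will receive score 0.",
                          file=sys.stderr)
                    continue
                truths = [answer["text"] for answer in qa["answers"]]
                pred = predictions[qa["id"]]
                # Best score over all accepted answers for this question.
                em_sum += max(float(em_fn(pred, t)) for t in truths)
                f1_sum += max(float(f1_fn(pred, t)) for t in truths)

    if total == 0:
        raise ValueError("dataset contains no questions")
    return {"exact_match": 100.0 * em_sum / total, "f1": 100.0 * f1_sum / total}


toy_dataset = {"version": "1.1", "data": [{"paragraphs": [{"qas": [
    {"id": "q1", "answers": [{"text": "Paris"}]},
    {"id": "q2", "answers": [{"text": "1889"}]},
]}]}]}
toy_predictions = {"q1": "Paris"}  # q2 is deliberately left unanswered


def strict_equal(prediction, truth):
    """Stand-in metric for the demo: plain string equality."""
    return prediction == truth


result = evaluate_sketch(toy_dataset, toy_predictions, "1.1",
                         strict_equal, strict_equal)
print(result)
```

With one of two questions answered correctly and the other missing, both metrics come out at 50.0, which shows that a missing prediction lowers the average rather than being skipped.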

Usage

Use this module to evaluate SQuAD question-answering model predictions. Call the evaluate function with the expected version string, the path to the SQuAD dataset JSON, and the path to the model predictions JSON to obtain exact match and F1 scores.

Code Reference

Source Location

Repository: Microsoft_DeepSpeedExamples
File: training/BingBertSquad/evaluate.py
Lines: 1-85

Signature

def normalize_answer(s) -> str:
def f1_score(prediction, ground_truth) -> float:
def exact_match_score(prediction, ground_truth) -> bool:
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths) -> float:
def evaluate(expected_version, ds_file, pred_file) -> dict:

Import

from evaluate import evaluate, f1_score, exact_match_score, normalize_answer

I/O Contract

Inputs

Name	Type	Required	Description
expected_version	str	Yes	Expected SQuAD dataset version string (e.g., '1.1')
ds_file	str	Yes	Path to the SQuAD dataset JSON file
pred_file	str	Yes	Path to the predictions JSON file mapping question IDs to answer strings
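prediction	str	Yes (for score functions)	Predicted answer text
ground_truth	str	Yes (for score functions)	Ground-truth answer text
s	str	Yes (for normalize_answer)	Raw answer text to normalize
metric_fn	callable	Yes (for metric_max_over_ground_truths)	Scoring function such as f1_score or exact_match_score
ground_truths	list of str	Yes (for metric_max_over_ground_truths)	All accepted ground-truth answers for one question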
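The shape of the two input files can be sketched by writing toy versions to a temporary directory. The field names follow the public SQuAD v1.1 JSON layout; the file names, question ID, and content are made up for illustration.

```python
import json
import os
import tempfile

# Toy dataset in the SQuAD v1.1 layout: data -> paragraphs -> qas -> answers.
dataset = {
    "version": "1.1",
    "data": [{
        "title": "Toy_Article",
        "paragraphs": [{
            "context": "The Eiffel Tower is in Paris.",
            "qas": [{
                "id": "q1",
                "question": "Where is the Eiffel Tower?",
                "answers": [{"text": "Paris", "answer_start": 23}],
            }],
        }],
    }],
}

# Predictions are a flat mapping from question ID to answer string.
predictions = {"q1": "Paris"}

tmp_dir = tempfile.mkdtemp()
ds_path = os.path.join(tmp_dir, "toy-dev-v1.1.json")
pred_path = os.path.join(tmp_dir, "toy-predictions.json")

with open(ds_path, "w") as f:
    json.dump(dataset, f)
with open(pred_path, "w") as f:
    json.dump(predictions, f)

# Read the files back to confirm they round-trip as plain JSON.
with open(ds_path) as f:
    loaded_dataset = json.load(f)
with open(pred_path) as f:
    loaded_predictions = json.load(f)

print(loaded_dataset["version"], list(loaded_predictions))
```

Paths like ds_path and pred_path are what the ds_file and pred_file arguments expect.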

Outputs

Name	Type	Description
result	dict	Dictionary with keys 'exact_match' and 'f1', each a float percentage (0-100)

Usage Examples

from evaluate import evaluate

# Evaluate model predictions against SQuAD v1.1 dataset
results = evaluate(
    expected_version='1.1',
    ds_file='data/dev-v1.1.json',
    pred_file='output/predictions.json'
)
print(f"Exact Match: {results['exact_match']:.2f}%")
print(f"F1 Score: {results['f1']:.2f}%")
