Implementation:Microsoft DeepSpeedExamples Squad Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Evaluation |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Official evaluation script for SQuAD v1.1. It computes exact match and token-level F1 scores between model predictions and ground-truth answers.
Description
This module implements the standard SQuAD v1.1 evaluation protocol. The core evaluate function loads a SQuAD dataset JSON file and a predictions JSON file, then iterates over every question in the dataset to compute two metrics: exact_match (whether the normalized prediction equals any normalized ground-truth answer) and f1 (token-level overlap between the prediction and its best-matching ground-truth answer). Both scores are averaged over all questions in the dataset, including unanswered ones, and reported as percentages.
Answer normalization is handled by the normalize_answer function, which applies four steps in order: lowercasing, punctuation removal, article removal (a, an, the), and whitespace normalization. The f1_score function normalizes both strings, splits them into whitespace-delimited tokens, counts shared tokens with a collections.Counter intersection, and returns the harmonic mean of token-level precision and recall (zero when no tokens overlap). The exact_match_score function checks string equality after normalization. The metric_max_over_ground_truths helper scores a prediction against every valid ground-truth answer and keeps the highest score.
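The four functions described above can be sketched as follows. This is a minimal re-implementation written from the description on this page, not a verbatim copy of evaluate.py, so details such as the exact regular expression are assumptions.

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    # Assumed pattern: whole-word a/an/the are replaced by a space.
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def f1_score(prediction, ground_truth):
    """Token-level F1 between a prediction and one ground-truth answer."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Counter intersection keeps the minimum count of each shared token.
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_score(prediction, ground_truth):
    """True when the two strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    """Best score of the prediction against any of the reference answers."""
    return max(metric_fn(prediction, gt) for gt in ground_truths)
```

For example, the prediction "the tall Eiffel Tower" scored against "Eiffel Tower in Paris" shares two tokens, giving precision 2/3, recall 2/4, and an F1 of 4/7.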
The evaluate function compares the dataset's version field with expected_version and logs a warning for each question that has no entry in the predictions file; such questions receive a score of zero. This is the same evaluation logic used by the official SQuAD v1.1 leaderboard.
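The aggregation step can be illustrated with a small sketch that works on already-loaded JSON objects. The helper name score_dataset and its em_fn and f1_fn parameters are hypothetical and exist only for this example. The sketch also assumes that a version mismatch is reported as a warning on stderr rather than raised as an error.

```python
import sys


def score_dataset(dataset_json, predictions, expected_version, em_fn, f1_fn):
    """Hypothetical helper: aggregate EM and F1 over in-memory SQuAD data."""
    if dataset_json["version"] != expected_version:
        # Assumption: a mismatch only produces a warning and scoring continues.
        print("Expected v-" + expected_version + ", got v-"
              + dataset_json["version"], file=sys.stderr)

    exact_match = f1 = 0.0
    total = 0
    for article in dataset_json["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                total += 1  # unanswered questions still count in the average
                if qa["id"] not in predictions:
                    print("Unanswered question " + qa["id"]
                          + " will receive score 0.", file=sys.stderr)
                    continue
                truths = [answer["text"] for answer in qa["answers"]]
                prediction = predictions[qa["id"]]
                # Keep the best score over all reference answers.
                exact_match += max(em_fn(prediction, t) for t in truths)
                f1 += max(f1_fn(prediction, t) for t in truths)

    # Note: an empty dataset (total == 0) would raise ZeroDivisionError here.
    return {"exact_match": 100.0 * exact_match / total,
            "f1": 100.0 * f1 / total}
```

Because the question counter is incremented before the lookup, a missing prediction lowers both averages instead of being skipped.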
Usage
Use this module to evaluate SQuAD question-answering model predictions. Call the evaluate function with the expected version string, the path to the SQuAD dataset JSON, and the path to the model predictions JSON to obtain exact match and F1 scores.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: training/BingBertSquad/evaluate.py
- Lines: 1-85
Signature
def normalize_answer(s) -> str:
def f1_score(prediction, ground_truth) -> float:
def exact_match_score(prediction, ground_truth) -> bool:
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths) -> float | bool:
def evaluate(expected_version, ds_file, pred_file) -> dict:
Import
from evaluate import evaluate, f1_score, exact_match_score, normalize_answer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| expected_version | str | Yes | Expected SQuAD dataset version string (e.g., '1.1') |
| ds_file | str | Yes | Path to the SQuAD dataset JSON file |
| pred_file | str | Yes | Path to the predictions JSON file mapping question IDs to answer strings |
| prediction | str | Yes (for score functions) | Predicted answer text |
| ground_truth | str | Yes (for score functions) | Ground-truth answer text |
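As an illustration of the two input files, the sketch below writes a one-question dataset in the SQuAD v1.1 layout and a matching predictions file to a temporary directory. The file names and the sample question are made up for this example.

```python
import json
import os
import tempfile

# Minimal dataset in the SQuAD v1.1 layout: articles -> paragraphs -> qas.
dataset = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Paris is the capital of France.",
            "qas": [{
                "id": "q1",
                "question": "What is the capital of France?",
                "answers": [{"text": "Paris", "answer_start": 0}],
            }],
        }],
    }],
}

# Predictions map each question ID to a single answer string.
predictions = {"q1": "Paris"}

workdir = tempfile.mkdtemp()
ds_file = os.path.join(workdir, "tiny-dev.json")            # hypothetical name
pred_file = os.path.join(workdir, "tiny-predictions.json")  # hypothetical name

with open(ds_file, "w") as f:
    json.dump(dataset, f)
with open(pred_file, "w") as f:
    json.dump(predictions, f)
```

The resulting ds_file and pred_file paths have the shape that the evaluate function expects for its second and third arguments.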
Outputs
| Name | Type | Description |
|---|---|---|
| result | dict | Dictionary with keys 'exact_match' and 'f1', each a float percentage (0-100) |
Usage Examples
from evaluate import evaluate
# Evaluate model predictions against SQuAD v1.1 dataset
results = evaluate(
    expected_version='1.1',
    ds_file='data/dev-v1.1.json',
    pred_file='output/predictions.json'
)
print(f"Exact Match: {results['exact_match']:.2f}%")
print(f"F1 Score: {results['f1']:.2f}%")