Implementation:Openai Whisper Meanwhile Dataset
| Knowledge Sources | |
|---|---|
| Domains | Speech_Recognition, Evaluation |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Evaluation dataset containing ground-truth transcripts of The Late Show with Stephen Colbert's "Meanwhile" comedy segments, used for benchmarking long-form speech recognition accuracy.
Description
The Meanwhile Dataset is a JSON-formatted evaluation resource containing 64 labeled speech segments from YouTube. Each entry maps a YouTube video ID to a segment with precise start/end timestamps and verbatim transcript text. The transcripts are in uppercase and contain challenging vocabulary, unusual phrasing, and rapid speech patterns typical of late-night comedy monologues.
This dataset is used to evaluate Whisper's transcription accuracy on real-world, long-form English speech that includes colloquialisms, proper nouns, and creative language.
Usage
Use this dataset when benchmarking Whisper model transcription accuracy on English conversational speech. Load the JSON file and compare model output against ground-truth transcripts to compute Word Error Rate (WER) or other evaluation metrics.
Code Reference
Source Location
- Repository: Openai_Whisper
- File: data/meanwhile.json
- Lines: 1-322
Schema
{
"<youtube_video_id>": {
"begin": "<start_timestamp>",
"end": "<end_timestamp>",
"text": "<ground_truth_transcript>"
}
}
Import
import json
import os
dataset_path = os.path.join(os.path.dirname(__file__), "..", "data", "meanwhile.json")
with open(dataset_path) as f:
meanwhile = json.load(f)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to the meanwhile.json file |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dict[str, Dict] | Mapping of YouTube video IDs to segment metadata |
| segment.begin | str | Start timestamp in "M:SS.S" format |
| segment.end | str | End timestamp in "M:SS.S" format |
| segment.text | str | Ground-truth transcript text (uppercase) |
Usage Examples
Loading and Iterating
import json
# Load the evaluation dataset
with open("data/meanwhile.json") as f:
dataset = json.load(f)
# Iterate over segments
for video_id, segment in dataset.items():
print(f"Video: {video_id}")
print(f" Time: {segment['begin']} - {segment['end']}")
print(f" Text: {segment['text'][:80]}...")
Evaluation Against Whisper Output
import json
import whisper
from whisper.normalizers import EnglishTextNormalizer
model = whisper.load_model("base")
normalizer = EnglishTextNormalizer()
with open("data/meanwhile.json") as f:
dataset = json.load(f)
# Compare model output against ground truth for a segment
video_id = list(dataset.keys())[0]
segment = dataset[video_id]
# Transcribe the corresponding audio
result = model.transcribe(f"audio/{video_id}.wav")
# Normalize both for fair comparison
reference = normalizer(segment["text"])
hypothesis = normalizer(result["text"])