Implementation:Openai Whisper Meanwhile Dataset

Knowledge Sources	Openai_Whisper
Domains	Speech_Recognition, Evaluation
Last Updated	2026-02-13 22:00 GMT

Overview

Evaluation dataset containing ground-truth transcripts of The Late Show with Stephen Colbert's "Meanwhile" comedy segments, used for benchmarking long-form speech recognition accuracy.

Description

The Meanwhile Dataset is a JSON-formatted evaluation resource containing 64 labeled speech segments from YouTube. Each entry maps a YouTube video ID to a segment with precise start/end timestamps and verbatim transcript text. The transcripts are in uppercase and contain challenging vocabulary, unusual phrasing, and rapid speech patterns typical of late-night comedy monologues.

This dataset is used to evaluate Whisper's transcription accuracy on real-world, long-form English speech that includes colloquialisms, proper nouns, and creative language.

Usage

Use this dataset when benchmarking Whisper model transcription accuracy on English conversational speech. Load the JSON file and compare model output against ground-truth transcripts to compute Word Error Rate (WER) or other evaluation metrics.

Code Reference

Source Location

Repository: Openai_Whisper
File: data/meanwhile.json
Lines: 1-322

Schema

{
    "<youtube_video_id>": {
        "begin": "<start_timestamp>",
        "end": "<end_timestamp>",
        "text": "<ground_truth_transcript>"
    }
}

Import

import json
import os

dataset_path = os.path.join(os.path.dirname(__file__), "..", "data", "meanwhile.json")
with open(dataset_path) as f:
    meanwhile = json.load(f)

I/O Contract

Inputs

Name	Type	Required	Description
file_path	str	Yes	Path to the meanwhile.json file

Outputs

Name	Type	Description
dataset	Dict[str, Dict]	Mapping of YouTube video IDs to segment metadata
segment.begin	str	Start timestamp in "M:SS.S" format
segment.end	str	End timestamp in "M:SS.S" format
segment.text	str	Ground-truth transcript text (uppercase)

Usage Examples

Loading and Iterating

import json

# Load the evaluation dataset
with open("data/meanwhile.json") as f:
    dataset = json.load(f)

# Iterate over segments
for video_id, segment in dataset.items():
    print(f"Video: {video_id}")
    print(f"  Time: {segment['begin']} - {segment['end']}")
    print(f"  Text: {segment['text'][:80]}...")

Evaluation Against Whisper Output

import json
import whisper
from whisper.normalizers import EnglishTextNormalizer

model = whisper.load_model("base")
normalizer = EnglishTextNormalizer()

with open("data/meanwhile.json") as f:
    dataset = json.load(f)

# Compare model output against ground truth for a segment
video_id = list(dataset.keys())[0]
segment = dataset[video_id]

# Transcribe the corresponding audio
result = model.transcribe(f"audio/{video_id}.wav")

# Normalize both for fair comparison
reference = normalizer(segment["text"])
hypothesis = normalizer(result["text"])

Related Pages

Environment:Openai_Whisper_PyTorch_CUDA

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment