

Implementation:Openai Whisper Meanwhile Dataset

From Leeroopedia
Knowledge Sources
Domains Speech_Recognition, Evaluation
Last Updated 2026-02-13 22:00 GMT

Overview

Evaluation dataset containing ground-truth transcripts of The Late Show with Stephen Colbert's "Meanwhile" comedy segments, used for benchmarking long-form speech recognition accuracy.

Description

The Meanwhile Dataset is a JSON-formatted evaluation resource containing 64 labeled speech segments from YouTube. Each entry maps a YouTube video ID to a segment with precise start/end timestamps and verbatim transcript text. The transcripts are in uppercase and contain challenging vocabulary, unusual phrasing, and rapid speech patterns typical of late-night comedy monologues.

This dataset is used to evaluate Whisper's transcription accuracy on real-world, long-form English speech that includes colloquialisms, proper nouns, and creative language.

Usage

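Use this dataset when benchmarking Whisper transcription accuracy on long-form English speech. Load the JSON file, transcribe the audio between each entry's begin and end timestamps, and compare the model output against the ground-truth transcript to compute Word Error Rate (WER) or other evaluation metrics. Because the transcripts are uppercase, normalize both texts before scoring.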
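
As an illustration of the WER metric mentioned above, the following is a minimal, dependency-free sketch rather than the scoring code used by any particular benchmark. The function name `word_error_rate` is hypothetical, and the sketch assumes both strings have already been normalized and can be tokenized by splitting on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many more words than the reference, which is one reason to make sure the transcribed audio covers the same span as the ground truth.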

Code Reference

Source Location

Schema

{
    "<youtube_video_id>": {
        "begin": "<start_timestamp>",
        "end": "<end_timestamp>",
        "text": "<ground_truth_transcript>"
    }
}
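
A quick structural check against this schema can catch a truncated or hand-edited file before an evaluation run. The sketch below is only an assumption about what such a check might look like; `REQUIRED_FIELDS` and `find_invalid_entries` are hypothetical names, not part of Whisper.

```python
# Fields that every segment is expected to carry, per the schema above.
REQUIRED_FIELDS = ("begin", "end", "text")


def find_invalid_entries(dataset: dict) -> list:
    """Return the video IDs whose entry lacks a required string field."""
    invalid = []
    for video_id, segment in dataset.items():
        is_valid = isinstance(segment, dict) and all(
            isinstance(segment.get(field), str) for field in REQUIRED_FIELDS
        )
        if not is_valid:
            invalid.append(video_id)
    return invalid
```

An empty return value means every entry has string values for `begin`, `end`, and `text`; it does not verify that the timestamps are well formed.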

Import

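import json
import os

# Resolve the dataset relative to this script, assuming the script sits in a
# directory that is a sibling of data/
dataset_path = os.path.join(os.path.dirname(__file__), "..", "data", "meanwhile.json")
with open(dataset_path, encoding="utf-8") as f:
    meanwhile = json.load(f)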

I/O Contract

Inputs

Name | Type | Required | Description
file_path | str | Yes | Path to the meanwhile.json file

Outputs

Name | Type | Description
dataset | Dict[str, Dict] | Mapping of YouTube video IDs to segment metadata
segment.begin | str | Start timestamp in "M:SS.S" format
segment.end | str | End timestamp in "M:SS.S" format
segment.text | str | Ground-truth transcript text (uppercase)
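
Since `begin` and `end` are strings, they usually need converting to seconds before they can be used to locate audio. The helper below is a sketch under the assumption that timestamps follow the "M:SS.S" format listed above; `timestamp_to_seconds` is a hypothetical name. Folding over the colon-separated parts also tolerates an hours field, should one ever appear.

```python
def timestamp_to_seconds(timestamp: str) -> float:
    """Convert a colon-separated timestamp such as "1:04.0" to seconds."""
    total = 0.0
    for part in timestamp.split(":"):
        # Each colon shifts the running total up by a factor of 60.
        total = total * 60 + float(part)
    return total


print(timestamp_to_seconds("1:04.0"))  # 64.0
```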

Usage Examples

Loading and Iterating

import json

# Load the evaluation dataset
with open("data/meanwhile.json") as f:
    dataset = json.load(f)

# Iterate over segments
for video_id, segment in dataset.items():
    print(f"Video: {video_id}")
    print(f"  Time: {segment['begin']} - {segment['end']}")
    print(f"  Text: {segment['text'][:80]}...")

Evaluation Against Whisper Output

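import json
import whisper
from whisper.normalizers import EnglishTextNormalizer

model = whisper.load_model("base")
normalizer = EnglishTextNormalizer()

with open("data/meanwhile.json") as f:
    dataset = json.load(f)

# Pick the first segment and its ground truth
video_id = next(iter(dataset))
segment = dataset[video_id]

# Transcribe the corresponding audio. This assumes audio/<video_id>.wav has
# already been trimmed to the segment's begin/end timestamps; otherwise the
# hypothesis covers more speech than the reference.
result = model.transcribe(f"audio/{video_id}.wav")

# Normalize both for fair comparison (the reference is uppercase)
reference = normalizer(segment["text"])
hypothesis = normalizer(result["text"])

# Inspect the pair before scoring it with a WER implementation
print(f"Reference:  {reference[:80]}...")
print(f"Hypothesis: {hypothesis[:80]}...")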
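
If the audio file holds the full video rather than just the labeled segment, the samples can be cropped to the begin/end window before transcription. The following is a sketch, not part of the dataset's tooling: `crop_segment` is a hypothetical helper, it takes offsets already converted to seconds, and the 16 kHz default is an assumption chosen to match the rate at which `whisper.load_audio` resamples audio.

```python
SAMPLE_RATE = 16000  # assumption: same value as whisper.audio.SAMPLE_RATE


def crop_segment(audio, begin_s: float, end_s: float, sample_rate: int = SAMPLE_RATE):
    """Return the samples between begin_s and end_s (both in seconds)."""
    start = int(round(begin_s * sample_rate))
    stop = int(round(end_s * sample_rate))
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= begin_s < end_s")
    # Slicing works for NumPy arrays and plain Python sequences alike.
    return audio[start:stop]


# Possible use with Whisper (commented out because it needs a model and audio):
# audio = whisper.load_audio(f"audio/{video_id}.wav")
# clip = crop_segment(audio, 64.0, 131.0)
# result = model.transcribe(clip)
```

Cropping on the sample array avoids writing intermediate files, and `model.transcribe` accepts the resulting array directly in place of a file path.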
