Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Openai Whisper Output Formatting

From Leeroopedia

Overview

Output Formatting is the process of serializing transcription results into standard output formats. After speech recognition produces time-aligned text segments, these segments must be rendered into formats that are compatible with downstream consumers: video players, subtitle editors, text processors, and programmatic analysis tools.

Each output format has its own conventions for representing timestamps, text content, line breaks, and metadata.

Standard Output Formats

Plain Text (TXT)

The simplest format, containing only the transcribed text without any timing or metadata. Each segment's text is written sequentially. This format is suitable when only the textual content matters and timing information is not needed.

SubRip Subtitle (SRT)

A widely supported subtitle format used by most video players and subtitle editors. Structure:

  • A sequential integer index starting from 1
  • A timestamp line in the format HH:MM:SS,mmm --> HH:MM:SS,mmm (note: comma as decimal separator)
  • One or more lines of subtitle text
  • A blank line separating entries

SRT is the most common subtitle format for consumer video content.

Web Video Text Tracks (VTT)

The W3C standard for web-based video subtitles, used with HTML5 <video> elements. Similar to SRT but with key differences:

  • Begins with a WEBVTT header line
  • Uses period (.) as decimal separator in timestamps: HH:MM:SS.mmm --> HH:MM:SS.mmm
  • Supports additional styling and positioning metadata (though Whisper does not use these features)

VTT is the preferred format for web applications.

Tab-Separated Values (TSV)

A machine-readable tabular format with columns for start time, end time, and text. Timestamps are expressed in milliseconds as integers. This format is suitable for importing into spreadsheets, databases, or data analysis pipelines.

JSON

A structured format containing the full transcription result including:

  • The complete transcribed text
  • The detected language
  • A list of segments, each with all metadata (start, end, text, tokens, probabilities)
  • Optionally, word-level timestamps if enabled

JSON preserves the most information and is suitable for programmatic consumption.

Timestamp Formatting

Different formats represent the same underlying timing data in different ways:

Format Timestamp Representation Example
SRT HH:MM:SS,mmm 00:01:23,456
VTT HH:MM:SS.mmm 00:01:23.456
TSV Milliseconds (integer) 83456
JSON Seconds (float) 83.456
TXT (no timestamps) N/A

Format Selection Considerations

The choice of output format depends on the intended use:

  • Video subtitles: SRT for broad compatibility, VTT for web applications
  • Data analysis: TSV for tabular tools, JSON for programmatic access
  • Human reading: TXT for simple text content
  • Archival or full-fidelity: JSON preserves all metadata including probabilities and token IDs
  • Multiple consumers: An "all" option can produce every format simultaneously

See Also

2025-06-25 00:00 GMT

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment