Principle:Openai Whisper Output Formatting
Overview
Output Formatting is the process of serializing transcription results into standard output formats. After speech recognition produces time-aligned text segments, these segments must be rendered into formats that are compatible with downstream consumers: video players, subtitle editors, text processors, and programmatic analysis tools.
Each output format has its own conventions for representing timestamps, text content, line breaks, and metadata.
Standard Output Formats
Plain Text (TXT)
The simplest format, containing only the transcribed text without any timing or metadata. Each segment's text is written sequentially. This format is suitable when only the textual content matters and timing information is not needed.
SubRip Subtitle (SRT)
A widely supported subtitle format used by most video players and subtitle editors. Structure:
- A sequential integer index starting from 1
- A timestamp line in the format
HH:MM:SS,mmm --> HH:MM:SS,mmm(note: comma as decimal separator) - One or more lines of subtitle text
- A blank line separating entries
SRT is the most common subtitle format for consumer video content.
Web Video Text Tracks (VTT)
The W3C standard for web-based video subtitles, used with HTML5 <video> elements. Similar to SRT but with key differences:
- Begins with a
WEBVTTheader line - Uses period (.) as decimal separator in timestamps:
HH:MM:SS.mmm --> HH:MM:SS.mmm - Supports additional styling and positioning metadata (though Whisper does not use these features)
VTT is the preferred format for web applications.
Tab-Separated Values (TSV)
A machine-readable tabular format with columns for start time, end time, and text. Timestamps are expressed in milliseconds as integers. This format is suitable for importing into spreadsheets, databases, or data analysis pipelines.
JSON
A structured format containing the full transcription result including:
- The complete transcribed text
- The detected language
- A list of segments, each with all metadata (start, end, text, tokens, probabilities)
- Optionally, word-level timestamps if enabled
JSON preserves the most information and is suitable for programmatic consumption.
Timestamp Formatting
Different formats represent the same underlying timing data in different ways:
| Format | Timestamp Representation | Example |
|---|---|---|
| SRT | HH:MM:SS,mmm |
00:01:23,456
|
| VTT | HH:MM:SS.mmm |
00:01:23.456
|
| TSV | Milliseconds (integer) | 83456
|
| JSON | Seconds (float) | 83.456
|
| TXT | (no timestamps) | N/A |
Format Selection Considerations
The choice of output format depends on the intended use:
- Video subtitles: SRT for broad compatibility, VTT for web applications
- Data analysis: TSV for tabular tools, JSON for programmatic access
- Human reading: TXT for simple text content
- Archival or full-fidelity: JSON preserves all metadata including probabilities and token IDs
- Multiple consumers: An "all" option can produce every format simultaneously