Principle:Openai Whisper Output Formatting

Overview

Output Formatting is the process of serializing transcription results into standard output formats. After speech recognition produces time-aligned text segments, these segments must be rendered into formats that are compatible with downstream consumers: video players, subtitle editors, text processors, and programmatic analysis tools.

Each output format has its own conventions for representing timestamps, text content, line breaks, and metadata.

Standard Output Formats

Plain Text (TXT)

The simplest format, containing only the transcribed text without any timing or metadata. Each segment's text is written sequentially. This format is suitable when only the textual content matters and timing information is not needed.

SubRip Subtitle (SRT)

A widely supported subtitle format used by most video players and subtitle editors. Structure:

A sequential integer index starting from 1
A timestamp line in the format HH:MM:SS,mmm --> HH:MM:SS,mmm (note: comma as decimal separator)
One or more lines of subtitle text
A blank line separating entries

SRT is the most common subtitle format for consumer video content.

Web Video Text Tracks (VTT)

The W3C standard for web-based video subtitles, used with HTML5 <video> elements. Similar to SRT but with key differences:

Begins with a WEBVTT header line
Uses period (.) as decimal separator in timestamps: HH:MM:SS.mmm --> HH:MM:SS.mmm
Supports additional styling and positioning metadata (though Whisper does not use these features)

VTT is the preferred format for web applications.

Tab-Separated Values (TSV)

A machine-readable tabular format with columns for start time, end time, and text. Timestamps are expressed in milliseconds as integers. This format is suitable for importing into spreadsheets, databases, or data analysis pipelines.

JSON

A structured format containing the full transcription result including:

The complete transcribed text
The detected language
A list of segments, each with all metadata (start, end, text, tokens, probabilities)
Optionally, word-level timestamps if enabled

JSON preserves the most information and is suitable for programmatic consumption.

Timestamp Formatting

Different formats represent the same underlying timing data in different ways:

Format	Timestamp Representation	Example
SRT	`HH:MM:SS,mmm`	`00:01:23,456`
VTT	`HH:MM:SS.mmm`	`00:01:23.456`
TSV	Milliseconds (integer)	`83456`
JSON	Seconds (float)	`83.456`
TXT	(no timestamps)	N/A

Format Selection Considerations

The choice of output format depends on the intended use:

Video subtitles: SRT for broad compatibility, VTT for web applications
Data analysis: TSV for tabular tools, JSON for programmatic access
Human reading: TXT for simple text content
Archival or full-fidelity: JSON preserves all metadata including probabilities and token IDs
Multiple consumers: An "all" option can produce every format simultaneously

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment