Implementation:Openai Whisper Transcribe Timestamp Parsing

Overview

This is a Pattern Doc. The timestamp parsing logic is not a standalone function or class but rather inline code within the transcribe() function at whisper/transcribe.py:L339-399. It parses interleaved text and timestamp tokens from the decoder output into structured, time-aligned text segments.

Source

File: whisper/transcribe.py:L339-399 (timestamp parsing), whisper/transcribe.py:L246-261 (new_segment helper)
Repository: https://github.com/openai/whisper

Pattern Interface

Input

Decoded tokens from DecodingTask, containing interleaved text tokens and timestamp tokens
Current seek position (in mel spectrogram frames)

Output

List[dict] — a list of segment dictionaries, each with keys: seek, start, end, text, tokens, temperature, avg_logprob, compression_ratio, no_speech_prob

Timestamp Token Identification

Timestamp tokens are identified by their token ID. Any token with an ID greater than or equal to tokenizer.timestamp_begin is a timestamp token. The actual time value is computed as:

time = (token_id - tokenizer.timestamp_begin) * time_precision

Where time_precision = 0.02 seconds, i.e. 20 milliseconds per timestamp step (each step spans two 10 ms mel spectrogram frames).
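As a minimal sketch of the conversion above, the helpers below map a token ID to a window-relative time. The value of TIMESTAMP_BEGIN is an illustrative assumption (real code should read tokenizer.timestamp_begin, which depends on the tokenizer), and the function names are hypothetical rather than part of Whisper.

```python
from typing import Optional

TIME_PRECISION = 0.02     # seconds per timestamp step, as stated above
TIMESTAMP_BEGIN = 50364   # assumed for illustration; use tokenizer.timestamp_begin in real code


def is_timestamp(token_id: int, timestamp_begin: int = TIMESTAMP_BEGIN) -> bool:
    """A token is a timestamp token when its ID is >= timestamp_begin."""
    return token_id >= timestamp_begin


def token_to_seconds(token_id: int, timestamp_begin: int = TIMESTAMP_BEGIN) -> Optional[float]:
    """Return the window-relative time of a timestamp token, or None for a text token."""
    if not is_timestamp(token_id, timestamp_begin):
        return None
    return (token_id - timestamp_begin) * TIME_PRECISION
```

Under these assumptions, a token 100 steps above timestamp_begin maps to about 2.0 seconds into the current window, and any token below timestamp_begin is treated as text.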

The `new_segment` Helper

The new_segment function (defined at L246-261) creates a segment dictionary from the parsed components:

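def new_segment(*, start, end, tokens, result):
    # Closure inside transcribe(): seek and tokenizer come from the enclosing scope.
    tokens = tokens.tolist()  # tokens arrives as a tensor slice
    # Keep only text tokens for decoding; special and timestamp tokens have IDs >= tokenizer.eot.
    text_tokens = [token for token in tokens if token < tokenizer.eot]
    return {
        "seek": seek,                                  # current seek position
        "start": start,                                # absolute start time in seconds
        "end": end,                                    # absolute end time in seconds
        "text": tokenizer.decode(text_tokens),         # decoded text without timestamp markers
        "tokens": tokens,                              # list of token IDs
        "temperature": result.temperature,
        "avg_logprob": result.avg_logprob,
        "compression_ratio": result.compression_ratio,
        "no_speech_prob": result.no_speech_prob,
    }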

This helper bundles the segment's text content with its timing information and decoding metadata (temperature, log probability, compression ratio, no-speech probability) into a single dictionary.

Parsing Algorithm

The timestamp parsing logic scans through the decoded token sequence looking for timestamp token pairs that bracket text segments:

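# Inside transcribe() loop: simplified pseudocode, not the literal source
current_start, current_tokens = None, []
for token in decoded_tokens:
    if token >= tokenizer.timestamp_begin:
        timestamp = (token - tokenizer.timestamp_begin) * time_precision
        if current_start is None:
            current_start = timestamp            # opening timestamp of a segment
        else:
            current_end = timestamp              # closing timestamp: emit the segment
            segments.append(new_segment(start=seek_time + current_start, end=seek_time + current_end,
                                        tokens=current_tokens, result=result))
            current_start, current_tokens = None, []
    else:
        current_tokens.append(token)             # text token belonging to the open segment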

Detailed Steps

1. Scan through the decoded token list from left to right.
2. Identify timestamp tokens by checking if token_id >= tokenizer.timestamp_begin.
3. Determine role: Based on position in the alternating pattern, determine whether a timestamp token marks a segment start or end.
4. Extract text tokens: Collect all non-timestamp tokens between a start/end timestamp pair.
5. Compute absolute times: Add the seek-based offset to convert window-relative timestamps to absolute audio timestamps:
   - start = seek_time + relative_start
   - end = seek_time + relative_end
6. Create segment: Call new_segment() with the computed start, end, tokens, and the DecodingResult metadata.
7. Append the segment to the output list.
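The steps above can be sketched on plain Python lists as follows. This is an illustration under stated assumptions, not the upstream code: the real logic works on tensors inside transcribe(), parse_segments is a hypothetical name, TIMESTAMP_BEGIN is an assumed value, and the segment dictionaries here carry only start, end, and tokens.

```python
from typing import Dict, List, Optional

TIME_PRECISION = 0.02     # seconds per timestamp step
TIMESTAMP_BEGIN = 50364   # assumed for illustration; use tokenizer.timestamp_begin in real code


def parse_segments(tokens: List[int], seek_time: float,
                   timestamp_begin: int = TIMESTAMP_BEGIN) -> List[Dict]:
    """Pair start/end timestamp tokens and collect the text tokens between them."""
    segments: List[Dict] = []
    start: Optional[float] = None      # relative start of the open segment, if any
    text_tokens: List[int] = []
    for token in tokens:
        if token >= timestamp_begin:
            relative = (token - timestamp_begin) * TIME_PRECISION
            if start is None:
                start = relative                       # opening timestamp
            else:
                if text_tokens:                        # skip timestamp pairs with no text
                    segments.append({
                        "start": seek_time + start,    # window-relative -> absolute
                        "end": seek_time + relative,
                        "tokens": text_tokens,
                    })
                start, text_tokens = None, []
        elif start is not None:
            text_tokens.append(token)                  # text inside the open segment
    # A segment still open here has no closing timestamp and is not emitted.
    return segments


example = [TIMESTAMP_BEGIN, 11, 12, TIMESTAMP_BEGIN + 100,
           TIMESTAMP_BEGIN + 100, 13, TIMESTAMP_BEGIN + 250]
segments = parse_segments(example, seek_time=30.0)
```

For the example token list and a seek offset of 30.0 seconds, the sketch produces two segments, covering roughly 30.0 to 32.0 seconds and 32.0 to 35.0 seconds.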

Seek Advancement

After the tokens decoded for a window have been parsed, the seek pointer advances to the position of the last timestamp found. This determines where the next 30-second window starts:

- If the last timestamp is at 20.0 seconds within the window, the seek advances by 20.0 seconds' worth of mel frames (2000 frames at 100 frames per second, which corresponds to timestamp step 1000).
- If no timestamps were produced, the seek advances by the full 30-second window (3000 mel frames, which corresponds to 1500 timestamp steps).
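The arithmetic behind the seek advance can be sketched as below, assuming seek is counted in mel frames at 100 frames per second (a 10 ms hop), so that one 0.02-second timestamp step equals two frames. The constant and function names are hypothetical and chosen for illustration.

```python
from typing import Optional

FRAMES_PER_SECOND = 100                                  # assumed: 10 ms hop between mel frames
WINDOW_SECONDS = 30
WINDOW_FRAMES = WINDOW_SECONDS * FRAMES_PER_SECOND       # full window in mel frames
TIME_PRECISION = 0.02                                    # seconds per timestamp step
FRAMES_PER_STEP = round(TIME_PRECISION * FRAMES_PER_SECOND)  # mel frames per timestamp step


def seek_advance(last_timestamp_step: Optional[int]) -> int:
    """Frames to advance: up to the last timestamp, or a full window if there is none."""
    if last_timestamp_step is None:
        return WINDOW_FRAMES
    return last_timestamp_step * FRAMES_PER_STEP
```

A last timestamp at 20.0 seconds is timestamp step 1000, which this sketch converts to an advance of 2000 mel frames; with no timestamp at all the advance is the full 3000-frame window.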

Handling Special Cases

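- No timestamp tokens: When without_timestamps=True is set in DecodingOptions, the entire decoded output is treated as one segment spanning the full window.
- Consecutive timestamps with no text: Two adjacent timestamp tokens mark a boundary, closing one segment and opening the next. Any gap between their values represents silence or a pause, and no text segment is created for it.
- Single unpaired timestamp: If the output ends with text followed by a single timestamp, that timestamp closes the final segment and the seek advances by the full window, since no speech follows it. If the output instead ends with text that has no closing timestamp, the unfinished segment is ignored and the seek moves to the last timestamp, so that text is decoded again in the next window.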

Context in the Pipeline

This parsing logic runs inside the main transcribe() loop, after each call to decode():

1. transcribe() extracts a 30-second mel window
2. decode() produces a DecodingResult with tokens
3. Timestamp parsing (this pattern) converts tokens to segments
4. Segments are accumulated into the final result list
5. The seek pointer advances based on the parsed timestamps
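The control flow of this loop can be sketched with a stand-in for the decode-and-parse step. Everything here is hypothetical scaffolding (run_loop, fake_window, and the callback signature are not Whisper APIs); it only shows how segments accumulate while the seek pointer moves forward.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical callback: takes the seek position, returns (segments, frames to advance).
WindowFn = Callable[[int], Tuple[List[Dict], int]]


def run_loop(total_frames: int, process_window: WindowFn) -> List[Dict]:
    """Accumulate segments window by window until the audio is exhausted."""
    seek = 0
    all_segments: List[Dict] = []
    while seek < total_frames:
        segments, advance = process_window(seek)   # extract window, decode, parse
        all_segments.extend(segments)
        seek += max(advance, 1)                    # always make progress in this sketch
    return all_segments


def fake_window(seek: int) -> Tuple[List[Dict], int]:
    """Stand-in for decode + parsing: one 20-second segment, then advance 2000 frames."""
    start = seek / 100                             # assumes 100 mel frames per second
    return [{"seek": seek, "start": start, "end": start + 20.0}], 2000


result = run_loop(5000, fake_window)
```

With 5000 frames of audio and a fixed 2000-frame advance, the stand-in is called at seek positions 0, 2000, and 4000, so three segments are collected.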
