Implementation:Openai Whisper Transcribe Timestamp Parsing

From Leeroopedia
Revision as of 13:42, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Openai_Whisper_Transcribe_Timestamp_Parsing.md)

Overview

This is a Pattern Doc. The timestamp parsing logic is not a standalone function or class but rather inline code within the transcribe() function at whisper/transcribe.py:L339-399. It parses interleaved text and timestamp tokens from the decoder output into structured, time-aligned text segments.

Pattern Interface

Input

  • Decoded tokens from the DecodingResult returned by decode(), containing interleaved text tokens and timestamp tokens
  • Current seek position (in mel spectrogram frames)

Output

  • List[dict] — a list of segment dictionaries, each with keys: seek, start, end, text, tokens, temperature, avg_logprob, compression_ratio, no_speech_prob

Timestamp Token Identification

Timestamp tokens are identified by their token ID. Any token with an ID greater than or equal to tokenizer.timestamp_begin is a timestamp token. The actual time value is computed as:

time = (token_id - tokenizer.timestamp_begin) * time_precision

Where time_precision = 0.02 seconds: each timestamp step covers 20 milliseconds, which is two mel spectrogram frames of 10 milliseconds each.
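
The conversion above can be sketched in a few lines. This is a minimal illustration, not code from the Whisper repository: the helper names are hypothetical, and the value of TIMESTAMP_BEGIN is an assumed placeholder for whatever tokenizer.timestamp_begin returns for the loaded vocabulary.

```python
# Assumption: placeholder for tokenizer.timestamp_begin; the real value
# depends on the vocabulary that is loaded.
TIMESTAMP_BEGIN = 50364
TIME_PRECISION = 0.02  # seconds per timestamp step


def is_timestamp(token_id: int) -> bool:
    """Return True when the token ID falls in the timestamp range."""
    return token_id >= TIMESTAMP_BEGIN


def timestamp_to_seconds(token_id: int) -> float:
    """Convert a timestamp token ID to seconds relative to the window start."""
    if not is_timestamp(token_id):
        raise ValueError(f"{token_id} is not a timestamp token")
    return (token_id - TIMESTAMP_BEGIN) * TIME_PRECISION


def seconds_to_timestamp(seconds: float) -> int:
    """Inverse mapping: nearest timestamp token ID for a window-relative time."""
    return TIMESTAMP_BEGIN + round(seconds / TIME_PRECISION)
```

For example, the token 100 steps above TIMESTAMP_BEGIN maps to 2.0 seconds into the current window, and seconds_to_timestamp(2.0) maps back to the same token.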

The new_segment Helper

The new_segment function (defined at L246-261) creates a segment dictionary from the parsed components:

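def new_segment(*, start, end, tokens, result):
    # Nested inside transcribe(), so seek and tokenizer come from the enclosing scope.
    tokens = tokens.tolist()  # tensor slice of token IDs -> plain list
    # Decode text from ordinary tokens only; special and timestamp IDs are >= tokenizer.eot.
    text_tokens = [token for token in tokens if token < tokenizer.eot]
    return {
        "seek": seek,  # current seek position
        "start": start,  # absolute start time in seconds
        "end": end,  # absolute end time in seconds
        "text": tokenizer.decode(text_tokens),
        "tokens": tokens,  # list of token IDs
        "temperature": result.temperature,
        "avg_logprob": result.avg_logprob,
        "compression_ratio": result.compression_ratio,
        "no_speech_prob": result.no_speech_prob,
    }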

This helper bundles the segment's text content with its timing information and decoding metadata (temperature, log probability, compression ratio, no-speech probability) into a single dictionary.

Parsing Algorithm

The timestamp parsing logic scans through the decoded token sequence looking for timestamp token pairs that bracket text segments:

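# Inside the transcribe() loop: simplified pseudocode, not the verbatim source.
# decoded_tokens is a torch.Tensor of token IDs; seek_time is the window offset in seconds.
start_index = None  # position of the timestamp token that opened the current segment
for i, token in enumerate(decoded_tokens):
    if token < tokenizer.timestamp_begin:
        continue  # text token: it stays inside the current segment's slice
    if start_index is None:
        start_index = i  # this timestamp opens a segment
        continue
    # This timestamp closes the segment; both times are relative to the window start.
    sliced_tokens = decoded_tokens[start_index : i + 1]
    start = (sliced_tokens[0].item() - tokenizer.timestamp_begin) * time_precision
    end = (sliced_tokens[-1].item() - tokenizer.timestamp_begin) * time_precision
    segments.append(
        new_segment(start=seek_time + start, end=seek_time + end, tokens=sliced_tokens, result=result)
    )
    start_index = None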

Detailed Steps

  1. Scan through the decoded token list from left to right.
  2. Identify timestamp tokens by checking if token_id >= tokenizer.timestamp_begin.
  3. Determine role: Based on position in the alternating pattern, determine whether a timestamp token marks a segment start or end.
  4. Extract text tokens: Collect all non-timestamp tokens between a start/end timestamp pair.
  5. Compute absolute times: Add the seek-based offset to convert window-relative timestamps to absolute audio timestamps:
    • start = seek_time + relative_start
    • end = seek_time + relative_end
  6. Create segment: Call new_segment() with the computed start, end, tokens, and the DecodingResult metadata.
  7. Append the segment to the output list.
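
The steps above can be sketched as a self-contained function. This is an illustration under stated assumptions rather than the upstream code: it uses plain Python lists instead of tensors, parse_segments is a hypothetical name, TIMESTAMP_BEGIN is an assumed placeholder for tokenizer.timestamp_begin, and the decoding metadata fields are left out for brevity.

```python
from typing import Dict, List

TIMESTAMP_BEGIN = 50364  # assumption: placeholder for tokenizer.timestamp_begin
TIME_PRECISION = 0.02    # seconds per timestamp step


def parse_segments(tokens: List[int], seek_time: float) -> List[Dict]:
    """Pair up timestamp tokens and collect the text tokens between them."""
    segments: List[Dict] = []
    start_token = None           # timestamp token that opened the current segment
    text_tokens: List[int] = []  # text tokens seen since that opening timestamp

    for token in tokens:                      # step 1: left-to-right scan
        if token >= TIMESTAMP_BEGIN:          # step 2: timestamp check
            if start_token is None:           # step 3: this one opens a segment
                start_token = token
                text_tokens = []
            else:                             # step 3: this one closes it
                rel_start = (start_token - TIMESTAMP_BEGIN) * TIME_PRECISION
                rel_end = (token - TIMESTAMP_BEGIN) * TIME_PRECISION
                segments.append(              # steps 5 to 7
                    {
                        "start": seek_time + rel_start,
                        "end": seek_time + rel_end,
                        "tokens": text_tokens,
                    }
                )
                start_token = None
        elif start_token is not None:         # step 4: text inside a segment
            text_tokens.append(token)
        # Text tokens that appear before any opening timestamp are ignored here.
    return segments
```

Called with a window that starts 30 seconds into the audio, a token list of the form [t0, text, text, t100, t100, text, t250] yields two segments, one from 30.0 to 32.0 seconds and one from 32.0 to 35.0 seconds.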

Seek Advancement

After parsing all timestamp tokens from a decoded window, the seek pointer advances to the position of the last timestamp found. This determines where the next 30-second window starts. The seek is counted in mel spectrogram frames of 10 milliseconds each, so one 20-millisecond timestamp step corresponds to two frames:

  • If the last timestamp is at 20.0 seconds within the window, the seek advances by 20.0 seconds' worth of frames (2000 frames).
  • If no timestamps were produced, the seek advances by the full 30-second window (3000 frames).
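
A minimal sketch of this seek arithmetic follows. It assumes the seek counts 10 ms mel frames, so that one 20 ms timestamp step spans two frames and a 30-second window holds 3000 frames. The function name advance_seek, the constants INPUT_STRIDE and N_FRAMES, and the value of TIMESTAMP_BEGIN are illustrative assumptions, not taken from this page.

```python
from typing import Optional

TIMESTAMP_BEGIN = 50364  # assumption: placeholder for tokenizer.timestamp_begin
INPUT_STRIDE = 2         # assumption: mel frames per timestamp step (20 ms / 10 ms)
N_FRAMES = 3000          # assumption: mel frames in one 30-second window


def advance_seek(seek: int, last_timestamp_token: Optional[int]) -> int:
    """Return the seek position (in mel frames) for the next window."""
    if last_timestamp_token is None:
        # No timestamps were produced: skip the whole window.
        return seek + N_FRAMES
    steps = last_timestamp_token - TIMESTAMP_BEGIN
    # Each timestamp step is INPUT_STRIDE mel frames wide.
    return seek + steps * INPUT_STRIDE
```

A last timestamp at 20.0 seconds is 1000 steps above TIMESTAMP_BEGIN, so advance_seek(0, TIMESTAMP_BEGIN + 1000) moves the seek forward by 2000 frames.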

Handling Special Cases

  • No timestamp tokens: When without_timestamps=True was set in DecodingOptions, the entire decoded output is treated as one segment spanning the full window.
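  • Consecutive timestamps: Two timestamp tokens in a row mark a segment boundary, the end of one segment followed by the start of the next. A segment that ends up with no text, or whose start equals its end, represents silence or a pause; its text and tokens are cleared rather than contributing text to the output.
  • Single trailing timestamp: If decoding ends with one timestamp that is not followed by a second, that timestamp closes the final segment and the rest of the window is treated as containing no speech, so the seek advances by the full window. Text after the last timestamp pair that never receives a closing timestamp is not emitted; the seek moves to the last paired timestamp so that the next window decodes it again.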
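
These cases can be told apart by looking at which positions in the token list hold timestamp tokens. The sketch below shows one way to do that with plain Python lists; the helper names are hypothetical and TIMESTAMP_BEGIN is an assumed placeholder for tokenizer.timestamp_begin.

```python
from typing import List

TIMESTAMP_BEGIN = 50364  # assumption: placeholder for tokenizer.timestamp_begin


def timestamp_mask(tokens: List[int]) -> List[bool]:
    """True at every position that holds a timestamp token."""
    return [token >= TIMESTAMP_BEGIN for token in tokens]


def has_timestamps(tokens: List[int]) -> bool:
    """False when the decoder produced no timestamp tokens at all."""
    return any(timestamp_mask(tokens))


def ends_with_single_timestamp(tokens: List[int]) -> bool:
    """True when the last token is a timestamp and the one before it is text."""
    return timestamp_mask(tokens)[-2:] == [False, True]


def consecutive_timestamp_indices(tokens: List[int]) -> List[int]:
    """Index of the second token in every pair of adjacent timestamp tokens."""
    mask = timestamp_mask(tokens)
    return [i + 1 for i in range(len(mask) - 1) if mask[i] and mask[i + 1]]
```

Each index returned by consecutive_timestamp_indices is a natural cut point: the tokens before it belong to one segment and the tokens from it onward belong to the next.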

Context in the Pipeline

This parsing logic runs inside the main transcribe() loop, after each call to decode():

  1. transcribe() extracts a 30-second mel window
  2. decode() produces a DecodingResult with tokens
  3. Timestamp parsing (this pattern) converts tokens to segments
  4. Segments are accumulated into the final result list
  5. The seek pointer advances based on the parsed timestamps
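
The five steps can be put together in a toy loop. This is a rough sketch under stated assumptions, not the transcribe() implementation: the decoder is replaced by a caller-supplied function that returns a list of token IDs for a given seek, timestamps are paired in simple alternating order, the metadata fields are omitted, and all names and constant values are illustrative.

```python
from typing import Callable, Dict, List

TIMESTAMP_BEGIN = 50364  # assumption: placeholder for tokenizer.timestamp_begin
TIME_PRECISION = 0.02    # seconds per timestamp step
FRAME_SECONDS = 0.01     # assumption: seconds per mel frame
INPUT_STRIDE = 2         # assumption: mel frames per timestamp step
N_FRAMES = 3000          # assumption: mel frames in one 30-second window


def transcribe_sketch(
    total_frames: int, decode: Callable[[int], List[int]]
) -> List[Dict]:
    """Walk over the audio window by window and collect timed segments."""
    segments: List[Dict] = []
    seek = 0
    while seek < total_frames:
        seek_time = seek * FRAME_SECONDS           # step 1: window offset
        tokens = decode(seek)                      # step 2: stand-in decoder

        # Step 3: pair timestamp positions as (start, end), (start, end), ...
        stamp_idx = [i for i, t in enumerate(tokens) if t >= TIMESTAMP_BEGIN]
        pairs = list(zip(stamp_idx[0::2], stamp_idx[1::2]))
        for a, b in pairs:
            segments.append(                       # step 4: accumulate
                {
                    "seek": seek,
                    "start": seek_time + (tokens[a] - TIMESTAMP_BEGIN) * TIME_PRECISION,
                    "end": seek_time + (tokens[b] - TIMESTAMP_BEGIN) * TIME_PRECISION,
                    "tokens": tokens[a + 1 : b],
                }
            )

        # Step 5: advance to the last paired timestamp, or skip the window.
        last_step = tokens[pairs[-1][1]] - TIMESTAMP_BEGIN if pairs else 0
        if last_step > 0:
            seek += last_step * INPUT_STRIDE
        else:
            seek += N_FRAMES                       # also guards against a zero advance
    return segments
```

Falling back to a full-window advance when no usable timestamp is found keeps the loop from stalling on a window that yields nothing.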

2025-06-25 00:00 GMT
