Principle:Openai Whisper Segment Assembly

From Leeroopedia

Overview

Segment Assembly is the process of parsing decoded tokens into time-aligned text segments. Whisper's decoder produces a stream of tokens that interleaves text tokens with special timestamp tokens. The segment assembly logic interprets this interleaved stream to produce structured segments, each with a start time, end time, and text content.

This is a Pattern Doc — it describes inline logic within the transcribe() function, not a standalone API or algorithm.

Timestamp Token Design

Whisper uses a set of special tokens to encode timing information directly in the decoder's output vocabulary. These timestamp tokens represent discrete time points within the 30-second decoding window:

  • Timestamp tokens have IDs starting from a special timestamp_begin offset in the vocabulary.
  • Each token represents a time value computed as: time = (token_id - timestamp_begin) * 0.02 seconds.
  • With a precision of 0.02 seconds (20 milliseconds), the 30-second window spans 1500 steps, so 1501 timestamp tokens are needed to represent every time from 0.00 to 30.00 seconds inclusive.
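
As a minimal sketch of the formula above, the conversion between timestamp token IDs and times can be written as two small helpers. The value of TIMESTAMP_BEGIN below is an assumption chosen for illustration; the real offset depends on the tokenizer in use and should be read from it. The function names are hypothetical and not part of Whisper.

```python
# Illustrative constants. TIMESTAMP_BEGIN is an assumed value; the real offset
# depends on the tokenizer and should be taken from it.
TIMESTAMP_BEGIN = 50364
TIME_PRECISION = 0.02  # seconds per timestamp step, as stated in the text


def token_to_seconds(token_id: int, timestamp_begin: int = TIMESTAMP_BEGIN) -> float:
    """Map a timestamp token ID to its time within the decoding window."""
    if token_id < timestamp_begin:
        raise ValueError(f"token {token_id} is not a timestamp token")
    # Round to two decimals to hide floating-point noise such as 2.4000000000000004.
    return round((token_id - timestamp_begin) * TIME_PRECISION, 2)


def seconds_to_token(seconds: float, timestamp_begin: int = TIMESTAMP_BEGIN) -> int:
    """Map a time within the window to the nearest timestamp token ID."""
    return timestamp_begin + round(seconds / TIME_PRECISION)


print(token_to_seconds(TIMESTAMP_BEGIN + 120))
print(seconds_to_token(30.0) - TIMESTAMP_BEGIN)
```

The second call shows that the last timestamp token sits 1500 steps after the first, which is why the window needs 1501 distinct timestamp tokens.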

Token Stream Structure

The decoder's output follows a specific pattern of interleaved text and timestamp tokens:

<|0.00|> text tokens <|2.40|> <|2.40|> text tokens <|5.80|> ...

The pattern works as follows:

  • A start timestamp token marks the beginning of a text segment.
  • One or more text tokens form the content of that segment.
  • An end timestamp token marks the end of the segment.
  • The end timestamp of one segment is typically equal to the start timestamp of the next segment.

Parsing Logic

The segment assembly logic scans through the decoded token sequence and identifies timestamp pairs:

  1. Iterate through all decoded tokens.
  2. When a timestamp token is encountered, determine whether it is a start or end timestamp based on its position in the alternating pattern.
  3. When a complete pair (start timestamp, text tokens, end timestamp) is found, create a new segment.
  4. Extract the text by decoding only the text tokens between the timestamp pair.
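
The four steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code inside transcribe(): the function name assemble_segments and the decode_text callback are hypothetical, and the sketch assumes a well-formed alternating stream in which any token ID at or above timestamp_begin is a timestamp token.

```python
from typing import Callable, Dict, List, Optional


def assemble_segments(
    tokens: List[int],
    timestamp_begin: int,
    decode_text: Callable[[List[int]], str],  # hypothetical text decoder
    time_precision: float = 0.02,
) -> List[Dict]:
    """Pair up start/end timestamp tokens and collect the text between them."""
    segments: List[Dict] = []
    start: Optional[float] = None  # relative start time of the open segment
    text_tokens: List[int] = []    # text tokens seen since the start timestamp

    for tok in tokens:
        if tok >= timestamp_begin:
            t = round((tok - timestamp_begin) * time_precision, 2)
            if start is None:
                start = t  # this timestamp opens a segment
            else:
                # This timestamp closes the open segment; skip it if empty.
                if text_tokens:
                    segments.append({
                        "start": start,
                        "end": t,
                        "text": decode_text(text_tokens),
                        "tokens": text_tokens,
                    })
                start, text_tokens = None, []
        elif start is not None:
            text_tokens.append(tok)
        # Text tokens that appear before any start timestamp are ignored here.
    return segments


# Toy stream mirroring <|0.00|> text <|2.40|> <|2.40|> text <|5.80|>
stream = [1000, 1, 2, 1120, 1120, 3, 1290]
result = assemble_segments(stream, 1000, lambda toks: " ".join(map(str, toks)))
for seg in result:
    print(seg["start"], seg["end"], seg["text"])
```

The toy stream uses 1000 as the timestamp offset and small integers as text tokens so that the pairing is easy to follow by eye.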

Absolute Time Computation

Timestamps within the decoded output are relative to the start of the current 30-second window. To compute absolute times in the full audio file:

  • absolute_start = seek_time + relative_start
  • absolute_end = seek_time + relative_end

Where seek_time is the position of the current 30-second window in the overall audio, measured in seconds.
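
Because the seek position is tracked in mel frames while timestamps are in seconds, a unit conversion is needed before the addition. The sketch below assumes 16 kHz audio and a hop length of 160 samples (10 ms per mel frame); these constants and the helper names are assumptions for illustration and should be checked against the audio settings actually in use.

```python
SAMPLE_RATE = 16000  # assumed samples per second
HOP_LENGTH = 160     # assumed samples per mel frame (10 ms at 16 kHz)


def seek_to_seconds(seek_frames: int) -> float:
    """Convert a seek position in mel frames to seconds."""
    return seek_frames * HOP_LENGTH / SAMPLE_RATE


def to_absolute(seek_frames: int, relative_start: float, relative_end: float):
    """Shift window-relative times by the window's offset in the full audio."""
    seek_time = seek_to_seconds(seek_frames)
    return round(seek_time + relative_start, 2), round(seek_time + relative_end, 2)


# A segment at 2.40-5.80 s inside the window that begins at mel frame 3000.
print(to_absolute(3000, 2.4, 5.8))
```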

Segment Dictionary Structure

Each assembled segment is represented as a dictionary with the following keys:

Key                Type       Description
seek               int        Seek position (in mel frames) of the 30-second window
start              float      Absolute start time in seconds
end                float      Absolute end time in seconds
text               str        Decoded text content of the segment
tokens             List[int]  Token IDs for the segment
temperature        float      Temperature used during decoding
avg_logprob        float      Average log probability of the decoded tokens
compression_ratio  float      Compression ratio of the text (higher indicates repetition)
no_speech_prob     float      Probability that the segment contains no speech
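
For readers who prefer a typed view, the keys listed above can be expressed as a TypedDict. The class name Segment is hypothetical (the segments themselves are described here as plain dictionaries), and every value in the example is made up for illustration rather than taken from real model output.

```python
from typing import List, TypedDict


class Segment(TypedDict):
    """Typed view of one assembled segment (hypothetical helper type)."""
    seek: int                 # window position in mel frames
    start: float              # absolute start time in seconds
    end: float                # absolute end time in seconds
    text: str                 # decoded text
    tokens: List[int]         # token IDs for the segment
    temperature: float        # decoding temperature
    avg_logprob: float        # average token log probability
    compression_ratio: float  # higher values suggest repetition
    no_speech_prob: float     # probability of no speech


# Illustrative values only.
example: Segment = {
    "seek": 3000,
    "start": 32.4,
    "end": 35.8,
    "text": " Hello there.",
    "tokens": [101, 102, 103],
    "temperature": 0.0,
    "avg_logprob": -0.25,
    "compression_ratio": 1.1,
    "no_speech_prob": 0.02,
}

print(sorted(example.keys()))
```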

Edge Cases

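  • No timestamp tokens produced: When decoding yields no timestamp tokens (for example with without_timestamps=True), the entire decoded output is treated as a single segment spanning the whole current window, which is 30 seconds long except possibly for the final window of the audio.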
  • Single timestamp only: If decoding produces only a start timestamp with no matching end timestamp, the segment may extend to the end of the window.
  • Empty segments: Segments with no text tokens between timestamp pairs are typically discarded.
  • Timestamp-only output: If the decoder produces timestamps with no text, this usually indicates silence and the seek position advances accordingly.
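
One way to handle these cases is a fallback that runs when no complete timestamp pair is found. The sketch below is an assumption-laden illustration of the behavior described in the bullets, not Whisper's own logic: the function name fallback_segment and its return shape are hypothetical, and the window duration is passed in rather than derived from the audio.

```python
from typing import List, Optional, Tuple


def fallback_segment(
    tokens: List[int],
    timestamp_begin: int,
    window_duration: float = 30.0,
    time_precision: float = 0.02,
) -> Optional[Tuple[float, float, List[int]]]:
    """Build one (start, end, text_tokens) triple for a window with no
    complete timestamp pair, or return None when there is no text."""
    text_tokens = [t for t in tokens if t < timestamp_begin]
    timestamps = [t for t in tokens if t >= timestamp_begin]

    if not text_tokens:
        # Timestamp-only or empty output: nothing to emit, likely silence.
        return None

    # A lone start timestamp sets the start; otherwise start at the window's beginning.
    start = 0.0
    if timestamps:
        start = round((timestamps[0] - timestamp_begin) * time_precision, 2)

    # With no usable end timestamp, the segment extends to the window's end.
    return start, window_duration, text_tokens


print(fallback_segment([1, 2, 3], timestamp_begin=1000))
print(fallback_segment([1050, 1, 2], timestamp_begin=1000))
print(fallback_segment([1000, 1100], timestamp_begin=1000))
```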

Role in the Pipeline

Segment assembly sits between single-segment decoding and the final output:

  1. Mel spectrogram extraction — audio to features
  2. Single segment decoding — features to token stream
  3. Segment assembly — token stream to time-aligned text segments (this principle)
  4. Output formatting — segments to SRT, VTT, JSON, etc.
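
To illustrate how step 4 consumes the assembled segments, the sketch below renders a list of segment dictionaries as SRT text. It is a minimal stand-alone example that relies only on the start, end, and text keys; the helper names are hypothetical and this is not presented as the formatter Whisper ships.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style used by SRT."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


demo = [
    {"start": 0.0, "end": 2.4, "text": " Hello there."},
    {"start": 2.4, "end": 5.8, "text": " General Kenobi."},
]
print(to_srt(demo))
```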

See Also

2025-06-25 00:00 GMT
