Principle: OpenAI Whisper Segment Assembly

Overview

Segment Assembly is the process of parsing decoded tokens into time-aligned text segments. Whisper's decoder produces a stream of tokens that interleaves text tokens with special timestamp tokens. The segment assembly logic interprets this interleaved stream to produce structured segments, each with a start time, end time, and text content.

This is a Pattern Doc — it describes inline logic within the transcribe() function, not a standalone API or algorithm.

Timestamp Token Design

Whisper uses a set of special tokens to encode timing information directly in the decoder's output vocabulary. These timestamp tokens represent discrete time points within the 30-second decoding window:

Timestamp tokens have IDs starting from a special timestamp_begin offset in the vocabulary.
Each token represents a time value computed as: time = (token_id - timestamp_begin) * 0.02 seconds.
With a precision of 0.02 seconds (20 milliseconds), the 30-second window spans 1500 steps, so 1501 timestamp tokens are needed to represent every time from 0.00 to 30.00 seconds inclusive.
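
The mapping between token IDs and times can be sketched in a few lines of Python. The TIMESTAMP_BEGIN value below is only an illustrative placeholder (the real offset depends on the vocabulary and should be read from the tokenizer), and the helper names are hypothetical.

```python
# Illustrative placeholder; in practice this offset comes from the tokenizer.
TIMESTAMP_BEGIN = 50364
TIME_PRECISION = 0.02  # seconds per timestamp step, as described above


def is_timestamp(token_id: int) -> bool:
    """Timestamp tokens are the IDs at or above the timestamp_begin offset."""
    return token_id >= TIMESTAMP_BEGIN


def token_to_seconds(token_id: int) -> float:
    """Convert a timestamp token ID to a time relative to the window start."""
    if not is_timestamp(token_id):
        raise ValueError(f"{token_id} is not a timestamp token")
    return (token_id - TIMESTAMP_BEGIN) * TIME_PRECISION


def seconds_to_token(seconds: float) -> int:
    """Inverse mapping, rounding to the nearest 20 ms step."""
    return TIMESTAMP_BEGIN + round(seconds / TIME_PRECISION)
```

Under these assumptions, the token at offset 120 corresponds to 2.40 seconds and the token at offset 1500 to the end of the 30-second window.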

Token Stream Structure

The decoder's output follows a specific pattern of interleaved text and timestamp tokens:

<|0.00|> text tokens <|2.40|> <|2.40|> text tokens <|5.80|> ...

The pattern works as follows:

A start timestamp token marks the beginning of a text segment.
One or more text tokens form the content of that segment.
An end timestamp token marks the end of the segment.
The end timestamp of one segment is typically equal to the start timestamp of the next segment.
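
To make the notation concrete, the following sketch renders a list of token IDs in the <|t.tt|> form used above. The toy vocabulary, the TIMESTAMP_BEGIN value, and the function name render_stream are assumptions made up for this illustration; a real tokenizer would supply the text pieces.

```python
from typing import Dict, List

TIMESTAMP_BEGIN = 50364  # illustrative placeholder, not read from a real tokenizer
TIME_PRECISION = 0.02    # seconds per timestamp step


def render_stream(tokens: List[int], vocab: Dict[int, str]) -> str:
    """Render text tokens via a lookup table and timestamp tokens as <|t.tt|>."""
    parts = []
    for t in tokens:
        if t >= TIMESTAMP_BEGIN:
            seconds = (t - TIMESTAMP_BEGIN) * TIME_PRECISION
            parts.append(f"<|{seconds:.2f}|>")
        else:
            parts.append(vocab[t])
    return "".join(parts)


# Toy vocabulary and stream, invented for the example.
toy_vocab = {1: " Hello", 2: " world", 3: " Again"}
stream = [
    TIMESTAMP_BEGIN, 1, 2, TIMESTAMP_BEGIN + 120,   # first segment: 0.00 to 2.40
    TIMESTAMP_BEGIN + 120, 3, TIMESTAMP_BEGIN + 290,  # second segment: 2.40 to 5.80
]
rendered = render_stream(stream, toy_vocab)
```

Note how the end timestamp of the first segment and the start timestamp of the second appear back to back with the same value.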

Parsing Logic

The segment assembly logic scans through the decoded token sequence and identifies timestamp pairs:

Iterate through all decoded tokens.
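When a timestamp token is encountered, classify it by its position in the alternating pattern: the first timestamp opens a segment, the next one closes it, and two timestamps in a row mark the boundary where one segment ends and the next begins.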
When a complete pair (start timestamp, text tokens, end timestamp) is found, create a new segment.
Extract the text by decoding only the text tokens between the timestamp pair.
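
One way to implement this scan, sketched in plain Python, is to cut the stream wherever two timestamp tokens appear back to back (the end of one segment followed by the start of the next). This is a simplified illustration, not the code inside transcribe(): the function name, the tuple layout, and the TIMESTAMP_BEGIN value are assumptions, and the sketch presumes the stream starts with a timestamp token.

```python
from typing import List, Tuple

TIMESTAMP_BEGIN = 50364  # illustrative placeholder
TIME_PRECISION = 0.02    # seconds per timestamp step


def split_segments(tokens: List[int]) -> List[Tuple[float, float, List[int]]]:
    """Return (relative_start, relative_end, text_token_ids) for each segment."""
    is_ts = [t >= TIMESTAMP_BEGIN for t in tokens]
    # Cut points: index of the second token in each back-to-back timestamp pair.
    cuts = [i + 1 for i in range(len(tokens) - 1) if is_ts[i] and is_ts[i + 1]]
    # A lone trailing timestamp closes the final segment.
    if len(tokens) >= 2 and is_ts[-1] and not is_ts[-2]:
        cuts.append(len(tokens))

    segments = []
    last = 0
    for cut in cuts:
        chunk = tokens[last:cut]
        # First and last tokens of the chunk are its start and end timestamps.
        start = (chunk[0] - TIMESTAMP_BEGIN) * TIME_PRECISION
        end = (chunk[-1] - TIMESTAMP_BEGIN) * TIME_PRECISION
        text_ids = [t for t in chunk if t < TIMESTAMP_BEGIN]
        segments.append((start, end, text_ids))
        last = cut
    return segments
```

The text of each segment would then be obtained by passing its text_token_ids to the tokenizer's decode step. Returning relative times keeps this function independent of where the window sits in the audio.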

Absolute Time Computation

Timestamps within the decoded output are relative to the start of the current 30-second window. To compute absolute times in the full audio file:

absolute_start = seek_time + relative_start
absolute_end = seek_time + relative_end

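Here seek_time is the position of the current 30-second window in the overall audio, measured in seconds. Because the seek position itself is tracked in mel frames, it must be converted to seconds before it is added to the relative timestamps.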
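
A minimal sketch of this offset arithmetic follows. It assumes the seek position is counted in mel frames at 16 kHz audio with a hop of 160 samples (10 ms per frame); the constant and function names are chosen for this illustration.

```python
from typing import Tuple

SAMPLE_RATE = 16000  # assumed audio sample rate in Hz
HOP_LENGTH = 160     # assumed samples per mel frame (10 ms per frame)


def seek_to_seconds(seek_frames: int) -> float:
    """Convert a seek position in mel frames to seconds."""
    return seek_frames * HOP_LENGTH / SAMPLE_RATE


def to_absolute(
    seek_frames: int, relative_start: float, relative_end: float
) -> Tuple[float, float]:
    """Shift window-relative timestamps by the window's position in the audio."""
    seek_time = seek_to_seconds(seek_frames)
    return seek_time + relative_start, seek_time + relative_end
```

For example, under these assumptions a window starting at frame 3000 begins 30 seconds into the audio, so a segment at 2.40 to 5.80 seconds within that window maps to roughly 32.40 to 35.80 seconds overall.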

Segment Dictionary Structure

Each assembled segment is represented as a dictionary with the following keys:

Key	Type	Description
`seek`	`int`	Seek position (in mel frames) of the 30-second window
`start`	`float`	Absolute start time in seconds
`end`	`float`	Absolute end time in seconds
`text`	`str`	Decoded text content of the segment
`tokens`	`List[int]`	Token IDs for the segment
`temperature`	`float`	Temperature used during decoding
`avg_logprob`	`float`	Average log probability of the decoded tokens
`compression_ratio`	`float`	Compression ratio of the text (higher indicates repetition)
`no_speech_prob`	`float`	Probability that the segment contains no speech
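
The table can be expressed as a typed dictionary, which is useful for static checking of code that consumes segments. The class name Segment and all values in the example are invented for illustration; only the keys and types come from the table above.

```python
from typing import List, TypedDict


class Segment(TypedDict):
    seek: int                 # seek position of the window, in mel frames
    start: float              # absolute start time in seconds
    end: float                # absolute end time in seconds
    text: str                 # decoded text content
    tokens: List[int]         # token IDs for the segment
    temperature: float        # temperature used during decoding
    avg_logprob: float        # average log probability of the decoded tokens
    compression_ratio: float  # higher values indicate repetition
    no_speech_prob: float     # probability that the segment contains no speech


# Made-up values, shown only to illustrate the shape of one segment.
example: Segment = {
    "seek": 3000,
    "start": 32.4,
    "end": 35.8,
    "text": " Again",
    "tokens": [50484, 3, 50654],
    "temperature": 0.0,
    "avg_logprob": -0.25,
    "compression_ratio": 1.1,
    "no_speech_prob": 0.02,
}
```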

Edge Cases

No timestamp tokens produced: When without_timestamps=True, the entire decoded output is treated as a single segment spanning the full 30-second window.
Single timestamp only: If decoding produces only a start timestamp with no matching end timestamp, the segment may extend to the end of the window.
Empty segments: Segments with no text tokens between timestamp pairs are typically discarded.
Timestamp-only output: If the decoder produces timestamps with no text, this usually indicates silence and the seek position advances accordingly.
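
Two of these cases can be sketched as small helpers: a fallback that treats the whole window as one segment when no timestamp pair is available, and a filter that discards segments without text tokens. The names, the tuple layout, and the TIMESTAMP_BEGIN value are assumptions for illustration, and the behaviour simply follows the descriptions above rather than reproducing transcribe().

```python
from typing import List, Tuple

TIMESTAMP_BEGIN = 50364  # illustrative placeholder
WINDOW_SECONDS = 30.0    # length of the decoding window

# (relative_start, relative_end, text token IDs)
SegmentTuple = Tuple[float, float, List[int]]


def whole_window_segment(tokens: List[int]) -> SegmentTuple:
    """Fallback when no timestamp pair exists: one segment over the full window."""
    text_ids = [t for t in tokens if t < TIMESTAMP_BEGIN]
    return (0.0, WINDOW_SECONDS, text_ids)


def drop_empty(segments: List[SegmentTuple]) -> List[SegmentTuple]:
    """Discard segments that carry no text tokens."""
    return [seg for seg in segments if seg[2]]
```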

Role in the Pipeline

Segment assembly sits between single-segment decoding and the final output:

Mel spectrogram extraction — audio to features
Single segment decoding — features to token stream
Segment assembly — token stream to time-aligned text segments (this principle)
Output formatting — segments to SRT, VTT, JSON, etc.
