Principle:Openai Whisper Segment Assembly
Overview
Segment Assembly is the process of parsing decoded tokens into time-aligned text segments. Whisper's decoder produces a stream of tokens that interleaves text tokens with special timestamp tokens. The segment assembly logic interprets this interleaved stream to produce structured segments, each with a start time, end time, and text content.
This is a Pattern Doc — it describes inline logic within the transcribe() function, not a standalone API or algorithm.
Timestamp Token Design
Whisper uses a set of special tokens to encode timing information directly in the decoder's output vocabulary. These timestamp tokens represent discrete time points within the 30-second decoding window:
- Timestamp tokens have IDs starting from a special timestamp_begin offset in the vocabulary.
- Each token represents a time value computed as: time = (token_id - timestamp_begin) * 0.02 seconds.
- With a precision of 0.02 seconds (20 milliseconds), the 30-second window is covered by 1501 timestamp tokens (representing times from 0.00 to 30.00 seconds, both endpoints included).
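The conversion above can be sketched in a few lines. This is a minimal illustration, not code from Whisper itself: the helper names are hypothetical, and the value chosen for TIMESTAMP_BEGIN is an assumed placeholder, since the real offset depends on the tokenizer in use.

```python
# Assumed, illustrative offset; the real value comes from the tokenizer.
TIMESTAMP_BEGIN = 50364
TIME_PRECISION = 0.02  # seconds per timestamp step, as stated above


def is_timestamp(token_id: int, timestamp_begin: int = TIMESTAMP_BEGIN) -> bool:
    """Return True if the token ID lies in the timestamp range."""
    return token_id >= timestamp_begin


def token_to_seconds(token_id: int, timestamp_begin: int = TIMESTAMP_BEGIN) -> float:
    """Map a timestamp token ID to a time relative to the window start."""
    if not is_timestamp(token_id, timestamp_begin):
        raise ValueError(f"{token_id} is not a timestamp token")
    return (token_id - timestamp_begin) * TIME_PRECISION


def seconds_to_token(seconds: float, timestamp_begin: int = TIMESTAMP_BEGIN) -> int:
    """Inverse mapping: nearest timestamp token ID for a relative time."""
    return timestamp_begin + round(seconds / TIME_PRECISION)
```

With these helpers, the token at offset 120 from timestamp_begin corresponds to roughly 2.40 seconds, and offset 1500 corresponds to the end of the 30-second window.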
Token Stream Structure
The decoder's output follows a specific pattern of interleaved text and timestamp tokens:
<|0.00|> text tokens <|2.40|> <|2.40|> text tokens <|5.80|> ...
The pattern works as follows:
- A start timestamp token marks the beginning of a text segment.
- One or more text tokens form the content of that segment.
- An end timestamp token marks the end of the segment.
- The end timestamp of one segment is typically equal to the start timestamp of the next segment.
Parsing Logic
The segment assembly logic scans through the decoded token sequence and identifies timestamp pairs:
- Iterate through all decoded tokens.
- When a timestamp token is encountered, determine whether it is a start or end timestamp based on its position in the alternating pattern.
- When a complete pair (start timestamp, text tokens, end timestamp) is found, create a new segment.
- Extract the text by decoding only the text tokens between the timestamp pair.
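The scan described above can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not the inline code in transcribe(): the function name assemble_segments, the decode callback, the toy vocabulary, and the TIMESTAMP_BEGIN value are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

TIMESTAMP_BEGIN = 50364  # assumed, illustrative offset
TIME_PRECISION = 0.02


def assemble_segments(
    tokens: List[int],
    decode: Callable[[List[int]], str],
    timestamp_begin: int = TIMESTAMP_BEGIN,
) -> List[Dict]:
    """Group text tokens between a start and an end timestamp into segments."""
    segments: List[Dict] = []
    start: Optional[float] = None  # relative start time of the open segment
    text_tokens: List[int] = []

    for tok in tokens:
        if tok >= timestamp_begin:
            t = (tok - timestamp_begin) * TIME_PRECISION
            if start is not None and text_tokens:
                # End timestamp: close the open segment.
                segments.append(
                    {
                        "start": start,
                        "end": t,
                        "text": decode(text_tokens),
                        "tokens": list(text_tokens),
                    }
                )
                start, text_tokens = None, []
            else:
                # Start timestamp (or a pair with no text in between).
                start = t
        elif start is not None:
            text_tokens.append(tok)
    return segments


# Toy vocabulary standing in for a real tokenizer's decode method.
vocab = {1: " Hello", 2: " world", 3: " Bye"}
decode = lambda ids: "".join(vocab[i] for i in ids)

stream = [TIMESTAMP_BEGIN, 1, 2, TIMESTAMP_BEGIN + 120,
          TIMESTAMP_BEGIN + 120, 3, TIMESTAMP_BEGIN + 290]
segments = assemble_segments(stream, decode)
```

In this sketch a timestamp pair with no text in between simply moves the start time forward, so empty segments never reach the output list.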
Absolute Time Computation
Timestamps within the decoded output are relative to the start of the current 30-second window. To compute absolute times in the full audio file:
- absolute_start = seek_time + relative_start
- absolute_end = seek_time + relative_end
Where seek_time is the position of the current 30-second window in the overall audio, measured in seconds.
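As a rough sketch of this offset arithmetic: because the seek position is tracked in mel frames, it must first be converted to seconds. The helper names below are hypothetical. The constants mirror the 16 kHz sample rate and 160-sample hop of Whisper's audio front end, which gives 0.01 seconds per mel frame.

```python
from typing import Tuple

SAMPLE_RATE = 16000  # audio samples per second
HOP_LENGTH = 160     # audio samples per mel frame


def seek_to_seconds(seek_frames: int) -> float:
    """Convert a seek position in mel frames to seconds."""
    return seek_frames * HOP_LENGTH / SAMPLE_RATE


def to_absolute(seek_frames: int, relative_start: float,
                relative_end: float) -> Tuple[float, float]:
    """Shift window-relative times by the window's position in the audio."""
    seek_time = seek_to_seconds(seek_frames)
    return seek_time + relative_start, seek_time + relative_end


# A window starting at mel frame 3000 begins 30 seconds into the audio.
absolute_start, absolute_end = to_absolute(3000, 2.4, 5.8)
```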
Segment Dictionary Structure
Each assembled segment is represented as a dictionary with the following keys:
| Key | Type | Description |
|---|---|---|
| seek | int | Seek position (in mel frames) of the 30-second window |
| start | float | Absolute start time in seconds |
| end | float | Absolute end time in seconds |
| text | str | Decoded text content of the segment |
| tokens | List[int] | Token IDs for the segment |
| temperature | float | Temperature used during decoding |
| avg_logprob | float | Average log probability of the decoded tokens |
| compression_ratio | float | Compression ratio of the text (higher indicates repetition) |
| no_speech_prob | float | Probability that the segment contains no speech |
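For illustration, the keys above can be written down as a typed dictionary. The Segment class and the sample values are hypothetical and not taken from any real transcription. The compression_ratio helper shows one common way to compute such a ratio (UTF-8 byte length divided by zlib-compressed length) and is an assumption about the metric, included only to show why repetitive text scores higher.

```python
import zlib
from typing import List, TypedDict


class Segment(TypedDict):
    seek: int
    start: float
    end: float
    text: str
    tokens: List[int]
    temperature: float
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float


def compression_ratio(text: str) -> float:
    """Raw byte length over compressed length; repetition compresses well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


# Made-up sample values, for shape only.
example: Segment = {
    "seek": 3000,
    "start": 32.4,
    "end": 35.8,
    "text": " Hello world",
    "tokens": [1, 2],
    "temperature": 0.0,
    "avg_logprob": -0.25,
    "compression_ratio": compression_ratio(" Hello world"),
    "no_speech_prob": 0.01,
}
```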
Edge Cases
- No timestamp tokens produced: When without_timestamps=True, the entire decoded output is treated as a single segment spanning the full 30-second window.
- Single timestamp only: If decoding produces only a start timestamp with no matching end timestamp, the segment may extend to the end of the window.
- Empty segments: Segments with no text tokens between timestamp pairs are typically discarded.
- Timestamp-only output: If the decoder produces timestamps with no text, this usually indicates silence and the seek position advances accordingly.
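The edge cases above could be handled by a pre-check that runs before the pair-based scan. The sketch below is one possible arrangement, not the behavior of transcribe() itself: the function name handle_edge_cases, the TIMESTAMP_BEGIN value, and the choice to return None for the normal case are all assumptions.

```python
from typing import Dict, List, Optional

TIMESTAMP_BEGIN = 50364  # assumed, illustrative offset
TIME_PRECISION = 0.02
WINDOW_SECONDS = 30.0


def handle_edge_cases(
    tokens: List[int],
    timestamp_begin: int = TIMESTAMP_BEGIN,
    window_seconds: float = WINDOW_SECONDS,
) -> Optional[List[Dict]]:
    """Return segments for the edge cases, or None for the normal case."""
    timestamps = [t for t in tokens if t >= timestamp_begin]
    text_tokens = [t for t in tokens if t < timestamp_begin]

    if not text_tokens:
        # Timestamp-only (or empty) output: treat as silence, emit nothing.
        return []
    if not timestamps:
        # No timestamps at all: one segment spanning the whole window.
        return [{"start": 0.0, "end": window_seconds, "tokens": text_tokens}]
    if len(timestamps) == 1:
        # Lone start timestamp: extend the segment to the end of the window.
        # Simplification: assumes the text follows the timestamp.
        start = (timestamps[0] - timestamp_begin) * TIME_PRECISION
        return [{"start": start, "end": window_seconds, "tokens": text_tokens}]
    return None  # two or more timestamps: defer to pair-based parsing
```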
Role in the Pipeline
Segment assembly sits between single-segment decoding and the final output:
- Mel spectrogram extraction — audio to features
- Single segment decoding — features to token stream
- Segment assembly — token stream to time-aligned text segments (this principle)
- Output formatting — segments to SRT, VTT, JSON, etc.