Implementation:Openai Whisper Transcribe Timestamp Parsing
Overview
This is a Pattern Doc. The timestamp parsing logic is not a standalone function or class but rather inline code within the transcribe() function at whisper/transcribe.py:L339-399. It parses interleaved text and timestamp tokens from the decoder output into structured, time-aligned text segments.
Source
- File: whisper/transcribe.py:L339-399 (timestamp parsing), whisper/transcribe.py:L246-261 (new_segment helper)
- Repository: https://github.com/openai/whisper
Pattern Interface
Input
- Decoded tokens from DecodingTask, containing interleaved text tokens and timestamp tokens
- Current seek position (in mel spectrogram frames)
Output
List[dict]: a list of segment dictionaries, each with the keys seek, start, end, text, tokens, temperature, avg_logprob, compression_ratio, and no_speech_prob.
Timestamp Token Identification
Timestamp tokens are identified by their token ID. Any token with an ID greater than or equal to tokenizer.timestamp_begin is a timestamp token. The actual time value is computed as:
time = (token_id - tokenizer.timestamp_begin) * time_precision
where time_precision = 0.02 seconds: each timestamp step covers 20 milliseconds, which is two 10 ms mel spectrogram frames.
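As a minimal sketch of the conversion above, the following standalone Python applies the same formula without loading a model. The helper names and the value of EXAMPLE_TIMESTAMP_BEGIN are hypothetical and chosen only for illustration; in practice the threshold comes from tokenizer.timestamp_begin.

```python
TIME_PRECISION = 0.02  # seconds per timestamp step, as stated above


def is_timestamp(token_id: int, timestamp_begin: int) -> bool:
    """Return True when the ID falls in the timestamp range of the vocabulary."""
    return token_id >= timestamp_begin


def token_to_seconds(token_id: int, timestamp_begin: int,
                     time_precision: float = TIME_PRECISION) -> float:
    """Convert a timestamp token ID to seconds relative to the window start."""
    if not is_timestamp(token_id, timestamp_begin):
        raise ValueError(f"token {token_id} is not a timestamp token")
    return (token_id - timestamp_begin) * time_precision


# Hypothetical value for illustration; read the real one from
# tokenizer.timestamp_begin.
EXAMPLE_TIMESTAMP_BEGIN = 50_000

# 150 steps of 20 ms each is about 3 seconds into the window.
print(token_to_seconds(EXAMPLE_TIMESTAMP_BEGIN + 150, EXAMPLE_TIMESTAMP_BEGIN))
```

The result is relative to the start of the current 30-second window; the window's own offset still has to be added to obtain an absolute time.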
The new_segment Helper
The new_segment function (defined at L246-261) creates a segment dictionary from the parsed components:
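def new_segment(*, start, end, tokens, result):
    # Closure inside transcribe(): seek and tokenizer come from the enclosing scope.
    tokens = tokens.tolist()  # tokens arrives as a tensor of token IDs
    # Decode only ordinary text tokens; IDs at or above tokenizer.eot are special
    text_tokens = [token for token in tokens if token < tokenizer.eot]
    return {
        "seek": seek,                      # current seek position
        "start": start,                    # absolute start time in seconds
        "end": end,                        # absolute end time in seconds
        "text": tokenizer.decode(text_tokens),
        "tokens": tokens,                  # list of token IDs
        "temperature": result.temperature,
        "avg_logprob": result.avg_logprob,
        "compression_ratio": result.compression_ratio,
        "no_speech_prob": result.no_speech_prob,
    }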
This helper bundles the segment's text content with its timing information and decoding metadata (temperature, log probability, compression ratio, no-speech probability) into a single dictionary.
Parsing Algorithm
The timestamp parsing logic scans through the decoded token sequence looking for timestamp token pairs that bracket text segments:
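# Inside transcribe() loop: simplified pseudocode, not the literal source
current_start = None  # None means the next timestamp opens a segment
for token in decoded_tokens:
    if token >= tokenizer.timestamp_begin:
        timestamp = (token - tokenizer.timestamp_begin) * time_precision
        if current_start is None:
            # First timestamp of a pair marks the segment start
            current_start = timestamp
        else:
            # Second timestamp of the pair closes the segment
            current_end = timestamp
            segments.append(new_segment(start=current_start, end=current_end, ...))
            current_start = None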
Detailed Steps
- Scan through the decoded token list from left to right.
- Identify timestamp tokens by checking if token_id >= tokenizer.timestamp_begin.
- Determine role: Based on position in the alternating pattern, determine whether a timestamp token marks a segment start or end.
- Extract text tokens: Collect all non-timestamp tokens between a start/end timestamp pair.
- Compute absolute times: Add the seek-based offset to convert window-relative timestamps to absolute audio timestamps:
  - start = seek_time + relative_start
  - end = seek_time + relative_end
- Create segment: Call new_segment() with the computed start, end, tokens, and the DecodingResult metadata.
- Append the segment to the output list.
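The steps above can be sketched as a small standalone function. This is a simplified illustration in plain Python under stated assumptions, not the code in whisper/transcribe.py: parse_segments and its parameters are hypothetical names, tokens are plain integers rather than tensors, and the returned dictionaries carry only start, end, and tokens instead of the full new_segment() output.

```python
from typing import Dict, List, Optional


def parse_segments(tokens: List[int], timestamp_begin: int,
                   time_offset: float = 0.0,
                   time_precision: float = 0.02) -> List[Dict]:
    """Pair up timestamp tokens and collect the text tokens between them."""
    segments: List[Dict] = []
    start: Optional[float] = None   # window-relative start of the open segment
    text_tokens: List[int] = []     # text tokens seen since the last boundary

    for token in tokens:
        if token >= timestamp_begin:
            relative = (token - timestamp_begin) * time_precision
            if start is None:
                # First timestamp of a pair: open a segment.
                start = relative
            else:
                # Second timestamp of a pair: close the segment.
                if text_tokens:  # skip pairs that bracket no text
                    segments.append({
                        "start": time_offset + start,
                        "end": time_offset + relative,
                        "tokens": text_tokens,
                    })
                start, text_tokens = None, []
        else:
            text_tokens.append(token)
    return segments


# Hypothetical vocabulary where IDs >= 1000 are timestamps.
TB = 1000
example = [TB + 0, 1, 2, TB + 100, TB + 100, 3, TB + 250]
for segment in parse_segments(example, TB, time_offset=30.0):
    print(segment)
```

In this sketch a trailing start timestamp without a matching end is simply dropped; how the real loop treats that case is covered under the special cases.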
Seek Advancement
After parsing all timestamp tokens from a decoded window, the seek pointer advances to the position of the last timestamp found. This determines where the next 30-second window starts. The seek position counts mel spectrogram frames of 10 ms each, so one 20 ms timestamp step equals two frames:
- If the last timestamp is at 20.0 seconds within the window (timestamp step 1000), the seek advances by 20.0 seconds' worth of frames (2000 frames).
- If no timestamps were produced, the seek advances by the full 30-second window (3000 frames).
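The arithmetic behind these two cases can be checked with a short sketch. It assumes the seek pointer counts 10 ms mel frames, as stated under Input, and that the constants match Whisper's audio settings (16 kHz audio, a hop of 160 samples, 3000 mel frames per window). Under those assumptions a timestamp at 20.0 seconds is timestamp step 1000, which is 2000 mel frames, and a full window is 1500 timestamp steps, which is 3000 mel frames. The function advance_seek is a hypothetical name used only here.

```python
from typing import Optional

SAMPLE_RATE = 16000   # audio samples per second (assumed)
HOP_LENGTH = 160      # samples per mel frame, i.e. 10 ms (assumed)
N_FRAMES = 3000       # mel frames in one 30-second window (assumed)
INPUT_STRIDE = 2      # mel frames per timestamp step (assumed)
TIME_PRECISION = INPUT_STRIDE * HOP_LENGTH / SAMPLE_RATE  # 0.02 seconds


def advance_seek(seek: int, last_timestamp_pos: Optional[int]) -> int:
    """Return the next seek position, measured in mel frames.

    last_timestamp_pos is the last timestamp expressed in 20 ms steps
    (token_id - timestamp_begin), or None when no timestamp was produced.
    """
    if last_timestamp_pos is None:
        return seek + N_FRAMES                      # skip the whole window
    return seek + last_timestamp_pos * INPUT_STRIDE  # steps -> mel frames


last_pos = round(20.0 / TIME_PRECISION)   # timestamp step for 20.0 seconds
print(last_pos, advance_seek(0, last_pos), advance_seek(0, None))
```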
Handling Special Cases
- No timestamp tokens: When without_timestamps=True is set in DecodingOptions, the entire decoded output is treated as one segment spanning the full window.
- Consecutive timestamps with no text: These represent silence or pauses and advance the seek without creating a text segment.
- Single unpaired timestamp: If decoding ends with only a start timestamp and no matching end, the segment extends to the end of the decoded content.
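These cases can be told apart by looking only at which positions in the token list hold timestamp tokens. The sketch below shows one way to detect them in plain Python; the function names are hypothetical, tokens are plain integers, and it illustrates the checks rather than reproducing the logic in transcribe().

```python
from typing import List


def timestamp_mask(tokens: List[int], timestamp_begin: int) -> List[bool]:
    """Mark each position that holds a timestamp token."""
    return [token >= timestamp_begin for token in tokens]


def has_no_timestamps(tokens: List[int], timestamp_begin: int) -> bool:
    """True when the output contains text tokens only."""
    return not any(timestamp_mask(tokens, timestamp_begin))


def ends_with_single_timestamp(tokens: List[int], timestamp_begin: int) -> bool:
    """True when the last token is a timestamp and the one before it is not."""
    return timestamp_mask(tokens, timestamp_begin)[-2:] == [False, True]


def consecutive_timestamp_boundaries(tokens: List[int],
                                     timestamp_begin: int) -> List[int]:
    """Indices of the second token in every pair of adjacent timestamps."""
    mask = timestamp_mask(tokens, timestamp_begin)
    return [i + 1 for i in range(len(mask) - 1) if mask[i] and mask[i + 1]]


# Hypothetical vocabulary where IDs >= 1000 are timestamps.
TB = 1000
sample = [TB, 1, 2, TB + 100, TB + 100, 3, TB + 250]
print(has_no_timestamps(sample, TB))
print(ends_with_single_timestamp(sample, TB))
print(consecutive_timestamp_boundaries(sample, TB))
```

Keeping the checks as separate predicates makes it easy to test each special case on its own before combining them into a parsing routine.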
Context in the Pipeline
This parsing logic runs inside the main transcribe() loop, after each call to decode():
- transcribe() extracts a 30-second mel window
- decode() produces a DecodingResult with tokens
- Segments are accumulated into the final result list
- The seek pointer advances based on the parsed timestamps
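The control flow of this loop can be sketched with the model-dependent parts replaced by callables. Everything here is an assumption made for illustration: run_windows, decode_window, and parse_window are hypothetical names, the seek pointer is assumed to count 10 ms mel frames, and the stubs in the usage example stand in for the real decode() call and timestamp parsing.

```python
from typing import Callable, Dict, List, Optional, Tuple

N_FRAMES = 3000     # mel frames in one 30-second window (assumed)
INPUT_STRIDE = 2    # mel frames per 20 ms timestamp step (assumed)

ParseResult = Tuple[List[Dict], Optional[int]]


def run_windows(total_frames: int,
                decode_window: Callable[[int], List[int]],
                parse_window: Callable[[List[int], int], ParseResult]) -> List[Dict]:
    """Drive the decode -> parse -> accumulate -> advance cycle."""
    all_segments: List[Dict] = []
    seek = 0
    while seek < total_frames:
        tokens = decode_window(seek)                    # window + decode
        segments, last_pos = parse_window(tokens, seek)  # timestamp parsing
        all_segments.extend(segments)                   # accumulate results
        if last_pos:
            seek += last_pos * INPUT_STRIDE   # resume at the last timestamp
        else:
            seek += N_FRAMES                  # None or 0: skip a full window
    return all_segments


# Usage with stand-in callables (no model involved).
def fake_decode(seek: int) -> List[int]:
    return [seek]


def fake_parse(tokens: List[int], seek: int) -> ParseResult:
    # Pretend the first window ends at step 500 (10 s); later ones have none.
    return [{"seek": seek}], (500 if seek == 0 else None)


print(run_windows(2500, fake_decode, fake_parse))
```

Treating a last position of 0 the same as None guarantees the loop always makes progress, which is a design choice of this sketch rather than a statement about the original code.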