Implementation:Openai Whisper Transcribe

Overview

transcribe() is the high-level function that performs end-to-end audio transcription. It orchestrates the entire Whisper pipeline: audio loading, mel spectrogram computation, sliding window decoding with temperature fallback, segment assembly, and optional word-level timestamp extraction. This is the primary user-facing API for Whisper.

Source

File: whisper/transcribe.py:L38-514
Import: from whisper import transcribe; also callable as the bound method model.transcribe(audio)
Repository: https://github.com/openai/whisper

Signature

def transcribe(
    model: "Whisper",
    audio: Union[str, np.ndarray, torch.Tensor],
    *,
    verbose: Optional[bool] = None,
    temperature: Union[float, Tuple[float, ...]] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
    condition_on_previous_text: bool = True,
    initial_prompt: Optional[str] = None,
    carry_initial_prompt: bool = False,
    word_timestamps: bool = False,
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,，!！?？:：”)]}、",
    clip_timestamps: Union[str, List[float]] = "0",
    hallucination_silence_threshold: Optional[float] = None,
    **decode_options,
) -> dict:

Parameters

Parameter	Type	Default	Description
`model`	`Whisper`	(required)	Loaded Whisper model instance
`audio`	`Union[str, np.ndarray, torch.Tensor]`	(required)	File path, NumPy array, or PyTorch tensor of audio waveform
`verbose`	`Optional[bool]`	`None`	None: no output; True: print each segment; False: print progress bar
`temperature`	`Union[float, Tuple[float, ...]]`	`(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)`	Temperature(s) for fallback strategy. Try each in order on failure.
`compression_ratio_threshold`	`Optional[float]`	`2.4`	Above this threshold, decoding is considered failed (repetitive text).
`logprob_threshold`	`Optional[float]`	`-1.0`	Below this threshold, decoding is considered failed (low confidence).
`no_speech_threshold`	`Optional[float]`	`0.6`	If the no-speech probability is above this threshold and the average log probability is below `logprob_threshold`, the segment is treated as silence.
`condition_on_previous_text`	`bool`	`True`	Use previous segment's output as prompt for the next segment.
`initial_prompt`	`Optional[str]`	`None`	User-provided text to condition the first segment.
`carry_initial_prompt`	`bool`	`False`	If True, prepend `initial_prompt` to every segment's context.
`word_timestamps`	`bool`	`False`	Enable word-level timestamps via cross-attention DTW.
`prepend_punctuations`	`str`	`"\"'“¿([{-"`	Punctuation merged with the following word for timing.
`append_punctuations`	`str`	`"\"'.。,，!！?？:：”)]}、"`	Punctuation merged with the preceding word for timing.
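`clip_timestamps`	`Union[str, List[float]]`	`"0"`	Start and end times, in seconds, of the clips to process, given as `start,end,start,end,...` (comma-separated string or list). If the final end time is omitted, the last clip runs to the end of the audio.
`hallucination_silence_threshold`	`Optional[float]`	`None`	When `word_timestamps` is True, skip silent periods longer than this many seconds when a possible hallucination is detected.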
`**decode_options`			Additional keyword arguments passed to `DecodingOptions` (e.g., `task`, `language`, `beam_size`).
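
To illustrate how `clip_timestamps` pairs up start and end times, the following is a minimal sketch that parses the value into (start, end) ranges in seconds. The helper name `parse_clip_ranges` and its `audio_duration` argument are hypothetical and not part of Whisper; the sketch only approximates the described behavior and works in seconds for readability.

```python
from typing import List, Tuple, Union


def parse_clip_ranges(
    clip_timestamps: Union[str, List[float]],
    audio_duration: float,
) -> List[Tuple[float, float]]:
    """Turn "start,end,start,end,..." into (start, end) pairs in seconds.

    Hypothetical helper for illustration only. Assumes an odd number of
    values means the final clip runs to the end of the audio.
    """
    if isinstance(clip_timestamps, str):
        parts = clip_timestamps.split(",") if clip_timestamps else []
        points = [float(ts) for ts in parts]
    else:
        points = [float(ts) for ts in clip_timestamps]

    if not points:
        points.append(0.0)  # nothing given: start from the beginning
    if len(points) % 2 == 1:
        points.append(audio_duration)  # open-ended last clip

    # Pair even-indexed starts with odd-indexed ends.
    return list(zip(points[::2], points[1::2]))


print(parse_clip_ranges("0", 60.0))        # [(0.0, 60.0)]
print(parse_clip_ranges("5,10,20", 60.0))  # [(5.0, 10.0), (20.0, 60.0)]
```

Under these assumptions, the default value `"0"` covers the whole file, which matches the behavior of transcribing everything when no clips are requested.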

Inputs and Outputs

Inputs

Audio: File path (str), or a raw waveform sampled at 16 kHz as a NumPy array or PyTorch tensor
Model: A loaded Whisper model instance

Outputs

A dictionary with three keys:

Key	Type	Description
`"text"`	`str`	The full transcript as a single concatenated string
`"segments"`	`List[dict]`	List of segment dictionaries with timing and metadata
`"language"`	`str`	The detected or specified language code

Each segment dictionary contains: id, seek, start, end, text, tokens, temperature, avg_logprob, compression_ratio, no_speech_prob. When word_timestamps=True, each segment also contains a "words" list with per-word start, end, word, and probability.
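
As a sketch of how the returned dictionary might be consumed, the snippet below formats each segment as a timestamped line. The `sample_result` dictionary is hand-written with made-up values so the example runs without a model, and `format_timestamp` and `summarize` are hypothetical helpers, not Whisper functions.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.mmm (hypothetical helper)."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def summarize(result: dict) -> list:
    """Return one '[start -> end] text' line per segment."""
    lines = []
    for seg in result["segments"]:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines


# Stand-in for the dictionary returned by transcribe(); all values are invented.
sample_result = {
    "text": " Hello there. How are you?",
    "language": "en",
    "segments": [
        {"id": 0, "seek": 0, "start": 0.0, "end": 1.5, "text": " Hello there.",
         "tokens": [], "temperature": 0.0, "avg_logprob": -0.25,
         "compression_ratio": 1.1, "no_speech_prob": 0.02},
        {"id": 1, "seek": 0, "start": 1.5, "end": 3.0, "text": " How are you?",
         "tokens": [], "temperature": 0.0, "avg_logprob": -0.31,
         "compression_ratio": 1.1, "no_speech_prob": 0.02},
    ],
}

for line in summarize(sample_result):
    print(line)
```

With a real model, `sample_result` would be replaced by the value returned from `model.transcribe(...)`.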

Internal Flow

Load and preprocess audio: convert the input to a log-mel spectrogram
Detect language (if not specified) using the first 30-second segment
Initialize the seek pointer at frame 0
Main loop, while seek < total frames:
1. Extract the 30-second mel segment at the current seek position
2. Temperature fallback loop, for each temperature in the tuple:
  - Create DecodingOptions with the current temperature and settings
  - Call decode() on the mel segment
  - Check the compression ratio and average log probability against their thresholds
  - If both pass, accept the result and break
  - Otherwise, try the next temperature
3. Apply no-speech detection; a segment judged silent is skipped and seek advances to the next window
4. Parse timestamp tokens into segments
5. Optionally compute word-level timestamps via DTW
6. Append segments to the result list
7. Update the seek position based on the last timestamp
8. Update the prompt context for the next segment
Assemble the final result dictionary
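
The temperature fallback step in the flow above can be sketched in plain Python. This is a simplified illustration, not the library's code: `Attempt`, `decode_once`, `fallback_sketch`, and `fake_decode` are hypothetical names, and the compression ratio is assumed to be raw UTF-8 size divided by zlib-compressed size.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Attempt:
    """Stand-in for one decoding result (hypothetical type)."""
    text: str
    avg_logprob: float
    temperature: float


def compression_ratio(text: str) -> float:
    """Raw size over zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def fallback_sketch(
    decode_once: Callable[[float], Attempt],  # hypothetical: decode one window
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
) -> Attempt:
    if not temperatures:
        raise ValueError("at least one temperature is required")
    for t in temperatures:
        attempt = decode_once(t)
        too_repetitive = (
            compression_ratio_threshold is not None
            and compression_ratio(attempt.text) > compression_ratio_threshold
        )
        too_uncertain = (
            logprob_threshold is not None
            and attempt.avg_logprob < logprob_threshold
        )
        if not (too_repetitive or too_uncertain):
            break  # both checks passed: accept this attempt
    return attempt  # the last attempt is kept if none passed


def fake_decode(temperature: float) -> Attempt:
    """Toy decoder: loops at temperature 0.0, behaves otherwise."""
    if temperature == 0.0:
        return Attempt("la la la " * 40, -0.3, temperature)
    return Attempt("The meeting starts at nine tomorrow morning.", -0.4, temperature)


chosen = fallback_sketch(fake_decode)
print(chosen.temperature, chosen.text)
```

In this toy run the greedy attempt is rejected for being too repetitive, so the loop moves on and accepts the attempt at the next temperature.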

Usage Examples

Simple Transcription

import whisper

model = whisper.load_model("base")

# Simple transcription
result = model.transcribe("speech.mp3")
print(result["text"])

Word-Level Timestamps

import whisper

model = whisper.load_model("base")

result = model.transcribe("speech.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"[{word['start']:.2f} - {word['end']:.2f}] {word['word']}")

Translation Mode

import whisper

model = whisper.load_model("base")

result = model.transcribe("french_speech.mp3", task="translate")
print(result["text"])  # Output in English

Key Notes

The temperature tuple is the primary robustness mechanism. The default (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) tries greedy first, then progressively more random sampling.
Setting condition_on_previous_text=False can help avoid error propagation across segments but reduces consistency.
The **decode_options are forwarded to DecodingOptions, so parameters like task, language, beam_size, best_of, and fp16 are passed as extra keyword arguments to transcribe().
The function is also available as a bound method: model.transcribe(audio) is equivalent to transcribe(model, audio).
For CPU inference, pass fp16=False via **decode_options. FP16 is not supported on CPU; without this setting Whisper emits a warning and falls back to FP32.
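
One way to apply the fp16 note is to derive the option from the device the model runs on. The helper below is a hypothetical sketch, not a Whisper function, and it assumes that "cpu" is the only device type where FP16 should be disabled.

```python
def decode_options_for(device_type: str) -> dict:
    """Hypothetical helper: pick extra transcribe() keyword arguments by device.

    Assumes FP16 should be disabled only when running on the CPU.
    """
    return {"fp16": device_type != "cpu"}


print(decode_options_for("cpu"))   # {'fp16': False}
print(decode_options_for("cuda"))  # {'fp16': True}

# Usage with Whisper (requires the openai-whisper package and an audio file):
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("speech.mp3", **decode_options_for(model.device.type))
```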
