Implementation:Openai Whisper Transcribe
Overview
transcribe() is the high-level function that performs end-to-end audio transcription. It orchestrates the entire Whisper pipeline: audio loading, mel spectrogram computation, sliding window decoding with temperature fallback, segment assembly, and optional word-level timestamp extraction. This is the primary user-facing API for Whisper.
Source
- File: whisper/transcribe.py:L38-514
- Import: from whisper import transcribe, or called as model.transcribe(audio) (bound method)
- Repository: https://github.com/openai/whisper
Signature
def transcribe(
    model: "Whisper",
    audio: Union[str, np.ndarray, torch.Tensor],
    *,
    verbose: Optional[bool] = None,
    temperature: Union[float, Tuple[float, ...]] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
    condition_on_previous_text: bool = True,
    initial_prompt: Optional[str] = None,
    carry_initial_prompt: bool = False,
    word_timestamps: bool = False,
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,，!！?？:：”)]}、",
    clip_timestamps: Union[str, List[float]] = "0",
    hallucination_silence_threshold: Optional[float] = None,
    **decode_options,
) -> dict:
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | Whisper | (required) | Loaded Whisper model instance |
| audio | Union[str, np.ndarray, torch.Tensor] | (required) | File path, NumPy array, or PyTorch tensor of audio waveform |
| verbose | Optional[bool] | None | None: no output; True: print each segment; False: print progress bar |
| temperature | Union[float, Tuple[float, ...]] | (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) | Temperature(s) for the fallback strategy. Each is tried in order until decoding passes the thresholds below. |
| compression_ratio_threshold | Optional[float] | 2.4 | If the compression ratio of the decoded text is above this value, decoding is considered failed (repetitive text). |
| logprob_threshold | Optional[float] | -1.0 | If the average log probability of the decoded tokens is below this value, decoding is considered failed (low confidence). |
| no_speech_threshold | Optional[float] | 0.6 | If the no-speech probability is above this value and the average log probability is below logprob_threshold, the segment is treated as silence. |
| condition_on_previous_text | bool | True | Use the previous segment's output as the prompt for the next segment. |
| initial_prompt | Optional[str] | None | User-provided text to condition the first segment. |
| carry_initial_prompt | bool | False | If True, prepend initial_prompt to every segment's context. |
| word_timestamps | bool | False | Enable word-level timestamps via cross-attention DTW. |
| prepend_punctuations | str | "\"'“¿([{-" | Punctuation merged with the following word for timing. |
| append_punctuations | str | "\"'.。,，!！?？:：”)]}、" | Punctuation merged with the preceding word for timing. |
| clip_timestamps | Union[str, List[float]] | "0" | Time ranges to process, in seconds, as a comma-separated string or list of start,end,start,end,... values. The last end defaults to the end of the audio. |
| hallucination_silence_threshold | Optional[float] | None | With word_timestamps=True, silent periods longer than this duration (in seconds) are skipped when a possible hallucination is detected. |
| **decode_options | | | Additional keyword arguments passed to DecodingOptions (e.g., task, language, beam_size). |
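To illustrate how a clip_timestamps value can be interpreted, the sketch below turns a comma-separated string or a list of seconds into (start, end) frame ranges. The helper name to_seek_clips is hypothetical, and the constant of 100 mel frames per second is an assumption (16 kHz audio with a hop of 160 samples). It approximates the idea and is not a copy of what transcribe() does internally.

```python
from typing import List, Tuple, Union

# Assumption: 16 kHz audio with a hop length of 160 samples gives 100 frames/s.
FRAMES_PER_SECOND = 100


def to_seek_clips(
    clip_timestamps: Union[str, List[float]], content_frames: int
) -> List[Tuple[int, int]]:
    """Convert start,end,start,end,... seconds into (start, end) frame pairs."""
    if isinstance(clip_timestamps, str):
        parts = clip_timestamps.split(",") if clip_timestamps else []
        clip_timestamps = [float(p) for p in parts]
    seek_points = [round(ts * FRAMES_PER_SECOND) for ts in clip_timestamps]
    if not seek_points:
        seek_points.append(0)  # nothing given: start from the beginning
    if len(seek_points) % 2 == 1:
        seek_points.append(content_frames)  # open-ended last clip runs to the end
    return list(zip(seek_points[::2], seek_points[1::2]))
```

Under these assumptions, the default "0" covers the whole file, while "10,20,30" selects seconds 10 to 20 and then second 30 to the end.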
Inputs and Outputs
Inputs
- Audio: File path (str), raw waveform (NumPy array at 16kHz), or PyTorch tensor
- Model: A loaded Whisper model instance
Outputs
A dictionary with three keys:
| Key | Type | Description |
|---|---|---|
| "text" | str | The full transcript as a single concatenated string |
| "segments" | List[dict] | List of segment dictionaries with timing and metadata |
| "language" | str | The detected or specified language code |
Each segment dictionary contains: id, seek, start, end, text, tokens, temperature, avg_logprob, compression_ratio, no_speech_prob. When word_timestamps=True, each segment also contains a "words" list with per-word start, end, word, and probability.
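As a sketch of how the returned dictionary might be consumed, the example below formats each segment as a timestamped line. The sample_result dictionary is hand-written illustrative data containing only some of the segment keys, not real Whisper output, and fmt_time and format_segments are hypothetical helpers.

```python
from typing import Dict, List


def fmt_time(seconds: float) -> str:
    """Render seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result: Dict) -> List[str]:
    """Return one '[start -> end] text' line per segment."""
    lines = []
    for seg in result["segments"]:
        span = f"[{fmt_time(seg['start'])} -> {fmt_time(seg['end'])}]"
        lines.append(f"{span} {seg['text'].strip()}")
    return lines


# Illustrative data only; a real result carries more keys per segment.
sample_result = {
    "text": " Hello there. General greeting.",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.5, "text": " Hello there."},
        {"id": 1, "start": 2.5, "end": 65.04, "text": " General greeting."},
    ],
}

for line in format_segments(sample_result):
    print(line)
```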
Internal Flow
- Load and preprocess audio: convert input to mel spectrogram
- Detect language (if not specified): use first 30-second segment
- Initialize seek pointer at frame 0
- Main loop, while seek < total frames:
  - Extract 30-second mel segment at current seek position
  - Temperature fallback loop, for each temperature in the tuple:
    - Create DecodingOptions with current temperature and settings
    - Call decode() on the mel segment
    - Check compression ratio and log probability against thresholds
    - If both pass, accept the result and break
    - Otherwise, try next temperature
  - Parse timestamp tokens into segments
  - Apply no-speech detection
  - Optionally compute word-level timestamps via DTW
  - Append segments to result list
  - Update seek position based on last timestamp
  - Update prompt context for next segment
- Assemble final result dictionary
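The temperature fallback step above can be sketched in isolation. The example below is a simplified illustration, not Whisper's implementation: Attempt is a stand-in for the real decoding result, decode_fn and fake_decode are hypothetical, and the no-speech handling is omitted. The compression ratio is computed here as raw size over zlib-compressed size, so heavily repetitive text scores high.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Attempt:
    """Stand-in for a decoding result; not Whisper's own result class."""
    text: str
    avg_logprob: float
    temperature: float


def compression_ratio(text: str) -> float:
    """Ratio of raw to zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(
    decode_fn: Callable[[float], Attempt],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
) -> Attempt:
    """Try each temperature in order and return the first acceptable attempt."""
    if not temperatures:
        raise ValueError("at least one temperature is required")
    for t in temperatures:
        attempt = decode_fn(t)
        too_repetitive = (
            compression_ratio_threshold is not None
            and compression_ratio(attempt.text) > compression_ratio_threshold
        )
        low_confidence = (
            logprob_threshold is not None
            and attempt.avg_logprob < logprob_threshold
        )
        if not (too_repetitive or low_confidence):
            break  # both checks passed: accept this attempt
    return attempt  # the last attempt if none passed


def fake_decode(temperature: float) -> Attempt:
    """Hypothetical decoder: loops at temperature 0, behaves above it."""
    if temperature == 0.0:
        return Attempt("la " * 200, -0.3, temperature)
    return Attempt("The quick brown fox jumps over the lazy dog.", -0.4, temperature)


accepted = decode_with_fallback(fake_decode)
print(accepted.temperature, accepted.text)
```

With this fake decoder, the greedy attempt is rejected for repetition and the attempt at the next temperature is accepted. If every temperature fails, the sketch returns the last attempt rather than raising.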
Usage Examples
Simple Transcription
import whisper
model = whisper.load_model("base")
# Simple transcription
result = model.transcribe("speech.mp3")
print(result["text"])
Word-Level Timestamps
import whisper
model = whisper.load_model("base")
result = model.transcribe("speech.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"[{word['start']:.2f} - {word['end']:.2f}] {word['word']}")
Translation Mode
import whisper
model = whisper.load_model("base")
result = model.transcribe("french_speech.mp3", task="translate")
print(result["text"]) # Output in English
Key Notes
- The temperature tuple is the primary robustness mechanism. The default (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) tries greedy first, then progressively more random sampling.
- Setting condition_on_previous_text=False can help avoid error propagation across segments but reduces consistency.
- The **decode_options are forwarded to DecodingOptions, so parameters like task, language, beam_size, best_of, and fp16 are set here.
- The function is also available as a bound method: model.transcribe(audio) is equivalent to transcribe(model, audio).
- For CPU inference, pass fp16=False via **decode_options.
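One way to apply the fp16 note is a small helper that derives the keyword arguments from the device name. decode_options_for is a hypothetical helper, not part of Whisper, and it assumes that every non-CPU device can run half precision, which may not hold for all backends.

```python
from typing import Any, Dict


def decode_options_for(device: str, **overrides: Any) -> Dict[str, Any]:
    """Build keyword arguments for transcribe(); half precision only off CPU."""
    # Assumption: any device whose name does not start with "cpu" supports fp16.
    options: Dict[str, Any] = {"fp16": not device.startswith("cpu")}
    options.update(overrides)  # explicit caller choices win
    return options


# Usage sketch (requires whisper and a loaded model; not executed here):
# result = model.transcribe("speech.mp3", **decode_options_for(str(model.device)))
print(decode_options_for("cpu", task="translate"))
```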