Implementation:Openai Whisper Transcribe

From Leeroopedia

Overview

transcribe() is the high-level function that performs end-to-end audio transcription. It orchestrates the entire Whisper pipeline: audio loading, mel spectrogram computation, sliding window decoding with temperature fallback, segment assembly, and optional word-level timestamp extraction. This is the primary user-facing API for Whisper.

Source

  • File: whisper/transcribe.py:L38-514
  • Import: from whisper import transcribe, or call it as the bound method model.transcribe(audio)
  • Repository: https://github.com/openai/whisper

Signature

def transcribe(
    model: "Whisper",
    audio: Union[str, np.ndarray, torch.Tensor],
    *,
    verbose: Optional[bool] = None,
    temperature: Union[float, Tuple[float, ...]] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
    condition_on_previous_text: bool = True,
    initial_prompt: Optional[str] = None,
    carry_initial_prompt: bool = False,
    word_timestamps: bool = False,
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,,!!??::”)]}、",
    clip_timestamps: Union[str, List[float]] = "0",
    hallucination_silence_threshold: Optional[float] = None,
    **decode_options,
) -> dict:

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | Whisper | (required) | Loaded Whisper model instance. |
| audio | Union[str, np.ndarray, torch.Tensor] | (required) | File path, NumPy array, or PyTorch tensor holding the audio waveform. |
| verbose | Optional[bool] | None | True: print each decoded segment; False: show only a progress bar; None: print nothing. |
| temperature | Union[float, Tuple[float, ...]] | (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) | Sampling temperature, or a tuple of temperatures tried in order whenever decoding fails the compression-ratio or log-probability check. |
| compression_ratio_threshold | Optional[float] | 2.4 | If the compression ratio of the decoded text exceeds this value, decoding is treated as failed (repetitive text). |
| logprob_threshold | Optional[float] | -1.0 | If the average token log probability falls below this value, decoding is treated as failed (low confidence). |
| no_speech_threshold | Optional[float] | 0.6 | If the no-speech probability exceeds this value and the average log probability is below logprob_threshold, the segment is treated as silence. |
| condition_on_previous_text | bool | True | Use the previous segment's output as the prompt for the next segment. |
| initial_prompt | Optional[str] | None | User-provided text used to condition the first segment. |
| carry_initial_prompt | bool | False | If True, prepend initial_prompt to every segment's context. |
| word_timestamps | bool | False | Enable word-level timestamps, computed with dynamic time warping (DTW) over cross-attention weights. |
| prepend_punctuations | str | "\"'“¿([{-" | Punctuation merged with the following word for timing. |
| append_punctuations | str | "\"'.。,,!!??::”)]}、" | Punctuation merged with the preceding word for timing. |
| clip_timestamps | Union[str, List[float]] | "0" | Clips to process, given as start,end,start,end,... in seconds (comma-separated string or list). A missing final end defaults to the end of the audio. |
| hallucination_silence_threshold | Optional[float] | None | With word_timestamps=True, skip silent periods longer than this many seconds when a possible hallucination is detected. |
| **decode_options | | | Additional keyword arguments passed to DecodingOptions (e.g., task, language, beam_size). |
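
To illustrate how a clip_timestamps value maps to time ranges, here is a minimal sketch that turns a comma-separated string or a list into (start, end) pairs. The helper name parse_clips and its audio_duration argument are hypothetical, and the sketch works in seconds for readability; it approximates the behavior described in the table and is not the library's own code.

```python
from typing import List, Tuple, Union


def parse_clips(
    clip_timestamps: Union[str, List[float]],
    audio_duration: float,
) -> List[Tuple[float, float]]:
    """Hypothetical helper: turn clip_timestamps into (start, end) pairs in seconds."""
    if isinstance(clip_timestamps, str):
        # "0,10,20" -> [0.0, 10.0, 20.0]; an empty string yields no points
        points = [float(ts) for ts in clip_timestamps.split(",") if ts.strip()]
    else:
        points = [float(ts) for ts in clip_timestamps]

    if not points:
        points = [0.0]  # no clips given: start from the beginning
    if len(points) % 2 == 1:
        points.append(audio_duration)  # open-ended last clip runs to the end

    # Pair up consecutive values as (start, end)
    return list(zip(points[0::2], points[1::2]))


print(parse_clips("0", 42.0))          # [(0.0, 42.0)]
print(parse_clips([5, 10, 30], 42.0))  # [(5.0, 10.0), (30.0, 42.0)]
```

With the default "0", the whole file is treated as a single clip; an odd number of values leaves the last clip open until the end of the audio.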

Inputs and Outputs

Inputs

  • Audio: File path (str), or a raw waveform sampled at 16 kHz as a NumPy array or PyTorch tensor
  • Model: A loaded Whisper model instance

Outputs

A dictionary with three keys:

| Key | Type | Description |
|---|---|---|
| "text" | str | The full transcript as a single concatenated string |
| "segments" | List[dict] | List of segment dictionaries with timing and metadata |
| "language" | str | The detected or specified language code |

Each segment dictionary contains: id, seek, start, end, text, tokens, temperature, avg_logprob, compression_ratio, no_speech_prob. When word_timestamps=True, each segment also contains a "words" list with per-word start, end, word, and probability.
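
As a sketch of how the returned structure can be consumed, the snippet below formats each segment as a timestamped line. The function name format_segments is hypothetical, and sample_result is a hand-written stand-in that carries only a few of the documented keys, so the example runs without a model or audio file.

```python
def format_segments(result: dict) -> list:
    """Render each segment as '[start -> end] text' using the documented keys."""
    lines = []
    for seg in result["segments"]:
        lines.append(f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}")
    return lines


# Hand-written stand-in for a transcribe() result (illustrative values only;
# real segments also carry seek, tokens, temperature, avg_logprob, and so on).
sample_result = {
    "text": " Hello there. How are you?",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 1.5, "text": " Hello there."},
        {"id": 1, "start": 1.5, "end": 3.25, "text": " How are you?"},
    ],
}

for line in format_segments(sample_result):
    print(line)
```

The same function should work on the dictionary returned by model.transcribe(...), since it reads only start, end, and text.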

Internal Flow

  1. Load and preprocess audio — convert input to mel spectrogram
  2. Detect language (if not specified) — use first 30-second segment
  3. Initialize the seek pointer at the start of the first clip (frame 0 with the default clip_timestamps="0")
  4. Main loop — while seek < total frames:
    1. Extract 30-second mel segment at current seek position
    2. Temperature fallback loop — for each temperature in the tuple:
      • Create DecodingOptions with current temperature and settings
      • Call decode() on the mel segment
      • Check compression ratio and log probability against thresholds
      • If both pass, accept the result and break
      • Otherwise, try next temperature
    3. Apply no-speech detection; if the window is judged silent, skip it and advance seek by one window
    4. Parse timestamp tokens into segments
    5. Optionally compute word-level timestamps via DTW
    6. Append segments to result list
    7. Update seek position based on last timestamp
    8. Update prompt context for next segment
  5. Assemble final result dictionary
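
The temperature fallback step above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the library's implementation: Decoded and fake_decode are hypothetical stand-ins for a decoding result and a decoder, and the no-speech handling is omitted. The compression ratio is computed here as raw byte length divided by zlib-compressed length.

```python
import zlib
from typing import Callable, NamedTuple, Optional, Sequence


class Decoded(NamedTuple):
    """Hypothetical stand-in for a decoding result."""
    text: str
    avg_logprob: float
    temperature: float


def compression_ratio(text: str) -> float:
    # Highly repetitive text compresses well, which gives a high ratio.
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode: Callable[[float], Decoded],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
) -> Decoded:
    if not temperatures:
        raise ValueError("at least one temperature is required")
    result = None
    for t in temperatures:
        result = decode(t)
        too_repetitive = (
            compression_ratio_threshold is not None
            and compression_ratio(result.text) > compression_ratio_threshold
        )
        too_uncertain = (
            logprob_threshold is not None
            and result.avg_logprob < logprob_threshold
        )
        if not (too_repetitive or too_uncertain):
            break  # both checks passed; accept this result
    return result  # the last attempt is kept if every temperature failed


def fake_decode(t: float) -> Decoded:
    # Toy decoder: loops at temperature 0, recovers once sampling is enabled.
    if t == 0.0:
        return Decoded("la " * 200, -0.3, t)
    return Decoded("The quick brown fox jumps over the lazy dog.", -0.4, t)


best = decode_with_fallback(fake_decode)
print(best.temperature, best.text)
```

In this toy run the greedy attempt is rejected for being too repetitive, and the result from the next temperature is accepted.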

Usage Examples

Simple Transcription

import whisper

model = whisper.load_model("base")

# Simple transcription
result = model.transcribe("speech.mp3")
print(result["text"])

Word-Level Timestamps

import whisper

model = whisper.load_model("base")

result = model.transcribe("speech.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"[{word['start']:.2f} - {word['end']:.2f}] {word['word']}")

Translation Mode

import whisper

model = whisper.load_model("base")

result = model.transcribe("french_speech.mp3", task="translate")
print(result["text"])  # Output in English

Key Notes

  • The temperature tuple is the primary robustness mechanism. The default (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) tries greedy first, then progressively more random sampling.
  • Setting condition_on_previous_text=False can help avoid error propagation across segments but reduces consistency.
  • The **decode_options are forwarded to DecodingOptions, so parameters like task, language, beam_size, best_of, and fp16 are set here.
  • The function is also available as a bound method: model.transcribe(audio) is equivalent to transcribe(model, audio).
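  • FP16 is not supported on CPU: Whisper emits a warning and falls back to FP32. Pass fp16=False via **decode_options to make the choice explicit and avoid the warning.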
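
One way to keep such options in a single place is a small helper that builds the keyword arguments before they are forwarded. The name decode_options_for is hypothetical, and treating every non-CPU device as FP16-capable is an assumption made for this sketch.

```python
def decode_options_for(device: str, **overrides) -> dict:
    """Hypothetical helper: pick fp16 from a device string such as 'cpu' or 'cuda'."""
    # Assumption: any device other than the CPU can run FP16 inference.
    options = {"fp16": not device.startswith("cpu")}
    options.update(overrides)  # explicit keyword arguments win
    return options


opts = decode_options_for("cpu", language="en", beam_size=5)
print(opts)  # {'fp16': False, 'language': 'en', 'beam_size': 5}

# With a loaded model, these would be forwarded as:
#   result = model.transcribe("speech.mp3", **opts)
```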

2025-06-25 00:00 GMT
