
Implementation:Openai Whisper Decode

From Leeroopedia

Overview

whisper.decode() is the low-level function that decodes 30-second mel spectrogram segments into text. It accepts either a single segment or a batch of segments, normalizes the input shape, and delegates the actual decoding to the DecodingTask class. It is the programmatic entry point for segment-level speech recognition.

Source

Signature

@torch.no_grad()
def decode(
    model: "Whisper",
    mel: Tensor,
    options: DecodingOptions = DecodingOptions(),
    **kwargs,
) -> Union[DecodingResult, List[DecodingResult]]:

Parameters

| Parameter | Type | Description |
|---|---|---|
| model | Whisper | A loaded Whisper model instance. |
| mel | Tensor | Mel spectrogram of shape (n_mels, 3000) for a single segment or (batch, n_mels, 3000) for batched input. n_mels is 80 for most checkpoints and 128 for large-v3. |
| options | DecodingOptions | Decoding configuration. Defaults to DecodingOptions() (greedy decoding). |
| **kwargs | (keyword arguments) | Overrides for fields in options. A new DecodingOptions is constructed with the overrides applied. |

Inputs and Outputs

  • Inputs: One or more 30-second mel spectrograms produced by log_mel_spectrogram(). Each segment must have 3000 time frames and the number of frequency bins the model expects (80 for most checkpoints, 128 for large-v3).
  • Outputs: A DecodingResult for single input or List[DecodingResult] for batched input.
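
The 3000-frame requirement follows from Whisper's audio constants. The arithmetic below is a small sketch that restates the values from whisper/audio.py (16 kHz sample rate, hop length of 160 samples, 30-second chunks) locally, so the snippet runs without Whisper installed.

```python
# Constants restated locally for illustration; they mirror whisper/audio.py.
SAMPLE_RATE = 16000   # samples per second after resampling
HOP_LENGTH = 160      # samples between successive spectrogram frames
CHUNK_LENGTH = 30     # seconds of audio per segment

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE   # samples in one 30-second segment
N_FRAMES = N_SAMPLES // HOP_LENGTH       # mel frames in one segment

print(N_SAMPLES)  # 480000
print(N_FRAMES)   # 3000
```

Audio that is shorter or longer than 480000 samples should be passed through pad_or_trim() first, so that the resulting spectrogram has exactly 3000 frames.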

DecodingResult Fields

| Field | Type | Description |
|---|---|---|
| audio_features | Tensor | Encoder output features |
| language | str | Detected or specified language code |
| language_probs | Dict[str, float] | Probability distribution over languages |
| tokens | List[int] | Generated token IDs |
| text | str | Decoded text string |
| avg_logprob | float | Average log probability of the generated tokens |
| no_speech_prob | float | Probability that the segment contains no speech |
| temperature | float | Temperature used for decoding |
| compression_ratio | float | Text compression ratio (used for failure detection) |
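
To make the failure-detection fields more concrete, here is a minimal sketch of how a caller might interpret them. The compression_ratio helper follows the usual definition (UTF-8 byte length divided by zlib-compressed length). The classify function is hypothetical and only loosely modeled on the checks transcribe() performs; the thresholds 2.4, -1.0, and 0.6 are assumed to match transcribe()'s defaults and should be tuned for your data.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def classify(avg_logprob: float, no_speech_prob: float, ratio: float) -> str:
    """Hypothetical helper: label a decoded segment from its quality fields."""
    # Assumed thresholds (see lead-in); not part of whisper.decode() itself.
    if no_speech_prob > 0.6 and avg_logprob < -1.0:
        return "no_speech"   # likely silence or noise
    if ratio > 2.4 or avg_logprob < -1.0:
        return "retry"       # repetitive or low-confidence output
    return "ok"


looped = "la " * 200                                     # degenerate, repetitive output
normal = "The quick brown fox jumps over the lazy dog."  # ordinary sentence

print(classify(-0.3, 0.01, compression_ratio(looped)))   # retry
print(classify(-0.3, 0.01, compression_ratio(normal)))   # ok
```

A "retry" label would typically prompt decoding the segment again with a higher temperature, which is the kind of fallback transcribe() applies on top of decode().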

Behavior

  1. If **kwargs are provided, a new DecodingOptions is created with dataclasses.replace(), overriding only the specified fields.
  2. If mel is a 2D tensor (single segment), it is unsqueezed to 3D for batch processing.
  3. A DecodingTask is instantiated with the model and options.
  4. The task's run() method is called on the mel tensor.
  5. If the original input was a single segment (2D), the single DecodingResult is returned directly. Otherwise, the full list is returned.

The function is decorated with @torch.no_grad() to disable gradient computation during inference.
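
The control flow described in the steps above can be sketched without PyTorch or Whisper. In this sketch, Options, Result, Task, ndim, and decode_sketch are all hypothetical stand-ins (for DecodingOptions, DecodingResult, DecodingTask, Tensor.ndim, and decode respectively), and nested lists play the role of tensors. It illustrates the dispatch logic only, not the actual decoding.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Union


@dataclass(frozen=True)
class Options:                      # stand-in for DecodingOptions
    language: Optional[str] = None
    temperature: float = 0.0


@dataclass
class Result:                       # stand-in for DecodingResult
    language: Optional[str]
    text: str


class Task:                         # stand-in for DecodingTask
    def __init__(self, options: Options):
        self.options = options

    def run(self, batch: list) -> List[Result]:
        # A real task runs the encoder and decoder; this returns placeholders.
        return [Result(self.options.language, f"segment {i}") for i in range(len(batch))]


def ndim(x) -> int:
    """Nesting depth of a list of lists, standing in for Tensor.ndim."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        if not x:
            break
        x = x[0]
    return depth


def decode_sketch(mel: list, options: Options = Options(), **kwargs) -> Union[Result, List[Result]]:
    if kwargs:                               # step 1: apply field overrides
        options = replace(options, **kwargs)
    single = ndim(mel) == 2                  # step 2: detect a single segment
    if single:
        mel = [mel]                          # analogous to mel.unsqueeze(0)
    results = Task(options).run(mel)         # steps 3 and 4: build the task and run it
    return results[0] if single else results # step 5: unwrap for single input


segment = [[0.0] * 4 for _ in range(2)]      # toy 2D "spectrogram"
batch = [segment, segment, segment]          # toy 3D batch

print(decode_sketch(segment, language="en")) # one Result
print(len(decode_sketch(batch)))             # 3
```

The return type depends on the rank of the input rather than on the batch size, so a batch containing one segment (a 3D input) still yields a list.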

Usage Example

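import whisper
from whisper import DecodingOptions

# Load the model, then pad or trim the audio to exactly 30 seconds.
model = whisper.load_model("base")
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-mel spectrogram with the bin count the model expects,
# and move it to the same device as the model.
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# fp16=False keeps decoding in float32, which is needed on CPU.
options = DecodingOptions(language="en", fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
print(f"Language: {result.language}")
print(f"No speech prob: {result.no_speech_prob:.3f}")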

Key Notes

  • Each input segment must cover exactly 30 seconds of audio; batching processes several such segments at once. For full audio files of arbitrary length, use transcribe() instead.
  • The mel tensor must be on the same device as the model.
  • fp16 defaults to True in DecodingOptions. Set fp16=False for CPU inference, where half-precision computation is generally not supported.
  • The **kwargs mechanism allows convenient field overrides without constructing a new DecodingOptions manually.
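
Two consequences of the override mechanism are worth seeing in isolation: the original options object is left untouched, and a misspelled field name fails loudly. The sketch below uses OptionsStandIn, a hypothetical frozen dataclass with a small subset of DecodingOptions fields, so it runs without Whisper installed.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class OptionsStandIn:               # hypothetical subset of DecodingOptions
    language: Optional[str] = None
    temperature: float = 0.0
    fp16: bool = True


base = OptionsStandIn(language="en")

# Overriding one field yields a new object; every other field is carried over.
cpu = replace(base, fp16=False)
print(base.fp16, cpu.fp16, cpu.language)

# A misspelled field name is rejected instead of being silently ignored.
error = None
try:
    replace(base, fp_16=False)
except TypeError as exc:
    error = exc
print(type(error).__name__)
```

With the real API, this corresponds to calling whisper.decode(model, mel, options, fp16=False) rather than building a second DecodingOptions by hand.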

See Also

2025-06-25 00:00 GMT
