Implementation:Openai Whisper Decode

Overview

whisper.decode() is the low-level function that decodes one 30-second mel spectrogram segment, or a batch of such segments, into text. It is the programmatic entry point for segment-level speech recognition: it normalizes the options and the input shape, then delegates the actual decoding to the DecodingTask class.

Source

File: whisper/decoding.py:L792-826
Import: from whisper import decode or from whisper.decoding import decode
Repository: https://github.com/openai/whisper

Signature

@torch.no_grad()
def decode(
    model: "Whisper",
    mel: Tensor,
    options: DecodingOptions = DecodingOptions(),
    **kwargs,
) -> Union[DecodingResult, List[DecodingResult]]:

Parameters

Parameter	Type	Description
`model`	`Whisper`	A loaded Whisper model instance.
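`mel`	`Tensor`	Mel spectrogram of shape (n_mels, 3000) for a single segment or (batch, n_mels, 3000) for batched input. n_mels must match the model (`model.dims.n_mels`): 80 for most checkpoints, 128 for large-v3.
`options`	`DecodingOptions`	Decoding configuration. Defaults to `DecodingOptions()`, which selects greedy decoding (temperature 0, no beam search) with `fp16=True`.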
`**kwargs`		Keyword arguments that override fields in `options`. A new `DecodingOptions` is constructed with the overrides applied.
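
The override behaviour can be illustrated without loading a model. The sketch below assumes only that the options object is a frozen dataclass, as DecodingOptions is in whisper/decoding.py. `SketchOptions` and `apply_overrides` are hypothetical stand-ins with a handful of fields, not the library's own class or function.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchOptions:
    # Hypothetical stand-in mirroring a few DecodingOptions fields.
    task: str = "transcribe"
    language: Optional[str] = None
    temperature: float = 0.0
    fp16: bool = True


def apply_overrides(options: SketchOptions, **kwargs) -> SketchOptions:
    # Build a new object only when overrides are given; the original
    # options instance is never mutated (it is frozen).
    return replace(options, **kwargs) if kwargs else options


base = SketchOptions(language="en")
cpu = apply_overrides(base, fp16=False)

print(cpu.fp16)      # False
print(base.fp16)     # True: the original is unchanged
print(cpu.language)  # en: untouched fields carry over
```

With the real API the same idea reads as whisper.decode(model, mel, options, fp16=False), where only the named field differs from options.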

Inputs and Outputs

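Inputs: A 30-second mel spectrogram produced by log_mel_spectrogram(), after the audio has been padded or trimmed to 30 seconds with pad_or_trim(). The tensor must have 3000 time frames and as many frequency bins as the model expects (80 for most checkpoints, 128 for large-v3).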
Outputs: A DecodingResult for single input or List[DecodingResult] for batched input.
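
The figure of 3000 frames follows from the audio constants in whisper/audio.py. The arithmetic below reproduces it; `expected_mel_shape` is a hypothetical helper added for illustration only.

```python
from typing import Optional, Tuple

# Values as defined in whisper/audio.py.
SAMPLE_RATE = 16000   # samples per second
HOP_LENGTH = 160      # samples between successive spectrogram frames
CHUNK_LENGTH = 30     # seconds per segment

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE   # samples in one segment
N_FRAMES = N_SAMPLES // HOP_LENGTH       # mel frames in one segment


def expected_mel_shape(n_mels: int = 80, batch: Optional[int] = None) -> Tuple[int, ...]:
    # Hypothetical helper: the shape decode() expects for one segment
    # (batch=None) or for a batch of segments.
    shape = (n_mels, N_FRAMES)
    return shape if batch is None else (batch, *shape)


print(N_SAMPLES)                    # 480000
print(N_FRAMES)                     # 3000
print(expected_mel_shape())         # (80, 3000)
print(expected_mel_shape(batch=4))  # (4, 80, 3000)
```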

DecodingResult Fields

Field	Type	Description
`audio_features`	`Tensor`	Encoder output features
`language`	`str`	Detected or specified language code
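`language_probs`	`Optional[Dict[str, float]]`	Probability distribution over languages; `None` unless language probabilities were computed and attached to the result (for example with `task="lang_id"`)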
`tokens`	`List[int]`	Generated token IDs
`text`	`str`	Decoded text string
`avg_logprob`	`float`	Average log probability of the generated tokens
`no_speech_prob`	`float`	Probability that the segment contains no speech
`temperature`	`float`	Temperature used for decoding
`compression_ratio`	`float`	Text compression ratio (used for failure detection)
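
To make the failure-detection fields concrete, the sketch below recomputes the compression ratio the same way whisper/utils.py does (UTF-8 byte length divided by zlib-compressed length) and applies two checks loosely modelled on those in transcribe(). `SketchResult` and `needs_retry` are hypothetical names, and the thresholds 2.4 and -1.0 are the transcribe() defaults, used here as assumptions rather than anything decode() enforces itself.

```python
import zlib
from dataclasses import dataclass


def compression_ratio(text: str) -> float:
    # Highly repetitive text compresses well, giving a large ratio.
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))


@dataclass
class SketchResult:
    # Hypothetical subset of DecodingResult, for illustration only.
    text: str
    avg_logprob: float
    no_speech_prob: float


def needs_retry(
    result: SketchResult,
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
) -> bool:
    # Simplified: flag output that is too repetitive or too unlikely.
    if compression_ratio(result.text) > compression_ratio_threshold:
        return True
    if result.avg_logprob < logprob_threshold:
        return True
    return False


looping = SketchResult(text="la " * 200, avg_logprob=-0.2, no_speech_prob=0.01)
normal = SketchResult(
    text="The quick brown fox jumps over the lazy dog.",
    avg_logprob=-0.2,
    no_speech_prob=0.01,
)
print(needs_retry(looping))  # True
print(needs_retry(normal))   # False
```

A caller working directly with decode() could use a check like this to decide whether to decode the segment again at a higher temperature.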

Behavior

If **kwargs are provided, a new DecodingOptions is created by replacing the specified fields.
If mel is a 2D tensor (single segment), it is unsqueezed to 3D for batch processing.
A DecodingTask is instantiated with the model and options.
The task's run() method is called on the mel tensor.
If the original input was a single segment (2D), the single DecodingResult is returned directly. Otherwise, the full list is returned.

The function is decorated with @torch.no_grad() to disable gradient computation during inference.
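
The single-versus-batch handling described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: nested lists stand in for tensors, `nesting_depth` stands in for Tensor.ndim, and `run` stands in for DecodingTask.run(); all three names are hypothetical.

```python
from typing import Any, Callable, List, Union


def nesting_depth(x: Any) -> int:
    # Stand-in for Tensor.ndim: counts nested list levels.
    depth = 0
    while isinstance(x, list):
        depth += 1
        if not x:
            break
        x = x[0]
    return depth


def decode_sketch(run: Callable[[list], List[str]], mel: list) -> Union[str, List[str]]:
    # A 2D input is one segment: wrap it so the task always sees a batch.
    single = nesting_depth(mel) == 2
    if single:
        mel = [mel]
    results = run(mel)
    # Unwrap again so a single segment yields a single result.
    return results[0] if single else results


def fake_run(batch: list) -> List[str]:
    # Hypothetical task: one placeholder result per segment.
    return [f"segment with {len(seg)} rows" for seg in batch]


segment = [[0.0] * 4 for _ in range(2)]         # "2D" input
print(decode_sketch(fake_run, segment))           # segment with 2 rows
print(decode_sketch(fake_run, [segment] * 3))     # list of three results
```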

Usage Example

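import whisper
from whisper import DecodingOptions

model = whisper.load_model("base")

# Load the audio and pad or trim it to exactly 30 seconds.
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)

# Build the log-mel spectrogram with the bin count the model expects,
# then move it to the model's device.
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# fp16=False keeps decoding in FP32, which is what CPU inference needs.
options = DecodingOptions(language="en", fp16=False)
result = whisper.decode(model, mel, options)

print(result.text)
print(f"Language: {result.language}")
print(f"No speech prob: {result.no_speech_prob:.3f}")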

Key Notes

Each mel input covers exactly one 30-second segment; a batch is a set of independent segments, not a longer recording. For full audio files, use transcribe() instead.
The mel tensor must be on the same device as the model.
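fp16 defaults to True. For CPU inference, set fp16=False: transcribe() falls back to FP32 on CPU automatically, but decode() uses the options as given.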
The **kwargs mechanism allows convenient field overrides without constructing a new DecodingOptions manually.
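
To show what "one 30-second segment" means for longer recordings, the sketch below cuts a sample list into fixed 30-second windows and zero-pads the last one. This is a deliberately naive illustration: `split_into_segments` is a hypothetical helper, and it is not how transcribe() works, which slides its window based on predicted timestamps and can condition on previously decoded text.

```python
from typing import List

SAMPLE_RATE = 16000            # as in whisper/audio.py
N_SAMPLES = 30 * SAMPLE_RATE   # samples in one 30-second segment


def split_into_segments(samples: List[float], n_samples: int = N_SAMPLES) -> List[List[float]]:
    # Hypothetical helper: fixed, non-overlapping windows of n_samples.
    segments = []
    for start in range(0, len(samples), n_samples):
        chunk = samples[start:start + n_samples]
        # Zero-pad the final window so every segment has equal length.
        chunk = chunk + [0.0] * (n_samples - len(chunk))
        segments.append(chunk)
    return segments


# Tiny window size so the behaviour is easy to see.
print(split_into_segments([1.0] * 7, n_samples=3))
# [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
```

Each resulting segment would still need to go through log_mel_spectrogram() before being passed to decode(), and words that straddle a window boundary may be cut, which is one reason transcribe() is the better choice for whole files.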
