Implementation:Openai Whisper Decode
Overview
whisper.decode() is the low-level function for decoding a single 30-second mel spectrogram segment into text. It is the programmatic entry point for single-segment speech recognition, wrapping the DecodingTask class and handling input normalization.
Source
- File: whisper/decoding.py:L792-826
- Import: from whisper import decode or from whisper.decoding import decode
- Repository: https://github.com/openai/whisper
Signature
@torch.no_grad()
def decode(
    model: "Whisper",
    mel: Tensor,
    options: DecodingOptions = DecodingOptions(),
    **kwargs,
) -> Union[DecodingResult, List[DecodingResult]]:
Parameters
| Parameter | Type | Description |
|---|---|---|
| model | Whisper | A loaded Whisper model instance. |
| mel | Tensor | Mel spectrogram of shape (80, 3000) for a single segment or (batch, 80, 3000) for batched input. |
| options | DecodingOptions | Decoding configuration. Defaults to DecodingOptions() (greedy decoding). |
| **kwargs | | Keyword arguments that override fields in options. A new DecodingOptions is constructed with the overrides applied. |
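The override behavior of **kwargs can be illustrated without loading a model. The following is a minimal sketch, assuming the options object is a frozen dataclass whose fields are swapped with dataclasses.replace; SketchOptions and apply_overrides are hypothetical stand-ins written for this example, and only three fields are modelled.

```python
from dataclasses import dataclass, replace
from typing import Optional


# Hypothetical stand-in for DecodingOptions; the real class has more fields.
@dataclass(frozen=True)
class SketchOptions:
    language: Optional[str] = None
    temperature: float = 0.0
    fp16: bool = True


def apply_overrides(options: SketchOptions, **kwargs) -> SketchOptions:
    # Build a new options object only when overrides were actually passed.
    if kwargs:
        options = replace(options, **kwargs)
    return options


base = SketchOptions(language="en")
cpu_options = apply_overrides(base, fp16=False)  # new object; base is untouched
```

Because the stand-in is frozen, the original options object is never mutated, so one base configuration can be shared safely across calls that each override a different field.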
Inputs and Outputs
- Inputs: A 30-second mel spectrogram produced by log_mel_spectrogram(). The tensor must have 80 frequency bins and 3000 time frames.
- Outputs: A DecodingResult for single input or List[DecodingResult] for batched input.
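The 3000-frame requirement follows from the segment length. The sketch below works through the arithmetic, assuming 16 kHz audio and a hop of 160 samples between frames; these constants are restated here as assumptions so the block runs without whisper installed, and is_valid_mel_shape is a hypothetical helper rather than part of the library.

```python
# Constants assumed for illustration, restated so the sketch is self-contained.
SAMPLE_RATE = 16000      # samples per second
CHUNK_LENGTH = 30        # seconds per segment
HOP_LENGTH = 160         # samples between successive spectrogram frames
N_MELS = 80              # frequency bins, as stated above

N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH   # 480000 samples per segment
N_FRAMES = N_SAMPLES // HOP_LENGTH       # 3000 frames per segment


def is_valid_mel_shape(shape: tuple) -> bool:
    """Accept (80, 3000) or (batch, 80, 3000), the two layouts described above."""
    return len(shape) in (2, 3) and tuple(shape[-2:]) == (N_MELS, N_FRAMES)
```

A check like this can be run on tuple(mel.shape) before calling decode() to catch audio that was not padded or trimmed to 30 seconds.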
DecodingResult Fields
| Field | Type | Description |
|---|---|---|
| audio_features | Tensor | Encoder output features |
| language | str | Detected or specified language code |
| language_probs | Dict[str, float] | Probability distribution over languages |
| tokens | List[int] | Generated token IDs |
| text | str | Decoded text string |
| avg_logprob | float | Average log probability of the generated tokens |
| no_speech_prob | float | Probability that the segment contains no speech |
| temperature | float | Temperature used for decoding |
| compression_ratio | float | Text compression ratio (used for failure detection) |
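To show why compression_ratio is useful for failure detection, here is a minimal sketch that computes the ratio of raw UTF-8 size to zlib-compressed size. Highly repetitive output, a common sign of a decoding loop, compresses well and therefore scores high. The looks_degenerate helper and its 2.4 cutoff are assumptions chosen for illustration, not values stated on this page.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; repetitive text scores high."""
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))


def looks_degenerate(text: str, threshold: float = 2.4) -> bool:
    # threshold is an assumed cutoff for illustration only
    return compression_ratio(text) > threshold


normal = "The quick brown fox jumps over the lazy dog."
looped = "thank you " * 100   # simulates a decoder stuck repeating itself
```

In practice the same kind of check could be applied to result.text, or the precomputed result.compression_ratio field could be compared against a chosen cutoff directly.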
Behavior
- If **kwargs are provided, a new DecodingOptions is created by replacing the specified fields.
- If mel is a 2D tensor (single segment), it is unsqueezed to 3D for batch processing.
- A DecodingTask is instantiated with the model and options.
- The task's run() method is called on the mel tensor.
- If the original input was a single segment (2D), the single DecodingResult is returned directly. Otherwise, the full list is returned.
The function is decorated with @torch.no_grad() to disable gradient computation during inference.
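The control flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: SketchOptions, SketchResult, SketchTask, ndim, and decode_sketch are hypothetical stand-ins, nested lists take the place of tensors, and the task returns placeholder strings instead of running a model.

```python
from dataclasses import dataclass, replace
from typing import List, Union


@dataclass(frozen=True)
class SketchOptions:          # hypothetical stand-in for DecodingOptions
    language: str = "en"


@dataclass
class SketchResult:           # hypothetical stand-in for DecodingResult
    text: str


class SketchTask:             # hypothetical stand-in for DecodingTask
    def __init__(self, model, options: SketchOptions):
        self.model, self.options = model, options

    def run(self, batch: list) -> List[SketchResult]:
        # One placeholder result per segment in the batch.
        return [SketchResult(text=f"segment {i} [{self.options.language}]")
                for i, _ in enumerate(batch)]


def ndim(x) -> int:
    """Nesting depth of a non-empty list, standing in for Tensor.ndim."""
    return 1 + ndim(x[0]) if isinstance(x, list) else 0


def decode_sketch(model, mel, options=SketchOptions(), **kwargs
                  ) -> Union[SketchResult, List[SketchResult]]:
    if kwargs:                                     # apply field overrides
        options = replace(options, **kwargs)
    single = ndim(mel) == 2
    if single:                                     # promote one segment to a batch of one
        mel = [mel]
    results = SketchTask(model, options).run(mel)  # build the task and run it
    return results[0] if single else results       # unwrap for single input


segment = [[0.0] * 4 for _ in range(2)]   # tiny 2D placeholder, not a real (80, 3000) mel
one = decode_sketch(None, segment)
many = decode_sketch(None, [segment, segment], language="de")
```

The point of the single flag is that callers get back the same shape they passed in: one result for one segment, a list for a batch.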
Usage Example
import whisper
from whisper import DecodingOptions
model = whisper.load_model("base")
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = DecodingOptions(language="en", fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
print(f"Language: {result.language}")
print(f"No speech prob: {result.no_speech_prob:.3f}")
Key Notes
- Each input to this function is a single 30-second segment, or a batch of such segments. For full audio files, use transcribe() instead.
- The mel tensor must be on the same device as the model.
- Setting fp16=False is required for CPU inference.
- The **kwargs mechanism allows convenient field overrides without constructing a new DecodingOptions manually.