Implementation:Openai Whisper Detect Language
Overview
detect_language() identifies the spoken language in an audio segment by running a single decoder forward pass with a start-of-transcript token and selecting the most probable language token. It returns both the predicted language token and a probability distribution over all supported languages.
Source
- File: whisper/decoding.py:L18-77
- Repository: https://github.com/openai/whisper
Signature
@torch.no_grad()
def detect_language(
model: "Whisper", mel: Tensor, tokenizer: Tokenizer = None
) -> Tuple[Tensor, List[dict]]:
Import
from whisper.decoding import detect_language
# or
import whisper # re-exported; also available as model.detect_language()
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | Whisper | (required) | A loaded multilingual Whisper model instance. English-only models are not supported |
| mel | Tensor | (required) | Mel spectrogram of shape (n_mels, n_frames) for a single input or (batch, n_mels, n_frames) for batched input. May instead be pre-encoded audio features (encoder output) whose last two dimensions are (n_audio_ctx, n_audio_state) |
| tokenizer | Tokenizer | None | Optional tokenizer instance. If None, one is automatically created from the model's configuration |
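The accepted input layouts can be told apart from the tensor shape alone. The following is a minimal, torch-free sketch of that distinction. classify_input is a hypothetical helper written for illustration, and its default sizes (1500 encoder positions, hidden width 512) are assumed to correspond to the base checkpoint.

```python
def classify_input(shape, n_audio_ctx=1500, n_audio_state=512):
    """Describe how an input of the given shape would be treated.

    Hypothetical helper for illustration; the default sizes are assumptions.
    """
    if len(shape) not in (2, 3):
        raise ValueError(f"expected a 2-D or 3-D shape, got {shape}")
    unbatched = len(shape) == 2
    # An unbatched input gains a leading batch dimension of size 1.
    full = (1, *shape) if unbatched else tuple(shape)
    # Encoder output has (n_audio_ctx, n_audio_state) as its last two dimensions.
    pre_encoded = full[-2:] == (n_audio_ctx, n_audio_state)
    return {
        "batch_size": full[0],
        "unbatched": unbatched,
        "needs_encoder": not pre_encoded,
    }

print(classify_input((80, 3000)))      # unbatched mel spectrogram
print(classify_input((4, 80, 3000)))   # batch of four mel spectrograms
print(classify_input((4, 1500, 512)))  # batch of pre-encoded features
```

Only the last case skips the encoder. A mel spectrogram never matches the pre-encoded check under these assumed sizes, because its last two dimensions are (n_mels, n_frames).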
Inputs and Outputs
Inputs
- A mel spectrogram tensor (padded/trimmed to 30 seconds) from the audio preprocessing pipeline, or pre-encoded audio features
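The relationship between the 30-second window and the spectrogram length can be checked with simple arithmetic. The sketch below assumes the audio constants used by the library's preprocessing (16 kHz sampling, a hop of 160 samples between frames); the variable names here are illustrative.

```python
SAMPLE_RATE = 16000  # audio samples per second (assumed preprocessing rate)
HOP_LENGTH = 160     # samples between consecutive spectrogram frames (assumed)
CHUNK_LENGTH = 30    # seconds per window

n_samples = CHUNK_LENGTH * SAMPLE_RATE  # samples in one padded/trimmed window
n_frames = n_samples // HOP_LENGTH      # mel frames in one window
print(n_samples, n_frames)              # 480000 3000
```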
Outputs
Returns a tuple of two elements:
- language_tokens: Tensor of shape (batch,) containing the most probable language token IDs (argmax over the language-token logits). For an unbatched input, the batch dimension is dropped and a scalar tensor is returned
- language_probs: List[Dict[str, float]] with one dictionary per batch element, each mapping language codes (e.g., "en", "fr", "zh") to probabilities. For an unbatched input, a single dictionary is returned rather than a list
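Because an unbatched call yields one dictionary while a batched call yields a list of dictionaries, calling code may want to normalize the two cases. A minimal sketch, assuming only the return shapes described here; summarize is a hypothetical helper and the probabilities are made up for illustration.

```python
def summarize(language_probs):
    """Return the (code, probability) pair of the top language for each input.

    Hypothetical helper: accepts either one dict (unbatched call) or a list
    of dicts (batched call).
    """
    if isinstance(language_probs, dict):
        batch = [language_probs]
    else:
        batch = list(language_probs)
    return [max(p.items(), key=lambda kv: kv[1]) for p in batch]

# Made-up probabilities, for illustration only.
single = {"en": 0.91, "fr": 0.06, "zh": 0.03}
batched = [single, {"en": 0.10, "fr": 0.85, "zh": 0.05}]
print(summarize(single))
print(summarize(batched))
```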
Behavior
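- Creates a tokenizer from the model's configuration if none is given, then validates that it has language tokens; English-only variants fail this check and raise a ValueError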
- Handles input dimensions — if mel is unbatched (n_mels, n_frames), unsqueezes to add a batch dimension
- Encodes the mel spectrogram through the audio encoder unless the input is already encoded. Pre-encoded input is detected by checking whether the last two dimensions equal (n_audio_ctx, n_audio_state) from the model's dimensions
- Creates an initial token tensor containing a single SOT (start-of-transcript) token per batch element
- Runs a single decoder forward pass to get logits for the next token position
- Suppresses all non-language tokens by setting their logits to negative infinity
- Computes argmax over the remaining logits to get the predicted language token per batch element
- Computes softmax to get the full probability distribution over language tokens
- Maps token IDs back to language code strings using the tokenizer
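- Returns the language token tensor and the list of probability dictionaries; if the input was unbatched, the added batch dimension is removed, so a scalar tensor and a single dictionary are returned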
The function is decorated with @torch.no_grad() to disable gradient computation, reducing memory usage and improving speed during inference.
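The masking, argmax, and softmax steps above can be sketched on a single row of logits. This is a toy re-enactment in plain Python rather than the library's tensor code; detect_from_logits, the six-token vocabulary, and the logit values are all hypothetical.

```python
import math

def detect_from_logits(logits, language_token_ids, language_codes):
    """Toy version of the suppress / argmax / softmax steps for one input."""
    allowed = set(language_token_ids)
    # Suppress every non-language token by setting its logit to -inf.
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    # Argmax over the masked logits gives the predicted language token ID.
    best_token = max(range(len(masked)), key=masked.__getitem__)
    # Softmax; exp(-inf) is 0.0, so suppressed tokens get zero probability.
    peak = masked[best_token]
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    # Map language token IDs back to language codes.
    probs = {
        code: exps[tok] / total
        for tok, code in zip(language_token_ids, language_codes)
    }
    return best_token, probs

# Hypothetical vocabulary in which IDs 2, 3 and 4 are the language tokens.
logits = [9.0, 1.0, 2.0, 0.5, 3.0, 8.0]
token, probs = detect_from_logits(logits, [2, 3, 4], ["en", "fr", "zh"])
print(token, probs)
```

Tokens 0 and 5 have the largest raw logits, yet neither can be selected: after masking, only language tokens compete, and the probabilities over them sum to 1.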
Example
import whisper
# Load a multilingual model
model = whisper.load_model("base")
# Preprocess audio
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# Detect language (using model method shortcut)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
# Print top 5 languages by probability
for lang, prob in sorted(probs.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {lang}: {prob:.4f}")
# Batched detection
import torch
batch = torch.stack([mel, mel]).to(model.device) # batch of 2
tokens, probs_list = whisper.detect_language(model, batch)
for i, probs in enumerate(probs_list):
    detected = max(probs, key=probs.get)
    print(f"Audio {i}: {detected} ({probs[detected]:.4f})")
Notes
- This function only works with multilingual models. Calling it with an English-only model (e.g., tiny.en) will raise a ValueError
- The mel spectrogram should be padded or trimmed to 30 seconds (3000 frames) before passing to this function
- Language detection requires one encoder pass (skipped for pre-encoded input) and a single decoder step, making it much faster than full transcription, which decodes token by token
- The @torch.no_grad() decorator ensures no gradient tensors are stored, minimizing memory usage
- The function can accept pre-encoded audio features (the encoder output, with last two dimensions (n_audio_ctx, n_audio_state)) in place of a mel spectrogram, which avoids redundant encoding when the encoder output is reused for subsequent transcription
- For the most accurate results, the audio should contain actual speech rather than silence or music
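One way to guard against unreliable detections on non-speech audio is to accept the top language only when its probability clears a threshold. This is an illustrative pattern, not library behavior; pick_language, the 0.5 threshold, and the "en" fallback are all hypothetical choices, and the probabilities below are made up.

```python
def pick_language(probs, min_confidence=0.5, fallback="en"):
    """Return the top language code, or the fallback if confidence is low.

    Hypothetical helper; the threshold and fallback are assumptions to tune.
    """
    code, p = max(probs.items(), key=lambda kv: kv[1])
    return code if p >= min_confidence else fallback

# Made-up distributions, for illustration only.
confident = {"fr": 0.93, "en": 0.04, "es": 0.03}
uncertain = {"fr": 0.36, "en": 0.33, "es": 0.31}
print(pick_language(confident))
print(pick_language(uncertain))
```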
Metadata
Principle:Openai_Whisper_Language_Detection Environment:Openai_Whisper_PyTorch_CUDA 2025-06-25 00:00 GMT