Implementation:Openai Whisper Detect Language

Overview

detect_language() identifies the spoken language in an audio segment by running a single decoder forward pass with a start-of-transcript token and selecting the most probable language token. It returns both the predicted language token and a probability distribution over all supported languages.

Source

File: whisper/decoding.py:L18-77
Repository: https://github.com/openai/whisper

Signature

@torch.no_grad()
def detect_language(
    model: "Whisper", mel: Tensor, tokenizer: Tokenizer = None
) -> Tuple[Tensor, List[dict]]:

Import

from whisper.decoding import detect_language
# or
import whisper  # re-exported; also available as model.detect_language()

Parameters

Parameter	Type	Default	Description
model	Whisper	(required)	A loaded multilingual Whisper model instance. English-only models are not supported
mel	Tensor	(required)	Mel spectrogram of shape (n_mels, n_frames) for single input or (batch, n_mels, n_frames) for batched input. Can also be pre-encoded audio features
tokenizer	Tokenizer	None	Optional tokenizer instance. If None, one is automatically created from the model's configuration

Inputs and Outputs

Inputs

A mel spectrogram tensor (padded/trimmed to 30 seconds) from the audio preprocessing pipeline, or pre-encoded audio features
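
The two input kinds can be told apart by shape alone: in the reference implementation, encoder output has trailing dimensions (n_audio_ctx, n_audio_state), and anything else is treated as a mel spectrogram. The following is a minimal sketch of that check on plain shape tuples. The helper name needs_encoding is hypothetical, and the sizes are assumed to match the "base" model.

```python
from typing import Tuple


def needs_encoding(shape: Tuple[int, ...], n_audio_ctx: int, n_audio_state: int) -> bool:
    """Return True if a tensor of this shape still has to go through the encoder.

    Encoder output has trailing dimensions (n_audio_ctx, n_audio_state);
    any other shape is treated as a mel spectrogram.
    """
    return tuple(shape[-2:]) != (n_audio_ctx, n_audio_state)


# Illustrative sizes, assumed here to match the "base" model.
N_AUDIO_CTX, N_AUDIO_STATE = 1500, 512

print(needs_encoding((80, 3000), N_AUDIO_CTX, N_AUDIO_STATE))      # unbatched mel
print(needs_encoding((2, 80, 3000), N_AUDIO_CTX, N_AUDIO_STATE))   # batched mel
print(needs_encoding((2, 1500, 512), N_AUDIO_CTX, N_AUDIO_STATE))  # encoder output
```

With a real model, the two sizes would come from model.dims rather than from constants.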

Outputs

Returns a tuple of two elements:

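language_tokens: Tensor of shape (batch,) containing the most probable language token IDs (argmax of the language probability distribution). For an unbatched input, the batch dimension is dropped and a scalar tensor is returned
language_probs: List[Dict[str, float]], one dictionary per batch element, where each dictionary maps language codes (e.g., "en", "fr", "zh") to their probabilities. For an unbatched input, a single dictionary is returned instead of a list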
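
Because an unbatched call yields a single dictionary while a batched call yields a list of dictionaries, calling code often has to handle both shapes. The sketch below shows one way to normalize them. The helper best_per_input is hypothetical and the probabilities are made up, not real model output.

```python
from typing import Dict, List, Tuple, Union

LanguageProbs = Dict[str, float]


def best_per_input(
    language_probs: Union[LanguageProbs, List[LanguageProbs]],
) -> List[Tuple[str, float]]:
    """Return the (code, probability) pair of the top language for every input."""
    if isinstance(language_probs, dict):  # unbatched call: a single dictionary
        language_probs = [language_probs]
    return [max(p.items(), key=lambda item: item[1]) for p in language_probs]


# Made-up probabilities standing in for detect_language() output.
single = {"en": 0.91, "de": 0.06, "fr": 0.03}
batched = [single, {"en": 0.10, "de": 0.15, "fr": 0.75}]

print(best_per_input(single))   # one pair for the unbatched case
print(best_per_input(batched))  # one pair per batch element
```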

Behavior

Validates that the model is multilingual (not an English-only variant)
Handles input dimensions — if mel is unbatched (n_mels, n_frames), unsqueezes to add a batch dimension
Encodes the mel spectrogram through the audio encoder (if not already encoded). Detects pre-encoded input by checking whether the last two dimensions equal (n_audio_ctx, n_audio_state) from the model's dimensions
Creates an initial token tensor containing a single SOT (start-of-transcript) token per batch element
Runs a single decoder forward pass to get logits for the next token position
Suppresses all non-language tokens by setting their logits to negative infinity
Computes argmax over the remaining logits to get the predicted language token per batch element
Computes softmax to get the full probability distribution over language tokens
Maps token IDs back to language code strings using the tokenizer
Returns the language token tensor and the list of probability dictionaries, dropping the batch dimension again if the input was unbatched

The function is decorated with @torch.no_grad() to disable gradient computation, reducing memory usage and improving speed during inference.
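
The masking, argmax, and softmax steps above can be illustrated without PyTorch. The following is a simplified sketch on a toy six-token vocabulary, not the library's code: the function name language_distribution, the token IDs, and the logits are all invented for illustration.

```python
import math
from typing import Dict, Sequence, Tuple


def language_distribution(
    logits: Sequence[float],
    language_token_ids: Sequence[int],
    language_codes: Sequence[str],
) -> Tuple[int, Dict[str, float]]:
    """Mask non-language logits, then take argmax and softmax over the rest."""
    allowed = set(language_token_ids)
    # Suppress every token that is not a language token.
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]

    # Argmax over the masked logits gives the predicted language token.
    best_token = max(range(len(masked)), key=masked.__getitem__)

    # Numerically stable softmax; exp(-inf) is 0.0, so masked tokens get no mass.
    peak = masked[best_token]
    weights = [math.exp(x - peak) for x in masked]
    total = sum(weights)

    # Map language token IDs back to language codes.
    probs = {
        code: weights[token_id] / total
        for token_id, code in zip(language_token_ids, language_codes)
    }
    return best_token, probs


# Toy vocabulary: IDs 2, 3 and 4 play the role of language tokens.
toy_logits = [5.0, 1.0, 2.0, 0.5, 3.0, 4.0]
token, probs = language_distribution(toy_logits, [2, 3, 4], ["en", "fr", "zh"])
print(token, probs)
```

Token 0 has the largest raw logit in this toy input, but it is not a language token, so masking removes it before the argmax and it receives zero probability.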

Example

import whisper

# Load a multilingual model
model = whisper.load_model("base")

# Preprocess audio
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)
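# Match the number of mel bins to the model (not every checkpoint uses the default of 80)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)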

# Detect language (using model method shortcut)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Print top 5 languages by probability
for lang, prob in sorted(probs.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {lang}: {prob:.4f}")

# Batched detection
import torch
batch = torch.stack([mel, mel]).to(model.device)  # batch of 2
tokens, probs_list = whisper.detect_language(model, batch)
for i, probs in enumerate(probs_list):
    detected = max(probs, key=probs.get)
    print(f"Audio {i}: {detected} ({probs[detected]:.4f})")

Notes

This function only works with multilingual models. Calling it with an English-only model (e.g., tiny.en) will raise a ValueError
The mel spectrogram should be padded or trimmed to 30 seconds (3000 frames) before passing to this function
Language detection requires only one decoder step, making it significantly faster than full transcription
The @torch.no_grad() decorator ensures no gradient tensors are stored, minimizing memory usage
The function can accept pre-encoded audio features (output of the encoder) to avoid redundant encoding when the encoder output is reused for subsequent transcription
For the most accurate results, the audio should contain actual speech rather than silence or music
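
Since detection on silence or music can be unreliable, a caller may want to treat a low top probability as "undetermined". The sketch below shows one possible guard. The helper confident_language and the 0.5 threshold are arbitrary choices for illustration, not documented Whisper behavior, and a low probability is only a rough signal.

```python
from typing import Dict, Optional


def confident_language(
    probs: Dict[str, float],
    min_prob: float = 0.5,  # arbitrary illustrative threshold
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Return the top language code, or `fallback` if its probability is too low."""
    best = max(probs, key=probs.get)
    return best if probs[best] >= min_prob else fallback


# Made-up distributions: one clear-cut, one ambiguous.
clear = {"en": 0.90, "de": 0.07, "fr": 0.03}
ambiguous = {"en": 0.40, "de": 0.35, "fr": 0.25}

print(confident_language(clear))
print(confident_language(ambiguous))
print(confident_language(ambiguous, fallback="en"))
```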

Metadata

Principle:Openai_Whisper_Language_Detection Environment:Openai_Whisper_PyTorch_CUDA 2025-06-25 00:00 GMT
