Implementation:Openai Whisper Detect Language

From Leeroopedia

Overview

detect_language() identifies the spoken language in an audio segment by running a single decoder forward pass with a start-of-transcript token and selecting the most probable language token. It returns both the predicted language token and a probability distribution over all supported languages.

Source

Signature

@torch.no_grad()
def detect_language(
    model: "Whisper", mel: Tensor, tokenizer: Tokenizer = None
) -> Tuple[Tensor, List[dict]]:

Import

from whisper.decoding import detect_language
# or
import whisper  # re-exported; also available as model.detect_language()

Parameters

Parameter | Type | Default | Description
model | Whisper | (required) | A loaded multilingual Whisper model instance. English-only models are not supported.
mel | Tensor | (required) | Mel spectrogram of shape (n_mels, n_frames) for a single input or (batch, n_mels, n_frames) for batched input. Can also be pre-encoded audio features.
tokenizer | Tokenizer | None | Optional tokenizer instance. If None, one is created automatically from the model's configuration.

Inputs and Outputs

Inputs

  • A mel spectrogram tensor (padded/trimmed to 30 seconds) from the audio preprocessing pipeline, or pre-encoded audio features
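
The two accepted input kinds can be told apart by shape alone. The sketch below is a minimal illustration in plain Python, not the library's code: classify_input is a hypothetical helper, the dimension constants are assumed values for the multilingual "base" checkpoint, and the check assumes encoded features are recognized by their trailing dimensions matching the encoder output shape.

```python
from typing import Tuple

# Assumed dimensions for the multilingual "base" checkpoint; other sizes differ.
N_MELS = 80          # mel frequency bins
N_FRAMES = 3000      # 30 seconds of audio at 100 frames per second
N_AUDIO_CTX = 1500   # number of encoder output positions
N_AUDIO_STATE = 512  # encoder hidden size


def classify_input(shape: Tuple[int, ...]) -> str:
    """Describe how an input of this shape would be treated (hypothetical helper)."""
    if len(shape) not in (2, 3):
        raise ValueError(f"expected a 2-D or 3-D tensor, got shape {shape}")
    batching = "batched" if len(shape) == 3 else "single"
    if shape[-2:] == (N_AUDIO_CTX, N_AUDIO_STATE):
        # Already encoded: the encoder pass can be skipped.
        return f"{batching} encoded features"
    if shape[-2:] == (N_MELS, N_FRAMES):
        # Raw mel spectrogram: must go through the encoder first.
        return f"{batching} mel spectrogram"
    raise ValueError(f"unrecognized input shape {shape}")


print(classify_input((80, 3000)))
print(classify_input((4, 1500, 512)))
```

With a real model, the same dimensions would be read from the model's own configuration rather than hard-coded.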

Outputs

Returns a tuple of two elements:

  • language_tokens: Tensor of shape (batch,) containing the most probable language token IDs (argmax of the language probability distribution). For unbatched input the batch dimension is dropped and a scalar tensor is returned.
  • language_probs: List[Dict[str, float]] where each dictionary maps language codes (e.g., "en", "fr", "zh") to probabilities, one dictionary per batch element. For unbatched input a single dictionary is returned instead of a list.
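
Because the unbatched call in the Example section uses the second return value as a single dictionary while the batched call iterates over a list, downstream code may want to normalize both cases. The following is a small sketch in plain Python; as_batch and top_languages are hypothetical helpers, and the probabilities are made-up stand-ins for real model output.

```python
from typing import Dict, List, Tuple, Union

LanguageProbs = Dict[str, float]


def as_batch(probs: Union[LanguageProbs, List[LanguageProbs]]) -> List[LanguageProbs]:
    """Wrap a single dictionary in a list so callers can always iterate."""
    return [probs] if isinstance(probs, dict) else list(probs)


def top_languages(probs: LanguageProbs, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (code, probability) pairs, highest first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up probabilities standing in for real model output.
example = {"en": 0.91, "de": 0.05, "fr": 0.03, "nl": 0.01}
for probs in as_batch(example):
    print(top_languages(probs, k=2))
```

Sorting once and slicing keeps the helper independent of how many languages the tokenizer supports.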

Behavior

  1. Validates that the model is multilingual (not an English-only variant)
  2. Handles input dimensions — if mel is unbatched (n_mels, n_frames), unsqueezes to add a batch dimension
  3. Encodes the mel spectrogram through the audio encoder unless the input is already encoded. Pre-encoded input is recognized by its last two dimensions matching the encoder output shape (n_audio_ctx, n_audio_state)
  4. Creates an initial token tensor containing a single SOT (start-of-transcript) token per batch element
  5. Runs a single decoder forward pass to get logits for the next token position
  6. Suppresses all non-language tokens by setting their logits to negative infinity
  7. Computes argmax over the remaining logits to get the predicted language token per batch element
  8. Computes softmax to get the full probability distribution over language tokens
  9. Maps token IDs back to language code strings using the tokenizer
  10. Returns the language token tensor and the list of probability dictionaries

The function is decorated with @torch.no_grad() to disable gradient computation, reducing memory usage and improving speed during inference.
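
Steps 6 through 9 reduce to a mask, an argmax, and a softmax. The sketch below reproduces that arithmetic in plain Python on a toy vocabulary so it can be followed without loading a model. pick_language, the five-token vocabulary, and the logit values are all illustrative assumptions; the real function operates on batched tensors.

```python
import math
from typing import Dict, List, Tuple


def pick_language(
    logits: List[float], language_tokens: Dict[int, str]
) -> Tuple[int, Dict[str, float]]:
    """Mask non-language logits, then take the argmax and softmax (illustrative only)."""
    # Step 6: suppress every token that is not a language token.
    masked = [x if i in language_tokens else -math.inf for i, x in enumerate(logits)]
    # Step 7: the argmax over the masked logits is the predicted language token.
    best = max(range(len(masked)), key=masked.__getitem__)
    # Step 8: softmax; suppressed entries contribute exp(-inf) == 0.0.
    peak = masked[best]
    weights = [math.exp(x - peak) for x in masked]
    total = sum(weights)
    # Step 9: map token IDs back to language codes.
    probs = {code: weights[i] / total for i, code in language_tokens.items()}
    return best, probs


# Toy vocabulary: IDs 1 to 3 are language tokens, 0 and 4 are ordinary tokens.
logits = [5.0, 1.0, 2.0, 0.0, 9.0]
language_tokens = {1: "en", 2: "fr", 3: "de"}
token, probs = pick_language(logits, language_tokens)
print(token, max(probs, key=probs.get))
```

Token 4 has the largest raw logit, yet it can never be selected, because masking happens before the argmax. Masking before the softmax is also what makes the language probabilities sum to 1.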

Example

import whisper

# Load a multilingual model
model = whisper.load_model("base")

# Preprocess audio
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect language (using model method shortcut)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Print top 5 languages by probability
for lang, prob in sorted(probs.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {lang}: {prob:.4f}")

# Batched detection
import torch
batch = torch.stack([mel, mel]).to(model.device)  # batch of 2
tokens, probs_list = whisper.detect_language(model, batch)
for i, probs in enumerate(probs_list):
    detected = max(probs, key=probs.get)
    print(f"Audio {i}: {detected} ({probs[detected]:.4f})")

Notes

  • This function only works with multilingual models. Calling it with an English-only model (e.g., tiny.en) will raise a ValueError
  • The mel spectrogram should be padded or trimmed to 30 seconds (3000 frames) before passing to this function
  • Language detection needs one encoder pass and a single decoder step, so it is much cheaper than full transcription, which decodes token by token
  • The @torch.no_grad() decorator ensures no gradient tensors are stored, minimizing memory usage
  • The function can accept pre-encoded audio features (output of the encoder) to avoid redundant encoding when the encoder output is reused for subsequent transcription
  • For the most accurate results, the audio should contain actual speech rather than silence or music
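
The 3000-frame figure in the notes follows from the audio framing parameters. The arithmetic is worked through below in plain Python. The constants are assumed to match those in Whisper's audio module (16 kHz sampling, a hop of 160 samples, 30-second chunks), and padding_needed is a hypothetical helper.

```python
# Assumed framing constants (16 kHz audio, 10 ms hop, 30-second window).
SAMPLE_RATE = 16_000  # samples per second
HOP_LENGTH = 160      # samples between consecutive spectrogram frames
CHUNK_LENGTH = 30     # seconds per window

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # samples in one 30-second window
N_FRAMES = N_SAMPLES // HOP_LENGTH      # spectrogram frames in one window


def padding_needed(num_samples: int) -> int:
    """Samples of silence to append (positive) or samples to drop (negative)."""
    return N_SAMPLES - num_samples


print(N_SAMPLES, N_FRAMES)
print(padding_needed(10 * SAMPLE_RATE))  # a 10-second clip
```

A clip shorter than the window is padded with silence up to N_SAMPLES, and a longer one is cut, which is why the function always sees the same number of frames.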

Metadata

Principle:Openai_Whisper_Language_Detection Environment:Openai_Whisper_PyTorch_CUDA 2025-06-25 00:00 GMT
