Workflow:Openai Whisper Language Detection And Decoding

From Leeroopedia
Knowledge Sources
Domains Speech_Recognition, Language_Identification, Audio_Processing
Last Updated 2025-06-25 00:00 GMT

Overview

Low-level pipeline for processing a single audio segment through explicit audio preprocessing, language identification, and controlled decoding using Whisper's granular API.

Description

This workflow demonstrates the lower-level Whisper API for users who need fine-grained control over individual processing stages rather than the high-level transcribe() function. It covers manual audio loading and preprocessing, explicit mel spectrogram computation, language detection as a standalone operation, and single-segment decoding with configurable options. This is useful for building custom pipelines, integrating Whisper into larger systems, processing pre-segmented audio, or performing language identification without full transcription.

Key capabilities:

  • Manual audio preprocessing with explicit control over each stage
  • Standalone language identification across 99+ languages
  • Single-segment decoding with full control over decoding parameters
  • Direct access to decoding metadata (log probabilities, compression ratio, no-speech probability)
  • Suitable for building custom streaming or batched processing pipelines

Usage

Execute this workflow when you need control beyond what model.transcribe() provides: processing pre-segmented audio clips, performing language identification only, customizing the decoding strategy per segment, integrating with a custom VAD (Voice Activity Detection) pipeline, or building a streaming transcription system that processes segments individually.
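The six steps described below can be strung together in a few lines. The following is a minimal sketch rather than a definitive implementation: it assumes a recent openai-whisper release (one whose log_mel_spectrogram accepts n_mels), the model name "base" and the helper name detect_and_decode are illustrative choices, and the import is deferred so the function can be defined without the package installed.

```python
def detect_and_decode(path, model_name="base"):
    """Run the low-level pipeline on one clip; returns (language_code, text).

    `detect_and_decode` and the default model name are illustrative
    assumptions, not part of the Whisper API.
    """
    import whisper  # deferred import: needs the openai-whisper package and ffmpeg

    model = whisper.load_model(model_name)

    audio = whisper.load_audio(path)        # Step 1: mono, 16 kHz, float32
    audio = whisper.pad_or_trim(audio)      # Step 2: exactly 30 s of samples

    # Step 3: log-Mel spectrogram with the band count the model expects
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

    # Step 4: language identification only
    _, probs = model.detect_language(mel)
    language = max(probs, key=probs.get)

    # Step 5: decoding options (half precision only when running on CUDA)
    options = whisper.DecodingOptions(language=language,
                                      fp16=(model.device.type == "cuda"))

    # Step 6: decode the single 30-second segment
    result = whisper.decode(model, mel, options)
    return language, result.text
```

Stopping after Step 4 gives language identification without transcription; passing the detected language into DecodingOptions avoids a second detection pass during decoding.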

Execution Steps

Step 1: Audio Loading

Load the audio file using ffmpeg and convert it to the expected format: a mono waveform at 16 kHz sample rate in float32 representation. This step handles any audio format that ffmpeg supports and performs resampling and channel mixing as needed.

Key considerations:

  • ffmpeg must be installed and available in PATH
  • Any audio format supported by ffmpeg can be used as input
  • Output is always mono, 16 kHz, float32 normalized to [-1.0, 1.0]
  • The resulting NumPy array can be further sliced or manipulated before processing
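To make the conversion concrete, here is a rough sketch of how such a loader can be built on the ffmpeg command line. The function names are hypothetical and the flag set is an assumption modelled on common ffmpeg usage (raw signed 16-bit little-endian PCM written to stdout); Whisper's own load_audio should be preferred in practice.

```python
import subprocess

import numpy as np

SAMPLE_RATE = 16000  # Hz, the rate Whisper expects


def ffmpeg_command(path, sr=SAMPLE_RATE):
    """Build an ffmpeg call that decodes `path` to mono 16-bit PCM on stdout."""
    return [
        "ffmpeg", "-nostdin",
        "-i", path,
        "-f", "s16le",           # raw samples, no container
        "-ac", "1",              # mix down to one channel
        "-acodec", "pcm_s16le",  # signed 16-bit little-endian
        "-ar", str(sr),          # resample to the target rate
        "-",                     # write to stdout
    ]


def pcm16_to_float32(raw):
    """Convert raw little-endian int16 bytes to float32 in [-1.0, 1.0)."""
    return np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0


def load_audio_sketch(path, sr=SAMPLE_RATE):
    """Decode any ffmpeg-readable file to a mono float32 waveform."""
    out = subprocess.run(ffmpeg_command(path, sr),
                         capture_output=True, check=True).stdout
    return pcm16_to_float32(out)
```

Because the result is an ordinary NumPy array, a clip can be cut out with plain slicing, for example `audio[5 * SAMPLE_RATE : 12 * SAMPLE_RATE]` for seconds 5 to 12.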

Step 2: Audio Padding and Trimming

Pad or trim the audio waveform to exactly 30 seconds (480,000 samples at 16 kHz), which is the fixed input size expected by the Whisper encoder. Short audio is zero-padded on the right; long audio is truncated.

Key considerations:

  • The encoder always expects exactly 30 seconds of audio (480,000 samples)
  • For longer audio, this step should be applied per-segment in a custom sliding window
  • Works with both NumPy arrays and PyTorch tensors
  • Zero-padding is the expected input form for short segments and generally does not degrade transcription quality
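The padding rule and the per-segment use on longer audio can be illustrated with NumPy alone. This is a sketch under simplifying assumptions: pad_or_trim_sketch and iter_windows are hypothetical names, and the windows are fixed and non-overlapping, which is only one possible windowing policy.

```python
import numpy as np

SAMPLE_RATE = 16000
N_SAMPLES = 30 * SAMPLE_RATE  # 480,000 samples = 30 seconds


def pad_or_trim_sketch(audio, length=N_SAMPLES):
    """Truncate, or zero-pad on the right, along the last axis."""
    if audio.shape[-1] > length:
        return audio[..., :length]
    pad = length - audio.shape[-1]
    widths = [(0, 0)] * (audio.ndim - 1) + [(0, pad)]
    return np.pad(audio, widths)


def iter_windows(audio, length=N_SAMPLES):
    """Yield (start_time_in_seconds, 30-second window) pairs.

    Assumption: fixed, non-overlapping windows; the final window is padded.
    """
    for start in range(0, audio.shape[-1], length):
        chunk = audio[..., start:start + length]
        yield start / SAMPLE_RATE, pad_or_trim_sketch(chunk, length)
```

A 70-second recording therefore produces three windows starting at 0, 30, and 60 seconds, the last of which carries 10 seconds of audio followed by 20 seconds of zeros.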

Step 3: Mel Spectrogram Computation

Compute the log-Mel spectrogram from the preprocessed audio waveform. This transforms the time-domain signal into a frequency-domain representation using STFT followed by Mel filterbank projection and log scaling.

What happens:

  • Short-Time Fourier Transform with a 400-sample window (25 ms) and a 160-sample hop (10 ms)
  • Magnitude squaring of complex STFT output
  • Projection through pre-computed Mel filterbank (80 or 128 bands)
  • Log10 scaling with clamping and normalization
  • Result is a tensor of shape (n_mels, 3000) for a 30-second segment
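The stages listed above can be sketched in NumPy to show where the (n_mels, frames) shape comes from. This is an illustration, not Whisper's code: log_mel_sketch is a hypothetical name, the filterbank is passed in by the caller, and the demo uses a random placeholder matrix rather than the precomputed Mel filters that Whisper ships. The clamping constants (a floor of 1e-10, an 8.0 dynamic-range limit, and the (x + 4) / 4 rescaling) follow my reading of the reference implementation and should be treated as assumptions.

```python
import numpy as np

N_FFT = 400       # 25 ms window at 16 kHz
HOP_LENGTH = 160  # 10 ms hop at 16 kHz


def log_mel_sketch(audio, filters):
    """Return a log-Mel spectrogram of shape (n_mels, len(audio) // HOP_LENGTH).

    `filters` has shape (n_mels, N_FFT // 2 + 1).
    """
    window = np.hanning(N_FFT + 1)[:-1]                  # periodic Hann window
    padded = np.pad(audio, N_FFT // 2, mode="reflect")   # centre each frame
    frames = np.lib.stride_tricks.sliding_window_view(padded, N_FFT)[::HOP_LENGTH]
    stft = np.fft.rfft(frames * window, axis=-1)         # (n_frames + 1, 201)
    power = (np.abs(stft) ** 2).T[:, :-1]                # (201, n_frames)
    mel = filters @ power                                # (n_mels, n_frames)
    log_spec = np.log10(np.clip(mel, 1e-10, None))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)  # limit dynamic range
    return (log_spec + 4.0) / 4.0


rng = np.random.default_rng(0)
demo_audio = rng.standard_normal(16000).astype(np.float32)  # 1 s of noise
demo_filters = rng.random((80, N_FFT // 2 + 1))  # placeholder, NOT Whisper's filterbank
demo = log_mel_sketch(demo_audio, demo_filters)  # shape (80, 100)
```

With a 160-sample hop, one second of audio gives 100 frames, so a full 480,000-sample segment gives the 3000 frames mentioned above.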

Step 4: Language Detection

Pass the mel spectrogram through the audio encoder, then use a single decoder step with the start-of-transcript token to obtain language token probabilities. The most probable language token identifies the spoken language.

What happens:

  • Audio encoder produces feature representations from the mel spectrogram
  • Decoder receives a single start-of-transcript token
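  • Logits for every non-language token are masked to negative infinity, so only language tokens can be selected
  • The argmax over the masked logits gives the predicted language token, and a softmax over them gives the probability of each language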
  • Returns both the predicted language token and the full probability distribution
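The masking step is easy to see on a toy vocabulary. The sketch below is a NumPy illustration with made-up token IDs and logits; pick_language is a hypothetical helper, and in real use model.detect_language(mel) performs this selection internally.

```python
import numpy as np


def pick_language(logits, language_token_ids, language_codes):
    """Mask non-language logits, then return (best_token_id, {code: probability})."""
    mask = np.ones(logits.shape[-1], dtype=bool)
    mask[list(language_token_ids)] = False        # False marks language tokens
    masked = np.where(mask, -np.inf, logits)      # everything else can never win

    best_token = int(masked.argmax())

    # Softmax over the masked logits; masked entries contribute exp(-inf) = 0
    exp = np.exp(masked - masked.max())
    probs = exp / exp.sum()
    by_code = {code: float(probs[tok])
               for tok, code in zip(language_token_ids, language_codes)}
    return best_token, by_code


# Toy vocabulary of 5 tokens; IDs 1-3 stand in for language tokens (hypothetical).
toy_logits = np.array([5.0, 1.0, 3.0, 2.0, 9.0])
token, lang_probs = pick_language(toy_logits, [1, 2, 3], ["en", "de", "fr"])
best_code = max(lang_probs, key=lang_probs.get)
```

Token 4 has the largest raw logit in the toy example, but because it is not a language token it is masked out and cannot be selected.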

Step 5: Decoding Options Configuration

Configure the decoding behavior through the DecodingOptions dataclass. This controls the task (transcribe vs. translate), sampling strategy (greedy vs. beam search), temperature, token suppression, timestamp handling, and precision settings.

Key parameters:

  • task: "transcribe" for same-language recognition, "translate" for X→English
  • temperature: 0 for greedy decoding, >0 for sampling
  • beam_size: Number of beams for beam search (mutually exclusive with best_of)
  • best_of: Number of candidates for sampling (requires temperature > 0)
  • suppress_tokens: Token IDs to suppress during generation
  • without_timestamps: Disable timestamp token generation
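  • fp16: Use half precision for faster inference (defaults to True in DecodingOptions; set it to False when running on CPU, where half precision is generally unsupported or slow)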
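The compatibility rules among these parameters can be checked before the options object is built. The following sketch uses two hypothetical helpers, build_decoding_kwargs and make_options; the validation mirrors the constraints listed above, and only the final call touches the real whisper.DecodingOptions dataclass.

```python
def build_decoding_kwargs(task="transcribe", language=None, temperature=0.0,
                          beam_size=None, best_of=None,
                          without_timestamps=False, fp16=True):
    """Validate a parameter combination and return it as a kwargs dict."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    if beam_size is not None and best_of is not None:
        raise ValueError("beam_size and best_of are mutually exclusive")
    if best_of is not None and temperature == 0:
        raise ValueError("best_of requires temperature > 0 (sampling)")
    return {
        "task": task,
        "language": language,
        "temperature": temperature,
        "beam_size": beam_size,
        "best_of": best_of,
        "without_timestamps": without_timestamps,
        "fp16": fp16,
    }


def make_options(**kwargs):
    """Turn validated kwargs into a DecodingOptions instance."""
    import whisper  # deferred import: requires the openai-whisper package
    return whisper.DecodingOptions(**build_decoding_kwargs(**kwargs))
```

For instance, `make_options(beam_size=5, fp16=False)` describes beam search on CPU, while `make_options(temperature=0.7, best_of=5)` describes sampling with five candidates.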

Step 6: Single Segment Decoding

Run the decoding task on the mel spectrogram with the configured options. The decoder generates text tokens autoregressively, applying logit filters for timestamp rules, blank suppression, and token suppression at each step. The result includes the decoded text, token sequence, and quality metrics.

What happens:

  • Audio features are extracted by the encoder (or reused if pre-encoded)
  • Initial tokens are constructed from the start-of-transcript sequence (SOT, language, task), plus the no-timestamps token when timestamps are disabled
  • Autoregressive generation with KV caching for efficiency
  • Logit filters enforce timestamp pairing rules and suppress forbidden tokens
  • The greedy decoder takes the argmax at temperature 0 (or samples at higher temperatures); beam search instead maintains multiple hypotheses
  • Result includes text, tokens, average log probability, no-speech probability, compression ratio
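The quality metrics on the result are most useful when a custom pipeline acts on them. The sketch below shows one way to do that, under stated assumptions: assess is a hypothetical helper, the thresholds are borrowed from the defaults of the high-level transcribe() function, and the compression ratio is recomputed with zlib as raw UTF-8 size over compressed size. A SimpleNamespace stands in for the result of whisper.decode(model, mel, options), which exposes attributes with the same names.

```python
import zlib
from types import SimpleNamespace


def compression_ratio(text):
    """Raw UTF-8 size divided by zlib-compressed size; high values mean repetition."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def assess(result, logprob_threshold=-1.0, no_speech_threshold=0.6,
           compression_ratio_threshold=2.4):
    """Classify one decoded segment as 'silence', 'repetitive', 'low_confidence' or 'ok'."""
    if (result.no_speech_prob > no_speech_threshold
            and result.avg_logprob < logprob_threshold):
        return "silence"
    if result.compression_ratio > compression_ratio_threshold:
        return "repetitive"
    if result.avg_logprob < logprob_threshold:
        return "low_confidence"
    return "ok"


# Stand-in objects with the same attribute names as a decoding result.
good = SimpleNamespace(avg_logprob=-0.2, no_speech_prob=0.05, compression_ratio=1.3)
quiet = SimpleNamespace(avg_logprob=-1.5, no_speech_prob=0.9, compression_ratio=0.8)
looped = SimpleNamespace(avg_logprob=-0.4, no_speech_prob=0.1,
                         compression_ratio=compression_ratio("la " * 100))
```

A segment classified as "repetitive" or "low_confidence" is a natural candidate for re-decoding with a higher temperature, which is the kind of per-segment strategy this low-level workflow makes possible.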
