
Principle:Openai Whisper Median Filtering

From Leeroopedia

Overview

Median Filtering is a non-linear signal processing technique used to smooth noisy data by replacing each value with the median of its neighboring values within a sliding window. In Whisper's word-level timestamp pipeline, median filtering is applied to cross-attention weight matrices along the time dimension to reduce frame-level noise while preserving the overall alignment pattern between text tokens and audio frames.

Domain

  • Signal Processing
  • Speech Recognition

How Median Filtering Works

A median filter slides a window of fixed odd width k across a signal. At each position, the values inside the window are sorted, and the output at the window's center is set to the median (the middle value of the sorted window).

For a 1D signal x with filter width k = 2w + 1 (where w is the half-width):

y[i] = median(x[i-w], x[i-w+1], ..., x[i], ..., x[i+w-1], x[i+w])

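Near the boundaries the window extends past the ends of the signal, so the signal is first extended, typically with reflect padding (mirroring the samples next to each edge), to limit edge artifacts.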
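The definition above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not Whisper's implementation: the helper names reflect_pad and median_filter_1d are hypothetical, and the padding is assumed to mirror about the edge samples without repeating them.

```python
from statistics import median


def reflect_pad(x, w):
    """Extend x by w samples per side, mirroring about the edge samples.

    The edge sample itself is not repeated (assumed padding convention).
    """
    if w >= len(x):
        raise ValueError("half-width must be smaller than the signal length")
    left = x[1:w + 1][::-1]
    right = x[-w - 1:-1][::-1]
    return left + x + right


def median_filter_1d(x, k):
    """Return y with y[i] = median(x[i-w], ..., x[i+w]) for k = 2w + 1."""
    if k < 1 or k % 2 == 0:
        raise ValueError("filter width k must be a positive odd integer")
    w = k // 2
    padded = reflect_pad(list(x), w)
    # Window i of the padded signal is centered on original sample i.
    return [median(padded[i:i + k]) for i in range(len(x))]


signal = [1, 1, 9, 1, 1, 5, 5, 5]
print(median_filter_1d(signal, 3))  # [1, 1, 1, 1, 1, 5, 5, 5]
```

In this example the isolated 9 is removed, while the step from 1 to 5 stays exactly where it was.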

Properties

  • Edge-preserving: Unlike mean/Gaussian filters, median filters preserve sharp transitions and edges in the signal. This is critical for attention weights where sharp boundaries between "attending" and "not attending" correspond to word boundaries.
  • Impulse noise removal: Median filters are particularly effective at removing salt-and-pepper noise (isolated outlier values), which commonly appears as sporadic high-attention spikes in individual frames.
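  • Non-linear: The median does not obey superposition (the filter of a sum is generally not the sum of the filters), so it cannot be expressed as a convolution with a fixed kernel. This is why it can discard an outlier entirely, where a linear filter can only spread it across neighboring samples.
  • Idempotent on smooth signals: If a signal is already smooth (locally monotonic within every window), the median filter leaves it unchanged, so repeated passes have no further effect.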
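The non-linearity and fixed-point properties can be checked on small hand-made signals. The sketch below is illustrative only: medfilt3 is a hypothetical width-3 helper with reflect padding, and the test signals are invented for the demonstration.

```python
from statistics import median


def medfilt3(x):
    """Width-3 median filter with reflect padding (illustrative helper)."""
    padded = [x[1]] + list(x) + [x[-2]]
    return [median(padded[i:i + 3]) for i in range(len(x))]


# Non-linearity: two isolated spikes are each removed on their own,
# but their sum forms a two-sample plateau that survives filtering.
a = [0, 0, 4, 0, 0, 0, 0]
b = [0, 0, 0, 4, 0, 0, 0]
summed = [p + q for p, q in zip(a, b)]

filter_of_sum = medfilt3(summed)
sum_of_filters = [p + q for p, q in zip(medfilt3(a), medfilt3(b))]
print(filter_of_sum)   # [0, 0, 4, 4, 0, 0, 0]
print(sum_of_filters)  # [0, 0, 0, 0, 0, 0, 0]

# Fixed point: a monotonic ramp passes through unchanged. The flat ends
# keep the reflect-padded boundary windows monotonic as well.
ramp = [0, 0, 1, 2, 3, 3, 3, 4, 4]
print(medfilt3(ramp) == ramp)  # True
```

Note that with reflect padding a strictly rising first sample would be altered, because its window mirrors the second sample onto both sides; the unchanged-signal property applies where every window is monotonic.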

Application in Whisper

In Whisper's cross-attention alignment pipeline, median filtering serves a specific purpose:

  1. Input: Cross-attention weights from the alignment heads, after z-score normalization. These weights have shape (num_heads, num_tokens, num_frames).
  2. Filtering axis: The median filter is applied along the time (frame) dimension, i.e. the last axis of the attention matrix.
  3. Window width: The default width is 7 frames (configurable via the medfilt_width parameter).
  4. Output: Smoothed attention weights with reduced noise, ready for head averaging and DTW alignment.
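The filtering and head-averaging steps above can be sketched on nested lists indexed as [head][token][frame]. This is a rough, dependency-free illustration rather than Whisper's actual code: the function names are hypothetical, the input is assumed to be z-score normalized already, and returning short rows unfiltered is a choice made for this sketch.

```python
from statistics import fmean, median


def median_filter_1d(row, width):
    """Odd-width median filter with reflect padding along one row."""
    w = width // 2
    if w == 0 or len(row) <= w:
        # Too short to reflect-pad: return unchanged (sketch-level choice).
        return list(row)
    row = list(row)
    padded = row[1:w + 1][::-1] + row + row[-w - 1:-1][::-1]
    return [median(padded[i:i + width]) for i in range(len(row))]


def smooth_and_average(weights, medfilt_width=7):
    """Filter each (head, token) row along frames, then average over heads.

    weights: nested list of shape (num_heads, num_tokens, num_frames).
    Returns a nested list of shape (num_tokens, num_frames).
    """
    filtered = [[median_filter_1d(row, medfilt_width) for row in head]
                for head in weights]
    num_heads = len(filtered)
    num_tokens = len(filtered[0])
    num_frames = len(filtered[0][0])
    return [[fmean(filtered[h][t][f] for h in range(num_heads))
             for f in range(num_frames)]
            for t in range(num_tokens)]


# Two heads, one token, nine frames: head 0 has a single-frame spike.
weights = [
    [[0, 0, 0, 0, 5, 0, 0, 0, 0]],
    [[1, 1, 1, 1, 1, 1, 1, 1, 1]],
]
matrix = smooth_and_average(weights)
print(matrix)  # [[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]]
```

The resulting (num_tokens, num_frames) matrix is what a DTW step would then consume; that step is outside the scope of this sketch.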

Why Median Filtering for Attention Weights

Cross-attention weights can exhibit:

  • Frame-level jitter: Individual frames may have spuriously high or low attention values due to acoustic noise or model uncertainty.
  • Multi-modal attention: A token may attend to multiple frames, creating noisy patterns.
  • Numerical artifacts: Floating-point precision issues can introduce small outliers.

Median filtering suppresses these short-lived deviations while preserving the fundamental alignment pattern: the monotonically increasing "diagonal" of high attention that DTW needs to track.

Comparison with Other Smoothing Methods

Method           | Edge Preservation | Outlier Robustness | Computational Cost
Median Filter    | Excellent         | Excellent          | Moderate (sort-based)
Mean Filter      | Poor              | Poor               | Low
Gaussian Filter  | Moderate          | Poor               | Low
Bilateral Filter | Excellent         | Moderate           | High
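The median and mean rows of the comparison can be illustrated on a toy signal that contains one outlier and one step edge. This sketch is only a demonstration under assumed inputs: sliding_filter is a hypothetical helper, the signal is invented, and the Gaussian and bilateral filters are not covered.

```python
from statistics import fmean, median


def sliding_filter(x, width, reduce_fn):
    """Apply reduce_fn to each odd-width window of x (reflect padding)."""
    w = width // 2
    padded = x[1:w + 1][::-1] + x + x[-w - 1:-1][::-1]
    return [reduce_fn(padded[i:i + width]) for i in range(len(x))]


# An isolated spike (9) followed by a genuine step from 0 to 6.
signal = [0, 0, 0, 9, 0, 0, 6, 6, 6, 6]

median_out = sliding_filter(signal, 3, median)
mean_out = sliding_filter(signal, 3, fmean)

print(median_out)  # [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
print(mean_out)    # [0.0, 0.0, 3.0, 3.0, 3.0, 2.0, 4.0, 6.0, 6.0, 6.0]
```

The median output drops the spike and keeps the step sharp, while the mean output smears the spike over three samples and turns the step into a ramp.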

Implementation

Implementation:Openai_Whisper_Median_Filter

Metadata

2025-06-25 00:00 GMT
