Principle:Openai Whisper Median Filtering

Overview

Median Filtering is a non-linear signal processing technique used to smooth noisy data by replacing each value with the median of its neighboring values within a sliding window. In Whisper's word-level timestamp pipeline, median filtering is applied to cross-attention weight matrices along the time dimension to reduce frame-level noise while preserving the overall alignment pattern between text tokens and audio frames.

Domain

Signal Processing
Speech Recognition

How Median Filtering Works

A median filter slides a window of fixed odd width k across a signal. At each position, the values inside the window are sorted, and the output at the window's center is the median (middle value) of the sorted values.

For a 1D signal x with filter width k = 2w + 1 (where w is the half-width):

y[i] = median(x[i-w], x[i-w+1], ..., x[i], ..., x[i+w-1], x[i+w])

At boundaries, the signal is typically extended using reflect padding to avoid edge artifacts.
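The definition above can be sketched in plain Python. This is a minimal illustration, not Whisper's implementation: the names `reflect_pad` and `median_filter_1d` are hypothetical, and the reflect convention assumed here mirrors the signal about its edge samples without repeating them.

```python
from statistics import median


def reflect_pad(x, w):
    """Extend x by w samples per side, mirroring about the edge samples."""
    x = list(x)
    if w >= len(x):
        raise ValueError("reflect padding needs len(x) > half-width")
    left = x[w:0:-1]           # x[w], ..., x[1]
    right = x[-2:-w - 2:-1]    # x[n-2], ..., x[n-1-w]
    return left + x + right


def median_filter_1d(x, k):
    """Return y with y[i] = median(x[i-w], ..., x[i+w]) for odd k = 2w + 1."""
    if k < 1 or k % 2 != 1:
        raise ValueError("filter width k must be a positive odd integer")
    w = k // 2
    padded = reflect_pad(x, w)
    # padded[i : i + k] is the window centered on original sample i.
    return [median(padded[i:i + k]) for i in range(len(x))]


signal = [0, 0, 9, 0, 0, 1, 1, 1]
smoothed = median_filter_1d(signal, 3)
print(smoothed)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

The isolated 9 never reaches the middle of any sorted window, so it disappears, while the step from 0 to 1 stays where it was.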

Properties

Edge-preserving: Unlike mean or Gaussian filters, median filters preserve sharp transitions and edges in the signal. This matters for attention weights, where sharp boundaries between "attending" and "not attending" correspond to word boundaries.
Impulse noise removal: Median filters are particularly effective at removing salt-and-pepper noise (isolated outlier values), which commonly appears as sporadic high-attention spikes in individual frames.
Non-linear: The median operation is non-linear, meaning it cannot be represented as a convolution. This gives it unique denoising properties compared to linear filters.
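Invariant on locally monotonic signals: If a signal is already locally monotonic within every window (a so-called root signal), the median filter leaves it unchanged away from the boundaries. Near the boundaries, the padding scheme can still alter a few samples.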
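The properties listed above can be checked on toy data. The sketch below uses a hypothetical helper `medfilt` with reflect padding. The signals are made up for illustration and are not attention weights.

```python
from statistics import median


def medfilt(x, k):
    """Odd-width median filter with reflect padding (requires len(x) > k // 2)."""
    w = k // 2
    padded = x[w:0:-1] + x + x[-2:-w - 2:-1]
    return [median(padded[i:i + k]) for i in range(len(x))]


step = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]    # clean edge between samples 5 and 6
spiky = [0, 0, 5, 0, 0, 0, 1, 1, 1, 1]   # same edge plus one impulse

# Impulse removal and edge preservation: the spike vanishes, the edge stays.
despiked = medfilt(spiky, 3)
print(despiked == step)  # True

# A clean step is left unchanged by the filter.
print(medfilt(step, 3) == step)  # True

# Non-linearity: filtering a sum differs from summing the filtered parts.
a = [0, 0, 1, 0, 0, 0]
b = [0, 0, 0, 1, 0, 0]
filtered_sum = medfilt([p + q for p, q in zip(a, b)], 3)
sum_of_filtered = [p + q for p, q in zip(medfilt(a, 3), medfilt(b, 3))]
print(filtered_sum, sum_of_filtered)  # [0, 0, 1, 1, 0, 0] [0, 0, 0, 0, 0, 0]
```

Each impulse alone is narrower than half the window and is removed, but together they form a two-sample plateau that a width-3 filter keeps. A linear filter cannot behave this way.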

Application in Whisper

In Whisper's cross-attention alignment pipeline, median filtering serves a specific purpose:

Input: Raw cross-attention weights from alignment heads, after z-score normalization. These weights have shape (num_heads, num_tokens, num_frames).
Filtering axis: The median filter is applied along the time (frame) dimension -- the last axis of the attention matrix.
Window width: The default width is 7 frames (configurable via the medfilt_width parameter).
Output: Smoothed attention weights with reduced noise, ready for head averaging and DTW alignment.
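The steps above can be sketched on nested lists instead of tensors. This is an illustration under stated assumptions, not Whisper's code: `smooth_attention` is a hypothetical name, the z-score is assumed to be taken across the token axis for each head and frame, and the zero-variance guard exists only to keep the toy example safe.

```python
from statistics import fmean, median, pstdev


def median_filter_1d(x, k):
    """Odd-width median filter with reflect padding (requires len(x) > k // 2)."""
    w = k // 2
    padded = x[w:0:-1] + x + x[-2:-w - 2:-1]
    return [median(padded[i:i + k]) for i in range(len(x))]


def smooth_attention(weights, medfilt_width=7):
    """Map weights[head][token][frame] to a head-averaged matrix[token][frame]."""
    filtered = []
    for head in weights:
        # Assumption: normalize each frame column across the token axis.
        columns = list(zip(*head))
        means = [fmean(c) for c in columns]
        # "or 1.0" avoids division by zero; this guard is specific to the sketch.
        stds = [pstdev(c) or 1.0 for c in columns]
        normed = [[(v - m) / s for v, m, s in zip(row, means, stds)]
                  for row in head]
        # Median-filter every token row along the frame (last) axis.
        filtered.append([median_filter_1d(row, medfilt_width) for row in normed])
    # Average over heads, token by token and frame by frame.
    return [[fmean(vals) for vals in zip(*token_rows)]
            for token_rows in zip(*filtered)]


# Toy input: 2 heads, 2 tokens, 10 frames; the second head has one spurious spike.
head_clean = [[1] * 5 + [0] * 5, [0] * 5 + [1] * 5]
head_spiky = [[1, 1, 1, 1, 1, 0, 0, 3, 0, 0], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
matrix = smooth_attention([head_clean, head_spiky], medfilt_width=3)
print(matrix[0])  # [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0]
```

A width of 3 is used here only because the toy input has 10 frames. After filtering, both heads agree, so the averaged matrix shows a clean hand-over from the first token to the second at frame 5.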

Why Median Filtering for Attention Weights

Cross-attention weights can exhibit:

Frame-level jitter: Individual frames may have spuriously high or low attention values due to acoustic noise or model uncertainty.
Multi-modal attention: A token may attend to multiple frames, creating noisy patterns.
Numerical artifacts: Floating-point precision issues can introduce small outliers.

Median filtering suppresses isolated outliers of these kinds, as long as they are narrower than about half the window, while preserving the fundamental alignment pattern: the monotonically increasing "diagonal" of high attention that DTW needs to track.

Comparison with Other Smoothing Methods

Method	Edge Preservation	Outlier Robustness	Computational Cost
Median Filter	Excellent	Excellent	Moderate (sort-based)
Mean Filter	Poor	Poor	Low
Gaussian Filter	Moderate	Poor	Low
Bilateral Filter	Excellent	Moderate	High
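The first two rows of the table can be illustrated numerically. The sketch below runs a mean filter and a median filter of the same width over a step signal with one outlier. The helper name `sliding` and the data are hypothetical, and Gaussian and bilateral filters are not covered.

```python
from statistics import fmean, median


def sliding(x, k, reduce):
    """Apply `reduce` to every width-k window of x, using reflect padding."""
    w = k // 2
    padded = x[w:0:-1] + x + x[-2:-w - 2:-1]
    return [reduce(padded[i:i + k]) for i in range(len(x))]


clean = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
noisy = [0, 0, 5, 0, 0, 0, 1, 1, 1, 1]   # one outlier at index 2

med = sliding(noisy, 3, median)
avg = sliding(noisy, 3, fmean)


def max_error(y):
    """Largest absolute deviation from the clean signal."""
    return max(abs(a - b) for a, b in zip(y, clean))


# Outlier robustness: the median recovers the clean signal exactly,
# while the mean spreads the outlier over three samples.
print(max_error(med), round(max_error(avg), 3))  # 0 1.667

# Edge preservation: height of the jump between samples 5 and 6.
print(med[6] - med[5], round(avg[6] - avg[5], 3))  # 1 0.333
```

The mean filter both smears the outlier and flattens the edge to a third of its height, which is the behavior the "Poor" entries in the table refer to.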

Implementation

Implementation:Openai_Whisper_Median_Filter

Metadata

2025-06-25 00:00 GMT
