Principle:Openai Whisper Median Filtering
Overview
Median Filtering is a non-linear signal processing technique used to smooth noisy data by replacing each value with the median of its neighboring values within a sliding window. In Whisper's word-level timestamp pipeline, median filtering is applied to cross-attention weight matrices along the time dimension to reduce frame-level noise while preserving the overall alignment pattern between text tokens and audio frames.
Domain
- Signal Processing
- Speech Recognition
How Median Filtering Works
A median filter slides a window of fixed odd width k across a signal. At each position, the values within the window are sorted, and the value at the window's center is replaced by the median (middle value) of the sorted window.
For a 1D signal x with filter width k = 2w + 1 (where w is the half-width):
y[i] = median(x[i-w], x[i-w+1], ..., x[i], ..., x[i+w-1], x[i+w])
At boundaries, the signal is typically extended using reflect padding to avoid edge artifacts.
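The definition above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Whisper's implementation: the names reflect_pad and median_filter_1d are hypothetical, and "reflect padding" is taken to mean mirroring about the edge sample without repeating it.

```python
from statistics import median


def reflect_pad(x, w):
    """Extend x by w samples per side, mirroring about the edge samples.

    Assumed convention: the edge sample itself is not repeated,
    e.g. [a, b, c] with w=1 becomes [b, a, b, c, b].
    """
    if w >= len(x):
        raise ValueError("half-width must be smaller than the signal length")
    left = [x[i] for i in range(w, 0, -1)]        # x[w], ..., x[1]
    right = [x[-1 - i] for i in range(1, w + 1)]  # x[-2], ..., x[-1-w]
    return left + list(x) + right


def median_filter_1d(x, k):
    """Return y with y[i] = median(x[i-w], ..., x[i+w]), where k = 2w + 1."""
    if k < 1 or k % 2 != 1:
        raise ValueError("filter width k must be a positive odd integer")
    w = k // 2
    padded = reflect_pad(x, w)
    # Window i of the padded signal is centered on original sample i.
    return [median(padded[i:i + k]) for i in range(len(x))]


signal = [1, 2, 100, 3, 4]
filtered = median_filter_1d(signal, 3)
print(filtered)  # [2, 2, 3, 4, 3]
```

The outlier 100 disappears from the output. The first sample changes from 1 to 2 because its window, after reflection, is (2, 1, 2); this shows that padding choices affect values near the boundaries.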
Properties
- Edge-preserving: Unlike mean/Gaussian filters, median filters preserve sharp transitions and edges in the signal. This is critical for attention weights where sharp boundaries between "attending" and "not attending" correspond to word boundaries.
- Impulse noise removal: Median filters are particularly effective at removing salt-and-pepper noise (isolated outlier values), which commonly appears as sporadic high-attention spikes in individual frames.
- Non-linear: The median operation is non-linear, meaning it cannot be represented as a convolution. This gives it unique denoising properties compared to linear filters.
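- Idempotent on smooth signals: If a signal is already smooth (monotonic within every window, such as a flat region or a clean step), the median of each window is its center value, so the filter leaves the signal unchanged and applying it again has no further effect. Samples next to a padded boundary can be an exception, because the padding may break monotonicity there.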
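The properties listed above can be checked on small hand-made signals. The sketch below is illustrative only: sliding_filter is a hypothetical helper that applies either a median or a mean over a width-3 window with reflect padding, and the test signals are invented for the demonstration.

```python
from statistics import mean, median


def sliding_filter(x, k, reduce):
    """Apply `reduce` to each width-k window of list x (k odd, reflect padding)."""
    w = k // 2
    padded = x[w:0:-1] + x + x[-2:-w - 2:-1]
    return [reduce(padded[i:i + k]) for i in range(len(x))]


# Edge preservation: a clean step survives the median filter unchanged,
# while the mean filter turns the jump into a ramp.
step = [0, 0, 0, 0, 1, 1, 1, 1]
median_step = sliding_filter(step, 3, median)
mean_step = sliding_filter(step, 3, mean)

# Impulse removal: an isolated spike is removed by the median filter,
# while the mean filter smears it over three samples.
spike = [0, 0, 5, 0, 0]
median_spike = sliding_filter(spike, 3, median)
mean_spike = sliding_filter(spike, 3, mean)

# Non-linearity: filtering a sum differs from summing the filtered parts.
a = [0, 0, 0, 5, 0, 0, 0]
b = [0, 0, 5, 0, 0, 0, 0]
a_plus_b = [p + q for p, q in zip(a, b)]
filtered_sum = sliding_filter(a_plus_b, 3, median)
sum_of_filtered = [
    p + q
    for p, q in zip(sliding_filter(a, 3, median), sliding_filter(b, 3, median))
]

print(median_step == step)              # True
print(median_spike)                     # [0, 0, 0, 0, 0]
print(filtered_sum == sum_of_filtered)  # False
```

In the last case each spike is removed when filtered alone, but the two adjacent spikes in the sum form a plateau two samples wide, which a width-3 median keeps.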
Application in Whisper
In Whisper's cross-attention alignment pipeline, median filtering serves a specific purpose:
- Input: Raw cross-attention weights from alignment heads, after z-score normalization. These weights have shape (num_heads, num_tokens, num_frames).
- Filtering axis: The median filter is applied along the time (frame) dimension -- the last axis of the attention matrix.
- Window width: Default width is 7 frames (configurable via the medfilt_width parameter).
- Output: Smoothed attention weights with reduced noise, ready for head averaging and DTW alignment.
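The steps listed above (filter along the frame axis, then average over heads) can be sketched with nested Python lists standing in for a tensor of shape (num_heads, num_tokens, num_frames). This is a hedged illustration rather than Whisper's code: smooth_attention and average_heads are hypothetical names, the input is assumed to be already normalized, returning short rows unfiltered is a choice made for this sketch, and the example uses a width of 3 instead of the default 7 to keep the data small.

```python
from statistics import mean, median


def median_filter_1d(row, width):
    """Median-filter one row of per-frame values using reflect padding."""
    if width % 2 != 1:
        raise ValueError("width must be odd")
    w = width // 2
    if w == 0 or len(row) <= w:
        return list(row)  # sketch choice: too short to reflect-pad, return as is
    padded = row[w:0:-1] + row + row[-2:-w - 2:-1]
    return [median(padded[i:i + width]) for i in range(len(row))]


def smooth_attention(weights, medfilt_width=7):
    """Filter weights[head][token][frame] along the last (frame) axis only."""
    return [
        [median_filter_1d(token_row, medfilt_width) for token_row in head]
        for head in weights
    ]


def average_heads(weights):
    """Collapse the head axis: result[token][frame] is the mean over heads."""
    num_tokens = len(weights[0])
    num_frames = len(weights[0][0])
    return [
        [mean(head[t][f] for head in weights) for f in range(num_frames)]
        for t in range(num_tokens)
    ]


# Two heads, two tokens, eight frames. Head 0 carries one spurious spike per token.
weights = [
    [[1.0, 1.0, 1.0, 0.0, 0.0, 5.0, 0.0, 0.0],
     [0.0, 0.0, 5.0, 0.0, 0.0, 1.0, 1.0, 1.0]],
    [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]],
]

smoothed = smooth_attention(weights, medfilt_width=3)
matrix = average_heads(smoothed)  # shape (num_tokens, num_frames)
for token_row in matrix:
    print(token_row)
```

After filtering, head 0 matches the clean head 1, so the averaged matrix shows token 0 attending to the first three frames and token 1 to the last three. A matrix of this kind is what a subsequent alignment step would consume.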
Why Median Filtering for Attention Weights
Cross-attention weights can exhibit:
- Frame-level jitter: Individual frames may have spuriously high or low attention values due to acoustic noise or model uncertainty.
- Multi-modal attention: A token may attend to multiple frames, creating noisy patterns.
- Numerical artifacts: Floating-point precision issues can introduce small outliers.
Median filtering suppresses short-lived deviations of these kinds (those narrower than about half the window) while preserving the fundamental alignment pattern -- the monotonically increasing "diagonal" of high attention that DTW needs to track.
Comparison with Other Smoothing Methods
| Method | Edge Preservation | Outlier Robustness | Computational Cost |
|---|---|---|---|
| Median Filter | Excellent | Excellent | Moderate (sort-based) |
| Mean Filter | Poor | Poor | Low |
| Gaussian Filter | Moderate | Poor | Low |
| Bilateral Filter | Excellent | Moderate | High |
Implementation
Implementation:Openai_Whisper_Median_Filter