Implementation:Datajuicer Data juicer AudioNMFSNRFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete Data-Juicer operator that filters data samples by their audio signal-to-noise ratio (SNR), estimated with non-negative matrix factorization (NMF).
Description
AudioNMFSNRFilter is a filter operator that keeps data samples whose audio signal-to-noise ratios (SNRs) fall within a specified range. For each audio, it computes a spectrogram via the short-time Fourier transform (STFT), uses non-negative matrix factorization (NMF) to decompose that spectrogram into signal and noise components, and then computes the SNR in dB. The SNR is cached under the audio_nmf_snr stats key. When a sample contains multiple audios, the 'any' or 'all' strategy determines whether one audio or every audio must meet the condition. The operator extends the Filter base class and implements the two-phase compute_stats/process pattern.
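To make the decomposition step concrete, here is a minimal, standard-library-only sketch of a rank-2 NMF followed by an SNR estimate in dB. It is an illustration under stated assumptions, not the operator's code: the multiplicative-update solver, the helper names (`nmf`, `component`, `mean_power`, `snr_db`), the toy matrix, and the choice of treating component 0 as signal and component 1 as noise are all hypothetical. For brevity the power ratio is taken directly on the two spectrogram components rather than on reconstructed waveforms, which may differ from what the operator does.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius distance between v and the product w @ h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))


def nmf(v, rank=2, n_iter=500, seed=0, eps=1e-12):
    """Factor a non-negative matrix v (n x m) into w (n x rank) and h (rank x m)
    using multiplicative updates (a swapped-in solver for illustration)."""
    rng = random.Random(seed)
    w = [[rng.random() + eps for _ in range(rank)] for _ in v]
    h = [[rng.random() + eps for _ in v[0]] for _ in range(rank)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[x * n / (d + eps) for x, n, d in zip(*rows)] for rows in zip(h, num, den)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[x * n / (d + eps) for x, n, d in zip(*rows)] for rows in zip(w, num, den)]
    return w, h


def component(w, h, k):
    """Rank-one matrix contributed by the k-th basis/activation pair."""
    return [[row[k] * a for a in h[k]] for row in w]


def mean_power(m):
    values = [x * x for row in m for x in row]
    return sum(values) / len(values)


def snr_db(signal_power, noise_power):
    """10 * log10 of the power ratio, with explicit handling of zero powers."""
    if noise_power == 0:
        return float("inf")
    if signal_power == 0:
        return float("-inf")
    return 10.0 * math.log10(signal_power / noise_power)


# Toy 3 x 4 "magnitude spectrogram" (rows: frequency bins, columns: frames).
spectrogram = [[1.0, 0.5, 1.0, 0.5], [2.0, 0.5, 2.0, 0.5], [3.0, 0.5, 3.0, 0.5]]
w, h = nmf(spectrogram, rank=2, n_iter=500)
# Assumption: component 0 is "signal" and component 1 is "noise".
snr = snr_db(mean_power(component(w, h, 0)), mean_power(component(w, h, 1)))
```

The `n_iter` argument plays the role that nmf_iter_num plays for the operator: it caps the number of NMF iterations, trading runtime against how closely the two components reconstruct the spectrogram.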
Usage
Import this operator when you need to filter dataset samples based on the signal-to-noise ratio of audio files. Configure it in your Data-Juicer YAML config or instantiate directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/audio_nmf_snr_filter.py
- Lines: 1-134
Signature
@OPERATORS.register_module("audio_nmf_snr_filter")
@LOADED_AUDIOS.register_module("audio_nmf_snr_filter")
class AudioNMFSNRFilter(Filter):
    def __init__(
        self,
        min_snr: float = 0,
        max_snr: float = sys.maxsize,
        nmf_iter_num: PositiveInt = 500,
        any_or_all: str = "any",
        *args,
        **kwargs,
    ):
        ...
Import
from data_juicer.ops.filter.audio_nmf_snr_filter import AudioNMFSNRFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| min_snr | float | No | Minimum audio SNR, in dB, for a sample to be kept. Default: 0 |
| max_snr | float | No | Maximum audio SNR, in dB, for a sample to be kept. Default: sys.maxsize |
| nmf_iter_num | PositiveInt | No | Maximum number of NMF iterations. Default: 500 |
| any_or_all | str | No | Keep strategy across multiple audios: 'any' keeps the sample if any audio meets the condition; 'all' keeps it only if all audios do. Default: "any" |
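The interaction between min_snr, max_snr, and any_or_all can be sketched as a small pure function. This is a hypothetical helper (`keep_sample` is not part of Data-Juicer), and two behaviors in it are assumptions rather than documented facts: the bounds are treated as inclusive, and a sample with no audios is kept.

```python
def keep_sample(snrs, min_snr=0.0, max_snr=float("inf"), any_or_all="any"):
    """Decide whether to keep a sample, given one SNR (in dB) per audio."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")
    if not snrs:
        # Assumption: a sample with no audios has nothing to reject, so keep it.
        return True
    # Assumption: both bounds are inclusive.
    in_range = [min_snr <= snr <= max_snr for snr in snrs]
    return any(in_range) if any_or_all == "any" else all(in_range)


# One clean audio (25 dB) and one noisy audio (-3 dB) in the same sample.
mixed = [25.0, -3.0]
kept_any = keep_sample(mixed, min_snr=0, max_snr=50, any_or_all="any")
kept_all = keep_sample(mixed, min_snr=0, max_snr=50, any_or_all="all")
```

With 'any', one in-range audio is enough to keep the sample; with 'all', a single out-of-range audio causes it to be dropped.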
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (audio_nmf_snr) |
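The two-phase pattern behind this output can be illustrated with a toy class: the first phase computes and caches the statistic on the sample, and the second phase only reads the cached value to make the keep/drop decision. Everything here other than the audio_nmf_snr key is hypothetical: the class `ToySNRFilter`, the `"stats"` and `"audios"` field names, and the injected `score_fn` are stand-ins for illustration and do not reflect Data-Juicer's actual field names or internals.

```python
STATS_FIELD = "stats"       # hypothetical field name used only in this sketch
SNR_KEY = "audio_nmf_snr"   # stats key named on this page


class ToySNRFilter:
    """Toy two-phase filter: compute_stats caches, process decides."""

    def __init__(self, score_fn, min_snr=0.0, max_snr=float("inf")):
        self.score_fn = score_fn  # hypothetical: maps one audio to an SNR in dB
        self.min_snr = min_snr
        self.max_snr = max_snr
        self.calls = 0            # counts how often stats were actually computed

    def compute_stats(self, sample):
        stats = sample.setdefault(STATS_FIELD, {})
        if SNR_KEY in stats:
            # Already cached: skip the expensive computation.
            return sample
        self.calls += 1
        stats[SNR_KEY] = [self.score_fn(a) for a in sample.get("audios", [])]
        return sample

    def process(self, sample):
        snrs = sample[STATS_FIELD][SNR_KEY]
        if not snrs:
            return True  # assumption: samples without audios are kept
        return any(self.min_snr <= s <= self.max_snr for s in snrs)


# Fake scores stand in for the NMF-based SNR estimate.
fake_scores = {"clean.wav": 25.0, "noisy.wav": -3.0}
toy = ToySNRFilter(score_fn=fake_scores.__getitem__, min_snr=0, max_snr=50)
clean = toy.compute_stats({"audios": ["clean.wav"]})
noisy = toy.compute_stats({"audios": ["noisy.wav"]})
```

Separating the phases means the expensive statistic is computed once and can be reused, for example when the same dataset is filtered again with different thresholds.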
Usage Examples
YAML Configuration
process:
  - audio_nmf_snr_filter:
      min_snr: 0
      max_snr: 50
      nmf_iter_num: 500
      any_or_all: "any"
Python API
from data_juicer.ops.filter.audio_nmf_snr_filter import AudioNMFSNRFilter
op = AudioNMFSNRFilter(min_snr=0, max_snr=50, nmf_iter_num=500)
# Apply to a dataset (assumes `dataset` is an already-loaded Data-Juicer dataset)
result = dataset.process(op)