Implementation:Datajuicer Data juicer AudioDurationFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples by the duration of their audio files.
Description
AudioDurationFilter is a filter operator that keeps data samples whose audio durations are within a specified range in seconds. It loads audio files using librosa and computes their durations via librosa.get_duration(), caching results under the audio_duration stats key. The operator supports 'any' (keep if any audio qualifies) or 'all' (keep only if every audio qualifies) strategies. It extends the Filter base class and implements the two-phase compute_stats/process pattern.
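The keep decision described above can be sketched in plain Python. This is a minimal illustration, not the operator's source: the function name `keep_sample` is hypothetical, and two behaviors are assumptions rather than facts stated on this page, namely that both bounds are inclusive and that a sample with no audio is kept.

```python
def keep_sample(durations, min_duration=0, max_duration=float("inf"), any_or_all="any"):
    """Decide whether to keep a sample given its per-audio durations in seconds.

    `keep_sample` is a hypothetical helper written for illustration only.
    """
    if any_or_all not in ("any", "all"):
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")

    # Assumption: a sample that carries no audio is not filtered out.
    if not durations:
        return True

    # Assumption: both ends of the range are inclusive.
    in_range = [min_duration <= d <= max_duration for d in durations]

    # 'any': one qualifying audio is enough; 'all': every audio must qualify.
    return any(in_range) if any_or_all == "any" else all(in_range)
```

With durations of 10 and 400 seconds and a 0 to 300 second range, the 'any' strategy keeps the sample because the first audio qualifies, while 'all' drops it because the second does not.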
Usage
Import this operator when you need to filter dataset samples based on the duration of audio files. Configure it in your Data-Juicer YAML config or instantiate directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/audio_duration_filter.py
- Lines: 1-90
Signature
@OPERATORS.register_module("audio_duration_filter")
@LOADED_AUDIOS.register_module("audio_duration_filter")
class AudioDurationFilter(Filter):
    def __init__(
        self, min_duration: int = 0, max_duration: int = sys.maxsize, any_or_all: str = "any", *args, **kwargs
    ):
        ...
Import
from data_juicer.ops.filter.audio_duration_filter import AudioDurationFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| min_duration | int | No | Minimum audio duration, in seconds, for a sample to be kept. Default: 0 |
| max_duration | int | No | Maximum audio duration, in seconds, for a sample to be kept. Default: sys.maxsize |
| any_or_all | str | No | Keep strategy: 'any' keeps a sample if at least one of its audio files meets the condition; 'all' keeps it only if every audio file does. Default: "any" |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (audio_duration) |
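The stats update in the first phase can be illustrated with a small, self-contained sketch. It is not the operator's code: it swaps librosa for the standard-library `wave` module so that it runs without extra dependencies, it assumes each sample holds in-memory WAV bytes under a hypothetical `"audios"` field, and the `"stats"` field name is a placeholder for whatever constant Data-Juicer uses. Only the `audio_duration` key comes from the description above.

```python
import io
import wave

STATS_FIELD = "stats"            # placeholder field name (assumption)
DURATION_KEY = "audio_duration"  # stats key named in the description


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a WAV payload: frame count divided by sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def compute_stats(sample: dict) -> dict:
    """Store per-audio durations in the sample's stats, skipping cached results."""
    stats = sample.setdefault(STATS_FIELD, {})
    if DURATION_KEY in stats:
        return sample  # already computed; reuse the cached value
    stats[DURATION_KEY] = [wav_duration_seconds(b) for b in sample.get("audios", [])]
    return sample


def make_silent_wav(seconds: float, rate: int = 8000) -> bytes:
    """Build a mono 16-bit silent WAV in memory (helper for the demo only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


sample = compute_stats({"audios": [make_silent_wav(2.0), make_silent_wav(0.5)]})
```

Once the durations are stored, the second phase only needs to read the list back from the stats field, so the audio is never decoded twice for the same sample.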
Usage Examples
YAML Configuration
process:
  - audio_duration_filter:
      min_duration: 0
      max_duration: 300
      any_or_all: "any"
Python API
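from data_juicer.ops.filter.audio_duration_filter import AudioDurationFilter

# Keep samples whose audio lasts between 0 and 300 seconds; any_or_all defaults to "any"
op = AudioDurationFilter(min_duration=0, max_duration=300)

# Apply to dataset (`dataset` is assumed to be an already-loaded Data-Juicer dataset)
result = dataset.process(op)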