Implementation:Datajuicer Data juicer AudioSizeFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples based on audio file size.
Description
AudioSizeFilter is a filter operator that keeps data samples whose audio files fall within a specified size range. The bounds are given as size strings with an optional unit (e.g., bytes, KB, MB). The key metric, audio_sizes, is an array holding the size in bytes of each audio file in the sample. The operator supports two strategies: 'any' (keep the sample if at least one audio file meets the size condition) and 'all' (keep the sample only if every audio file meets it). It extends the Filter base class and implements the two-phase compute_stats/process pattern: compute_stats records the sizes, and process decides from them whether to keep the sample.
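The two-phase pattern and the 'any'/'all' strategies can be sketched in plain Python. This is a minimal illustration, not the operator's source: the function names, the "audios" and "stats" field names, and the lookup table of file sizes are all hypothetical, and keeping samples that have no audio files is an assumption.

```python
from typing import Dict, List

# Hypothetical stand-ins for illustration only; not Data-Juicer's API.


def compute_stats(sample: Dict, file_sizes: Dict[str, int]) -> Dict:
    """Phase 1: record the size in bytes of each audio file in the sample."""
    stats = sample.setdefault("stats", {})
    if "audio_sizes" not in stats:  # do not recompute an existing stat
        stats["audio_sizes"] = [file_sizes[path] for path in sample.get("audios", [])]
    return sample


def keep_sample(sample: Dict, min_bytes: int, max_bytes: int, any_or_all: str = "any") -> bool:
    """Phase 2: decide from the recorded sizes whether to keep the sample."""
    flags: List[bool] = [
        min_bytes <= size <= max_bytes for size in sample["stats"]["audio_sizes"]
    ]
    if not flags:
        # Assumption: a sample without audio files is kept.
        return True
    return any(flags) if any_or_all == "any" else all(flags)


# One small file inside the range, one large file outside it.
sizes = {"a.wav": 2_000, "b.wav": 9_000_000}
sample = compute_stats({"audios": ["a.wav", "b.wav"]}, sizes)
kept_any = keep_sample(sample, 1_000, 1_000_000, "any")
kept_all = keep_sample(sample, 1_000, 1_000_000, "all")
```

With one file inside the range and one outside, the 'any' strategy keeps the sample while the 'all' strategy drops it.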
Usage
Import this operator when you need to filter dataset samples by the file size of their audio files. Configure it in your Data-Juicer YAML config or instantiate it directly in Python.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/audio_size_filter.py
- Lines: 1-71
Signature
@OPERATORS.register_module("audio_size_filter")
class AudioSizeFilter(Filter):
    def __init__(self, min_size: str = "0", max_size: str = "1TB", any_or_all: str = "any", *args, **kwargs):
        ...
Import
from data_juicer.ops.filter.audio_size_filter import AudioSizeFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| min_size | str | No | The minimum audio size to keep samples (e.g., "0", "1KB", "5MB"). Default: "0" |
| max_size | str | No | The maximum audio size to keep samples. Default: "1TB" |
| any_or_all | str | No | Keep strategy: 'any' keeps if any audio meets condition, 'all' keeps only if all audios meet condition. Default: "any" |
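To make the size-string inputs concrete, here is a rough sketch of how strings such as "0", "1KB", or "5MB" might be converted to a byte count. The parse_size function is hypothetical and is not Data-Juicer's own helper; it assumes binary multiples (1 KB = 1024 bytes), which may differ from the library's actual interpretation.

```python
import re

# Hypothetical parser for illustration; not Data-Juicer's own helper.
# Assumption: units are binary multiples (1 KB = 1024 bytes).
_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size string such as "5MB" into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size string: {text!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size string: {text!r}")
    # A bare number (empty unit) is treated as a byte count.
    return int(number) * _UNITS[unit]


min_bytes = parse_size("0")
max_bytes = parse_size("1TB")
```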
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (audio_sizes) |
Usage Examples
YAML Configuration
process:
  - audio_size_filter:
      min_size: "0"
      max_size: "1TB"
      any_or_all: "any"
Python API
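from data_juicer.ops.filter.audio_size_filter import AudioSizeFilter

# Keep samples whose audio files are at most 100MB
op = AudioSizeFilter(min_size="0", max_size="100MB")

# Apply to dataset (`dataset` is assumed to be an already-loaded Data-Juicer dataset)
result = dataset.process(op)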