Implementation:Datajuicer Data juicer Filter Compute Stats
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Quality |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tool for computing per-sample quality statistics in filter operators provided by the Data-Juicer framework.
Description
The Filter base class defines compute_stats_single and compute_stats_batched methods that subclasses implement to calculate quality metrics. Statistics are stored in the sample's __dj__stats__ dictionary under keys defined by StatsKeys. To compute stats without filtering, the Analyzer orchestration loop sets op.process = None before running each filter. Each filter (e.g., TextLengthFilter, LanguageIDScoreFilter, PerplexityFilter) implements its own stats computation with domain-specific logic.
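The two-phase contract described above can be sketched without the framework. This is a minimal stand-in, not Data-Juicer code: `WordCountFilter`, the `num_words` key, and the `run` helper are hypothetical names introduced for illustration, and only the `__dj__stats__` field name comes from the text.

```python
STATS_FIELD = "__dj__stats__"  # stats field name taken from the description


class WordCountFilter:
    """Hypothetical stand-in for a Filter subclass; not part of Data-Juicer."""

    def __init__(self, min_words=3):
        self.min_words = min_words

    def compute_stats_single(self, sample):
        # Phase 1: store the metric in the sample's stats dict.
        stats = sample.setdefault(STATS_FIELD, {})
        if "num_words" not in stats:  # skip work if already computed
            stats["num_words"] = len(sample["text"].split())
        return sample

    def process_single(self, sample):
        # Phase 2: keep/remove decision based only on stored stats.
        return sample[STATS_FIELD]["num_words"] >= self.min_words


def run(op, samples, analysis_only=False):
    """Always compute stats; skip the filter decision in analysis mode."""
    with_stats = [op.compute_stats_single(dict(s)) for s in samples]
    if analysis_only:
        return with_stats
    return [s for s in with_stats if op.process_single(s)]


samples = [{"text": "too short"}, {"text": "this one has enough words"}]
op = WordCountFilter(min_words=3)
profiled = run(op, samples, analysis_only=True)  # both samples, stats attached
kept = run(op, samples)                          # only the longer sample
```

Separating the metric from the decision is what lets the same operator serve both profiling and filtering: analysis mode simply stops after the first phase.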
Usage
This implementation is used in the Dataset Quality Analysis workflow to compute statistics for profiling, and in processing workflows as the first phase, before the filtering decision is applied.
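As a rough illustration of the profiling use, the sketch below summarizes one stat across samples that already carry a `__dj__stats__` dict. The `summarize_stat` helper and the sample values are assumptions made for this example; the reports the Analyzer actually writes may look different.

```python
from statistics import mean, median


def summarize_stat(samples, key, stats_field="__dj__stats__"):
    """Hypothetical helper: aggregate one per-sample stat for profiling."""
    values = [
        s[stats_field][key]
        for s in samples
        if key in s.get(stats_field, {})
    ]
    if not values:
        return {}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


# Illustrative samples whose stats were computed in an earlier pass.
profiled = [
    {"text": "a", "__dj__stats__": {"text_len": 1}},
    {"text": "abc", "__dj__stats__": {"text_len": 3}},
    {"text": "abcde", "__dj__stats__": {"text_len": 5}},
]
summary = summarize_stat(profiled, "text_len")
```

A summary like this is one way to pick filter thresholds from the observed distribution before running the filtering phase.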
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/ops/base_op.py
- Lines: L666-815
Signature
class Filter(OP):
    def __init__(self, *args, **kwargs):
        """
        Base class that keeps or removes samples based on statistics.
        """

    def compute_stats_single(self, sample, context=False):
        """
        Compute quality statistics for a single sample.

        Args:
            sample: Dict representing one data sample.
            context: Whether to include context information.

        Returns:
            sample with __dj__stats__ populated.
        """
        raise NotImplementedError

    def compute_stats_batched(self, samples, context=False):
        """
        Compute quality statistics for a batch of samples.

        Args:
            samples: Dict of lists (batched format).
            context: Whether to include context information.

        Returns:
            samples with __dj__stats__ populated per sample.
        """

    def process_single(self, sample):
        """
        Filter decision for a single sample.

        Args:
            sample: Dict with __dj__stats__ already computed.

        Returns:
            True to keep sample, False to remove.
        """
        raise NotImplementedError
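The signature shows a single-sample method and a batched method over a dict of lists. One plausible way to relate the two is to unpack the columnar batch into rows, apply the single-sample function, and repack. The sketch below assumes that approach; the helper names and the `num_chars` key are hypothetical, and the real base class may implement batching differently.

```python
def columnar_to_rows(samples):
    """Turn {'text': [a, b], ...} into [{'text': a, ...}, {'text': b, ...}]."""
    keys = list(samples)
    n = len(samples[keys[0]]) if keys else 0
    return [{k: samples[k][i] for k in keys} for i in range(n)]


def rows_to_columnar(rows):
    """Inverse of columnar_to_rows; assumes every row has the same keys."""
    if not rows:
        return {}
    return {k: [row[k] for row in rows] for k in rows[0]}


def compute_stats_batched_via_single(compute_stats_single, samples):
    """Hypothetical batched wrapper that delegates to a per-sample function."""
    rows = [compute_stats_single(row) for row in columnar_to_rows(samples)]
    return rows_to_columnar(rows)


def char_count_stats(sample):
    # Illustrative per-sample stat; 'num_chars' is not a Data-Juicer key.
    sample["__dj__stats__"]["num_chars"] = len(sample["text"])
    return sample


batch = {"text": ["ab", "abcd"], "__dj__stats__": [{}, {}]}
out = compute_stats_batched_via_single(char_count_stats, batch)
```

Delegation keeps subclasses simple, while a filter with a vectorizable metric can still override the batched method directly for speed.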
Import
from data_juicer.ops.base_op import Filter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| sample | dict | Yes | Single data sample with text/media fields |
| samples | dict of lists | Yes (batched) | Batch of samples in columnar format |
Outputs
| Name | Type | Description |
|---|---|---|
| sample | dict | Sample with __dj__stats__ dict populated (e.g. text_len, lang_score) |
| bool (process) | bool | True to keep, False to remove |
Usage Examples
Using Analyzer for Stats Only
from data_juicer.config import init_configs
from data_juicer.core.analyzer import Analyzer
# Analysis mode: compute stats without filtering
cfg = init_configs(args=['--config', 'analysis.yaml'])
analyzer = Analyzer(cfg)
analyzer.run()
# Output: stats files and analysis reports in work_dir
Inspecting Computed Stats
from data_juicer.ops.filter.text_length_filter import TextLengthFilter
# Instantiate a filter
text_filter = TextLengthFilter(min_len=100, max_len=10000)
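# Compute stats for a sample. The stats field is initialized by hand here
# (assumption: inside a dataset pipeline the framework adds it beforehand).
sample = {'text': 'Hello world, this is a test sample.', '__dj__stats__': {}}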
sample = text_filter.compute_stats_single(sample)
print(sample['__dj__stats__'])
# {'text_len': 35}
# Check filter decision
keep = text_filter.process_single(sample)
print(keep)  # False (the text has 35 characters, below min_len=100)