Implementation:Data-Juicer Filter Compute Stats

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Data_Quality
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete mechanism for computing per-sample quality statistics in the filter operators provided by the Data-Juicer framework.

Description

The Filter base class defines compute_stats_single and compute_stats_batched methods that subclasses implement to calculate quality metrics. Statistics are stored in the sample's __dj__stats__ dictionary under keys defined by StatsKeys. The Analyzer orchestration loop sets op.process = None so that statistics are computed without any samples being filtered out. Each filter (e.g., TextLengthFilter, LanguageIDScoreFilter, PerplexityFilter) implements its own compute_stats_single (or compute_stats_batched) with domain-specific logic.
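The two-method split described above can be illustrated with a self-contained sketch. ToyTextLengthFilter below is a hypothetical stand-in that mirrors the pattern of Data-Juicer's Filter subclasses (stats written under __dj__stats__, decision read from the same dict) without depending on the library; it is not the real TextLengthFilter.

```python
# Self-contained sketch of the Filter pattern: a compute-stats phase
# that annotates the sample, and a separate process phase that reads
# the stored statistic. ToyTextLengthFilter is hypothetical; the real
# base class lives in data_juicer/ops/base_op.py.
STATS_KEY = '__dj__stats__'  # Data-Juicer's per-sample stats field


class ToyTextLengthFilter:
    def __init__(self, min_len=10, max_len=1000):
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats_single(self, sample):
        # Idempotent: skip work if the stat is already present.
        stats = sample.setdefault(STATS_KEY, {})
        if 'text_len' not in stats:
            stats['text_len'] = len(sample['text'])
        return sample

    def process_single(self, sample):
        # The decision reads only the precomputed stats dict.
        text_len = sample[STATS_KEY]['text_len']
        return self.min_len <= text_len <= self.max_len


op = ToyTextLengthFilter(min_len=5, max_len=50)
sample = op.compute_stats_single({'text': 'short example'})
print(sample[STATS_KEY])          # {'text_len': 13}
print(op.process_single(sample))  # True
```

Because stats live on the sample rather than inside the operator, any later operator (or the Analyzer) can reuse them without recomputation.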

Usage

This is used in the Dataset Quality Analysis workflow to compute statistics for profiling, and in processing workflows as the first phase before filtering.
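The two workflows above can be sketched with plain Python lists standing in for Data-Juicer's dataset abstraction. MinLenOp and run_two_phase are hypothetical names used only for illustration; the analysis_only flag plays the role of the Analyzer disabling op.process.

```python
# Hedged sketch of the two-phase workflow: phase 1 annotates every
# sample with stats, phase 2 filters on those stats. In analysis mode
# phase 2 is skipped, leaving annotated samples for profiling.
class MinLenOp:
    # Toy op with the Filter interface (hypothetical).
    def __init__(self, min_len):
        self.min_len = min_len

    def compute_stats_single(self, sample):
        sample.setdefault('__dj__stats__', {})['text_len'] = len(sample['text'])
        return sample

    def process_single(self, sample):
        return sample['__dj__stats__']['text_len'] >= self.min_len


def run_two_phase(samples, ops, analysis_only=False):
    # Phase 1: every op computes its stats for every sample.
    for op in ops:
        samples = [op.compute_stats_single(s) for s in samples]
    if analysis_only:
        return samples  # stats annotated, nothing removed
    # Phase 2: keep only samples that every op accepts.
    return [s for s in samples if all(op.process_single(s) for op in ops)]


data = [{'text': 'hi'}, {'text': 'a much longer sample'}]
kept = run_two_phase(data, [MinLenOp(min_len=5)])
print([s['text'] for s in kept])  # ['a much longer sample']
```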

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/ops/base_op.py
  • Lines: L666-815

Signature

class Filter(OP):
    def __init__(self, *args, **kwargs):
        """
        Base class that keeps or removes samples based on statistics.
        """

    def compute_stats_single(self, sample, context=False):
        """
        Compute quality statistics for a single sample.

        Args:
            sample: Dict representing one data sample.
            context: Whether to include context information.

        Returns:
            sample with __dj__stats__ populated.
        """
        raise NotImplementedError

    def compute_stats_batched(self, samples, context=False):
        """
        Compute quality statistics for a batch of samples.

        Args:
            samples: Dict of lists (batched format).
            context: Whether to include context information.

        Returns:
            samples with __dj__stats__ populated per sample.
        """

    def process_single(self, sample):
        """
        Filter decision for a single sample.

        Args:
            sample: Dict with __dj__stats__ already computed.

        Returns:
            True to keep sample, False to remove.
        """
        raise NotImplementedError

Import

from data_juicer.ops.base_op import Filter

I/O Contract

Inputs

Name     Type           Required       Description
sample   dict           Yes            Single data sample with text/media fields
samples  dict of lists  Yes (batched)  Batch of samples in columnar format

Outputs

Name    Type  Description
sample  dict  Sample with __dj__stats__ dict populated (e.g. text_len, lang_score)
keep    bool  Decision from process_single: True to keep, False to remove

Usage Examples

Using Analyzer for Stats Only

from data_juicer.config import init_configs
from data_juicer.core.analyzer import Analyzer

# Analysis mode: compute stats without filtering
cfg = init_configs(args=['--config', 'analysis.yaml'])
analyzer = Analyzer(cfg)
analyzer.run()
# Output: stats files and analysis reports in work_dir

Inspecting Computed Stats

from data_juicer.ops.filter.text_length_filter import TextLengthFilter

# Instantiate a filter
text_filter = TextLengthFilter(min_len=100, max_len=10000)

# Compute stats for a sample
sample = {'text': 'Hello world, this is a test sample.'}
sample = text_filter.compute_stats_single(sample)
print(sample['__dj__stats__'])
# {'text_len': 35}

# Check filter decision
keep = text_filter.process_single(sample)
print(keep)  # False (35 < 100)
