Implementation:Data-Juicer Filter Compute Stats

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Data_Quality
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete mechanism for computing per-sample quality statistics in the filter operators provided by the Data-Juicer framework.

Description

The Filter base class defines compute_stats_single and compute_stats_batched methods that subclasses implement to calculate quality metrics. Statistics are stored in the sample's __dj__stats__ dictionary under keys defined by StatsKeys. The Analyzer orchestration loop sets op.process = None so that statistics are computed without any samples being filtered out. Each filter (e.g., TextLengthFilter, LanguageIDScoreFilter, PerplexityFilter) implements its own compute_stats_single (or compute_stats_batched) with domain-specific logic.
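The two-method split described above can be illustrated with a self-contained sketch. ToyTextLengthFilter below is a hypothetical stand-in that mirrors the pattern of Data-Juicer's Filter subclasses (stats written under __dj__stats__, decision read from the same dict) without depending on the library; it is not the real TextLengthFilter.

```python
# Self-contained sketch of the Filter pattern: a compute-stats phase
# that annotates the sample, and a separate process phase that reads
# the stored statistic. ToyTextLengthFilter is hypothetical; the real
# base class lives in data_juicer/ops/base_op.py.
STATS_KEY = '__dj__stats__'  # Data-Juicer's per-sample stats field


class ToyTextLengthFilter:
    def __init__(self, min_len=10, max_len=1000):
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats_single(self, sample):
        # Idempotent: skip work if the stat is already present.
        stats = sample.setdefault(STATS_KEY, {})
        if 'text_len' not in stats:
            stats['text_len'] = len(sample['text'])
        return sample

    def process_single(self, sample):
        # The decision reads only the precomputed stats dict.
        text_len = sample[STATS_KEY]['text_len']
        return self.min_len <= text_len <= self.max_len


op = ToyTextLengthFilter(min_len=5, max_len=50)
sample = op.compute_stats_single({'text': 'short example'})
print(sample[STATS_KEY])          # {'text_len': 13}
print(op.process_single(sample))  # True
```

Because stats live on the sample rather than inside the operator, any later operator (or the Analyzer) can reuse them without recomputation.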

Usage

This is used in the Dataset Quality Analysis workflow to compute statistics for profiling, and in processing workflows as the first phase before filtering.
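The two workflows above can be sketched with plain Python lists standing in for Data-Juicer's dataset abstraction. MinLenOp and run_two_phase are hypothetical names used only for illustration; the analysis_only flag plays the role of the Analyzer disabling op.process.

```python
# Hedged sketch of the two-phase workflow: phase 1 annotates every
# sample with stats, phase 2 filters on those stats. In analysis mode
# phase 2 is skipped, leaving annotated samples for profiling.
class MinLenOp:
    # Toy op with the Filter interface (hypothetical).
    def __init__(self, min_len):
        self.min_len = min_len

    def compute_stats_single(self, sample):
        sample.setdefault('__dj__stats__', {})['text_len'] = len(sample['text'])
        return sample

    def process_single(self, sample):
        return sample['__dj__stats__']['text_len'] >= self.min_len


def run_two_phase(samples, ops, analysis_only=False):
    # Phase 1: every op computes its stats for every sample.
    for op in ops:
        samples = [op.compute_stats_single(s) for s in samples]
    if analysis_only:
        return samples  # stats annotated, nothing removed
    # Phase 2: keep only samples that every op accepts.
    return [s for s in samples if all(op.process_single(s) for op in ops)]


data = [{'text': 'hi'}, {'text': 'a much longer sample'}]
kept = run_two_phase(data, [MinLenOp(min_len=5)])
print([s['text'] for s in kept])  # ['a much longer sample']
```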

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/ops/base_op.py
  • Lines: L666-815

Signature

class Filter(OP):
    def __init__(self, *args, **kwargs):
        """
        Base class that keeps or removes samples based on statistics.
        """

    def compute_stats_single(self, sample, context=False):
        """
        Compute quality statistics for a single sample.

        Args:
            sample: Dict representing one data sample.
            context: Whether to include context information.

        Returns:
            sample with __dj__stats__ populated.
        """
        raise NotImplementedError

    def compute_stats_batched(self, samples, context=False):
        """
        Compute quality statistics for a batch of samples.

        Args:
            samples: Dict of lists (batched format).
            context: Whether to include context information.

        Returns:
            samples with __dj__stats__ populated per sample.
        """

    def process_single(self, sample):
        """
        Filter decision for a single sample.

        Args:
            sample: Dict with __dj__stats__ already computed.

        Returns:
            True to keep sample, False to remove.
        """
        raise NotImplementedError

Import

from data_juicer.ops.base_op import Filter

I/O Contract

Inputs

Name     Type           Required       Description
sample   dict           Yes            Single data sample with text/media fields
samples  dict of lists  Yes (batched)  Batch of samples in columnar format

Outputs

Name    Type  Description
sample  dict  Sample with __dj__stats__ dict populated (e.g. text_len, lang_score)
keep    bool  Decision from process_single: True to keep, False to remove

Usage Examples

Using Analyzer for Stats Only

from data_juicer.config import init_configs
from data_juicer.core.analyzer import Analyzer

# Analysis mode: compute stats without filtering
cfg = init_configs(args=['--config', 'analysis.yaml'])
analyzer = Analyzer(cfg)
analyzer.run()
# Output: stats files and analysis reports in work_dir

Inspecting Computed Stats

from data_juicer.ops.filter.text_length_filter import TextLengthFilter

# Instantiate a filter
text_filter = TextLengthFilter(min_len=100, max_len=10000)

# Compute stats for a sample
sample = {'text': 'Hello world, this is a test sample.'}
sample = text_filter.compute_stats_single(sample)
print(sample['__dj__stats__'])
# {'text_len': 35}

# Check filter decision
keep = text_filter.process_single(sample)
print(keep)  # False (35 < 100)
