Principle: Data-Juicer Statistics Computation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Quality |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A two-phase operator pattern that first computes per-sample statistics and then uses those statistics to make filtering decisions.
Description
Statistics Computation is the mechanism by which Filter operators evaluate data quality. Each Filter implements a compute_stats method that calculates metrics (text length, language score, perplexity, etc.) and stores them in a per-sample statistics dictionary under the __dj__stats__ key. In analysis mode, the filtering step is skipped (op.process = None) so that only statistics are computed without removing any samples. This two-phase pattern (compute first, filter second) enables statistics reuse across fused filters and allows analysis workflows to profile data quality without modifying the dataset.
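As a minimal sketch of this pattern, a length-based filter can pair a compute_stats step with a process predicate, as shown below. The TextLengthFilter class, its parameters, and the STATS_KEY constant are illustrative assumptions, not Data-Juicer's actual API.
# Illustrative two-phase filter (NOT Data-Juicer's real classes)
STATS_KEY = '__dj__stats__'

class TextLengthFilter:
    def __init__(self, min_len=10, max_len=10_000):
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats(self, sample):
        # Phase 1: attach the metric to the per-sample statistics dict.
        stats = sample.setdefault(STATS_KEY, {})
        if 'text_len' not in stats:  # reuse a value another filter already computed
            stats['text_len'] = len(sample['text'])
        return sample

    def process(self, sample):
        # Phase 2: keep or drop the sample based on its precomputed stats.
        return self.min_len <= sample[STATS_KEY]['text_len'] <= self.max_len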
Usage
Use this principle when profiling dataset quality or when building filter operators. In analysis workflows, statistics are computed for all configured filters to produce quality profiles. In processing workflows, statistics drive the actual filtering decisions.
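In analysis workflows, the per-sample statistics can then be aggregated into a quality profile. The summarize_stats helper below is a hypothetical sketch of that aggregation step, not part of Data-Juicer.
# Hypothetical aggregation of per-sample stats into a quality profile
from statistics import mean

def summarize_stats(dataset, stats_key='__dj__stats__'):
    collected = {}
    for sample in dataset:
        for name, value in sample.get(stats_key, {}).items():
            collected.setdefault(name, []).append(value)
    # Report mean and extremes for each metric across the dataset.
    return {
        name: {'mean': mean(vals), 'min': min(vals), 'max': max(vals)}
        for name, vals in collected.items()
    }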
Theoretical Basis
# Abstract algorithm (NOT real implementation)
# Phase 1: Compute statistics
for sample in dataset:
    stats = {}
    for filter_op in filters:
        stats.update(filter_op.compute_stats(sample))
    sample['__dj__stats__'] = stats

# Phase 2 (processing only): Apply filter predicates
if not analysis_mode:
    dataset = dataset.filter(
        lambda sample: all(
            f.process_single(sample) for f in filters
        )
    )
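The two-phase split is also what makes statistics reuse across fused filters possible: once statistics are attached to a sample, every fused predicate can read them from the same dict. The FusedFilter wrapper below is a hypothetical sketch, not the library's actual operator-fusion mechanism.
# Hypothetical fusion wrapper (NOT Data-Juicer's real operator fusion)
class FusedFilter:
    def __init__(self, filters):
        self.filters = filters

    def compute_stats(self, sample):
        # All members write into the same per-sample stats dict, so a metric
        # computed by one filter is visible to (and reusable by) the others.
        for f in self.filters:
            sample = f.compute_stats(sample)
        return sample

    def process(self, sample):
        # A sample survives only if every fused predicate keeps it.
        return all(f.process(sample) for f in self.filters)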