
Principle: Data-Juicer Statistics Computation

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Data_Quality
Last Updated 2026-02-14 17:00 GMT

Overview

A two-phase operator pattern that first computes per-sample statistics and then uses those statistics to make filtering decisions.

Description

Statistics Computation is the mechanism by which Filter operators evaluate data quality. Each Filter implements a compute_stats method that calculates metrics (text length, language score, perplexity, etc.) and stores them in a per-sample statistics dictionary under the __dj__stats__ key. In analysis mode, the filtering step is skipped (op.process = None) so that only statistics are computed without removing any samples. This two-phase pattern (compute first, filter second) enables statistics reuse across fused filters and allows analysis workflows to profile data quality without modifying the dataset.
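The two-phase split described above can be sketched as a small Python example. The class and field names below (TextLengthFilter, text_len, min_len, max_len) are hypothetical illustrations, not the real Data-Juicer API; only the __dj__stats__ key and the compute_stats / process_single split come from the description.

```python
# Minimal sketch of a two-phase Filter operator (assumed names; not the
# real Data-Juicer implementation). Phase 1 records a metric under the
# per-sample '__dj__stats__' dict; phase 2 reads it to decide keep/drop.

STATS_KEY = '__dj__stats__'  # Data-Juicer's per-sample statistics key


class TextLengthFilter:
    """Hypothetical filter: keep samples whose text length is in range."""

    def __init__(self, min_len=10, max_len=10_000):
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats(self, sample):
        # Phase 1: compute the metric only; no filtering decision here.
        sample.setdefault(STATS_KEY, {})
        sample[STATS_KEY]['text_len'] = len(sample['text'])
        return sample

    def process_single(self, sample):
        # Phase 2: read the precomputed stat and return a keep/drop flag.
        length = sample[STATS_KEY]['text_len']
        return self.min_len <= length <= self.max_len


sample = {'text': 'short'}
flt = TextLengthFilter(min_len=10)
flt.compute_stats(sample)
print(sample[STATS_KEY])         # {'text_len': 5}
print(flt.process_single(sample))  # False: below min_len
```

Because the decision reads only from the stats dict, a fused pipeline can skip recomputing a metric that another filter already stored under the same key.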

Usage

Use this principle when profiling dataset quality or when building filter operators. In analysis workflows, statistics are computed for all configured filters to produce quality profiles. In processing workflows, statistics drive the actual filtering decisions.

Theoretical Basis

# Abstract algorithm (NOT real implementation)
# Phase 1: Compute statistics
for sample in dataset:
    stats = {}
    for filter_op in filters:
        stats.update(filter_op.compute_stats(sample))
    sample['__dj__stats__'] = stats

# Phase 2 (processing only): Apply filter predicates
if not analysis_mode:
    dataset = dataset.filter(
        lambda sample: all(
            f.process_single(sample) for f in filters
        )
    )

Related Pages

Implemented By
