Principle: Data-Juicer Statistics Computation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Quality |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A two-phase operator pattern that first computes per-sample statistics and then uses those statistics to make filtering decisions.
Description
Statistics Computation is the mechanism by which Filter operators evaluate data quality. Each Filter implements a compute_stats method that calculates metrics (text length, language score, perplexity, etc.) and stores them in a per-sample statistics dictionary under the __dj__stats__ key. In analysis mode, the filtering step is skipped (op.process = None) so that only statistics are computed without removing any samples. This two-phase pattern (compute first, filter second) enables statistics reuse across fused filters and allows analysis workflows to profile data quality without modifying the dataset.
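As a minimal sketch of this pattern, a length-based filter can pair a compute_stats step with a process predicate, as shown below. The TextLengthFilter class, its parameters, and the STATS_KEY constant are illustrative assumptions, not Data-Juicer's actual API.
# Illustrative two-phase filter (NOT Data-Juicer's real classes)
STATS_KEY = '__dj__stats__'

class TextLengthFilter:
    def __init__(self, min_len=10, max_len=10_000):
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats(self, sample):
        # Phase 1: attach the metric to the per-sample statistics dict.
        stats = sample.setdefault(STATS_KEY, {})
        if 'text_len' not in stats:  # reuse a value another filter already computed
            stats['text_len'] = len(sample['text'])
        return sample

    def process(self, sample):
        # Phase 2: keep or drop the sample based on its precomputed stats.
        return self.min_len <= sample[STATS_KEY]['text_len'] <= self.max_len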
Usage
Use this principle when profiling dataset quality or when building filter operators. In analysis workflows, statistics are computed for all configured filters to produce quality profiles. In processing workflows, statistics drive the actual filtering decisions.
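In analysis workflows, the per-sample statistics can then be aggregated into a quality profile. The summarize_stats helper below is a hypothetical sketch of that aggregation step, not part of Data-Juicer.
# Hypothetical aggregation of per-sample stats into a quality profile
from statistics import mean

def summarize_stats(dataset, stats_key='__dj__stats__'):
    collected = {}
    for sample in dataset:
        for name, value in sample.get(stats_key, {}).items():
            collected.setdefault(name, []).append(value)
    # Report mean and extremes for each metric across the dataset.
    return {
        name: {'mean': mean(vals), 'min': min(vals), 'max': max(vals)}
        for name, vals in collected.items()
    }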
Theoretical Basis
# Abstract algorithm (NOT real implementation)
# Phase 1: Compute statistics
for sample in dataset:
    stats = {}
    for filter_op in filters:
        stats.update(filter_op.compute_stats(sample))
    sample['__dj__stats__'] = stats

# Phase 2 (processing only): Apply filter predicates
if not analysis_mode:
    dataset = dataset.filter(
        lambda sample: all(
            f.process_single(sample) for f in filters
        )
    )
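The two-phase split is also what makes statistics reuse across fused filters possible: once statistics are attached to a sample, every fused predicate can read them from the same dict. The FusedFilter wrapper below is a hypothetical sketch, not the library's actual operator-fusion mechanism.
# Hypothetical fusion wrapper (NOT Data-Juicer's real operator fusion)
class FusedFilter:
    def __init__(self, filters):
        self.filters = filters

    def compute_stats(self, sample):
        # All members write into the same per-sample stats dict, so a metric
        # computed by one filter is visible to (and reusable by) the others.
        for f in self.filters:
            sample = f.compute_stats(sample)
        return sample

    def process(self, sample):
        # A sample survives only if every fused predicate keeps it.
        return all(f.process(sample) for f in self.filters)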