

Principle:Datajuicer Data juicer Statistics Key Definition

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Data_Quality
Last Updated 2026-02-14 17:00 GMT

Overview

A constants registry pattern that defines standardized key names for per-sample quality statistics to ensure consistency across operators.

Description

Statistics Key Definition provides a centralized registry of string constants that identify quality metrics computed by filter operators. Each key (e.g., text_len, lang_score, perplexity) is defined once in a StatsKeysConstant class and referenced by all operators that compute or consume that metric. This prevents typos, enables auto-discovery of all available metrics, and allows the analysis system to introspect which metrics have been computed. The keys correspond to entries in the per-sample __dj__stats__ dictionary.
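The auto-discovery and introspection described here can be sketched with plain class attributes. This is a minimal illustration, not Data-Juicer's implementation: the class below is a stand-in holding only the three example keys, and the helpers registered_keys and computed_keys are hypothetical names introduced for this example.

```python
class StatsKeysConstant:
    """Stand-in registry (assumption): only the three keys named in the text."""
    text_len = "text_len"
    lang_score = "lang_score"
    perplexity = "perplexity"


def registered_keys(registry: type) -> list[str]:
    """Hypothetical helper: list every public string constant on the registry."""
    return sorted(
        value
        for name, value in vars(registry).items()
        if not name.startswith("_") and isinstance(value, str)
    )


def computed_keys(sample: dict, registry: type) -> list[str]:
    """Hypothetical helper: registered keys already present in the sample's stats."""
    stats = sample.get("__dj__stats__", {})
    return [key for key in registered_keys(registry) if key in stats]


# A sample for which only text_len has been computed so far.
sample = {"text": "hello", "__dj__stats__": {"text_len": 5}}
all_keys = registered_keys(StatsKeysConstant)
present = computed_keys(sample, StatsKeysConstant)
```

Because every metric name lives on one class, listing the available metrics is a matter of reading that class's attributes, and checking which ones a sample carries is a dictionary membership test.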

Usage

Use this principle when defining statistics for a new custom filter operator. Register a new constant in StatsKeysConstant before using it in compute_stats.
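The register-then-use order might look like the following sketch. Everything here is illustrative: vowel_ratio is a made-up metric, VowelRatioFilter is a hypothetical operator that does not inherit from any Data-Juicer base class, and the process method name is an assumption. Only StatsKeysConstant, compute_stats, and the __dj__stats__ field come from the text above.

```python
class StatsKeysConstant:
    """Stand-in registry (assumption), not the real Data-Juicer class."""
    text_len = "text_len"
    # Step 1: register the new key before any operator refers to it.
    vowel_ratio = "vowel_ratio"  # hypothetical metric for this example


class VowelRatioFilter:
    """Hypothetical custom filter that keeps samples with enough vowels."""

    def __init__(self, min_ratio: float = 0.2):
        self.min_ratio = min_ratio

    def compute_stats(self, sample: dict) -> dict:
        # Step 2: write the metric under the registered key, never a raw string.
        stats = sample.setdefault("__dj__stats__", {})
        if StatsKeysConstant.vowel_ratio in stats:
            return sample  # already computed by an earlier pass
        text = sample["text"]
        vowels = sum(ch in "aeiouAEIOU" for ch in text)
        stats[StatsKeysConstant.vowel_ratio] = vowels / len(text) if text else 0.0
        return sample

    def process(self, sample: dict) -> bool:
        # Step 3: the keep/drop decision reads the metric through the same key.
        return sample["__dj__stats__"][StatsKeysConstant.vowel_ratio] >= self.min_ratio


op = VowelRatioFilter(min_ratio=0.2)
sample = op.compute_stats({"text": "banana"})
keep = op.process(sample)
```

Writing and reading through the same constant means the computing step and the filtering step cannot drift apart on the key name.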

Theoretical Basis

# Abstract pattern (NOT real implementation)
class StatsKeys:
    text_len = "text_len"
    lang_score = "lang_score"
    perplexity = "perplexity"
    # ... all metric names

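# Usage in a filter (text and sample are illustrative placeholders):
text = "hello world"
sample = {"text": text, "__dj__stats__": {}}
sample["__dj__stats__"][StatsKeys.text_len] = len(text)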

The centralized definition guards against misspelled keys and makes every metric discoverable, at the cost of requiring each new metric to be registered before use.
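A short sketch of the trade-off, using the same abstract StatsKeys stand-in rather than the real class: a misspelled raw string silently creates a second dictionary entry, while a misspelled constant fails immediately with an AttributeError.

```python
class StatsKeys:
    """Abstract stand-in (assumption), reduced to a single key."""
    text_len = "text_len"


# Raw strings: the typo is accepted and quietly becomes a separate metric.
raw_stats = {}
raw_stats["text_len"] = 5
raw_stats["text_length"] = 5  # misspelling goes unnoticed

# Registry constants: the same typo is rejected at attribute lookup.
safe_stats = {}
typo_caught = False
try:
    safe_stats[StatsKeys.text_length] = 5  # no such constant registered
except AttributeError:
    typo_caught = True
```

The registration cost is the flip side of this check: a key that has not been added to the class cannot be used at all.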

Related Pages

Implemented By
