Implementation:Datajuicer Data juicer StatsKeysConstant Registration

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Data_Quality
Last Updated 2026-02-14 17:00 GMT

Overview

Pattern documentation for registering quality metric key names in the Data-Juicer StatsKeysConstant registry.

Description

StatsKeysConstant is a class in data_juicer/utils/constant.py that defines one string constant per quality metric. Operators reference these constants through the StatsKeys instance when writing to or reading from a sample's __dj__stats__ dictionary. The Fields class defines the top-level field names (e.g., Fields.stats = "__dj__stats__", Fields.meta = "__dj__meta__").
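
To make the relationship between Fields and StatsKeys concrete, the sketch below writes and reads two metrics on a single sample. It uses stand-in classes that copy a few constants from this page so that it runs without data_juicer installed; the "text" key and the sample contents are assumptions made for illustration.

```python
# Stand-in definitions mirroring constants documented on this page,
# so the sketch runs without data_juicer installed.
class Fields:
    stats = "__dj__stats__"
    meta = "__dj__meta__"


class StatsKeysConstant:
    text_len = "text_len"
    words_num = "words_num"


StatsKeys = StatsKeysConstant()

# A sample is a plain dict; computed metrics live under Fields.stats.
# The "text" key is an assumption for this example.
sample = {"text": "Data-Juicer cleans data.", Fields.stats: {}}

# Write: an operator stores each metric under its registered key.
sample[Fields.stats][StatsKeys.text_len] = len(sample["text"])
sample[Fields.stats][StatsKeys.words_num] = len(sample["text"].split())

# Read: a later step looks the metric up through the same constant.
text_len = sample[Fields.stats][StatsKeys.text_len]
```

Going through the constants rather than raw strings means a misspelled key fails with an AttributeError instead of silently creating a new entry in the stats dictionary.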

Usage

Add a new attribute to StatsKeysConstant when defining a new quality metric for a custom filter. Reference it via StatsKeys.your_key_name in operator code.
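
The following is a minimal sketch of that workflow, not Data-Juicer's own code. The metric digit_ratio and the class DigitRatioFilter are hypothetical names, and the stand-in Fields and StatsKeysConstant classes replace the real ones so the example is self-contained. A real filter would subclass Data-Juicer's filter base class and the new attribute would be added in data_juicer/utils/constant.py.

```python
# Stand-ins for the real classes in data_juicer/utils/constant.py.
class Fields:
    stats = "__dj__stats__"


class StatsKeysConstant:
    text_len = "text_len"
    # New attribute registered for the hypothetical metric below.
    digit_ratio = "digit_ratio"


StatsKeys = StatsKeysConstant()


class DigitRatioFilter:
    """Hypothetical filter that keeps samples with few digit characters."""

    def __init__(self, max_ratio=0.5, text_key="text"):
        self.max_ratio = max_ratio
        self.text_key = text_key

    def compute_stats_single(self, sample, context=False):
        stats = sample[Fields.stats]
        # Skip the computation if the metric is already present.
        if StatsKeys.digit_ratio in stats:
            return sample
        text = sample[self.text_key]
        digits = sum(ch.isdigit() for ch in text)
        # Guard against empty text to avoid division by zero.
        stats[StatsKeys.digit_ratio] = digits / len(text) if text else 0.0
        return sample

    def process_single(self, sample):
        # Keep the sample only if the stored metric is within the threshold.
        return sample[Fields.stats][StatsKeys.digit_ratio] <= self.max_ratio
```

Here compute_stats_single writes the metric under StatsKeys.digit_ratio and process_single reads it back through the same constant, so the key name is spelled in exactly one place.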

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/utils/constant.py
  • Lines: L15-30 (Fields), L238-367 (StatsKeysConstant)

Interface Specification

class Fields(object):
    stats = "__dj__stats__"
    meta = "__dj__meta__"
    batch_meta = "__dj__batch_meta__"
    context = "__dj__context__"
    suffix = "__dj__suffix__"
    source_file = "__dj__source_file__"

class StatsKeysConstant(object):
    # Text statistics
    alpha_token_ratio = "alpha_token_ratio"
    alnum_ratio = "alnum_ratio"
    avg_line_length = "avg_line_length"
    char_rep_ratio = "char_rep_ratio"
    flagged_words_ratio = "flagged_words_ratio"
    lang = "lang"
    lang_score = "lang_score"
    max_line_length = "max_line_length"
    perplexity = "perplexity"
    special_char_ratio = "special_char_ratio"
    stopwords_ratio = "stopwords_ratio"
    text_len = "text_len"
    word_rep_ratio = "word_rep_ratio"
    words_num = "words_num"
    # ... more keys for image, video, audio metrics

# Singleton instance
StatsKeys = StatsKeysConstant()

Import

from data_juicer.utils.constant import Fields, StatsKeys

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| key_name | str | Yes | Attribute name to add to StatsKeysConstant |
| key_value | str | Yes | String value for the stats key (usually same as attribute name) |

Outputs

| Name | Type | Description |
|---|---|---|
| StatsKeys.key_name | str | Accessible constant for use in operator code |
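
Because every key_value becomes a key in the same stats dictionary, two attributes that share a value would overwrite each other's metric. The helper below is a hypothetical check, not part of Data-Juicer, that one might run after adding a key. It is demonstrated on a small stand-in class rather than the real StatsKeysConstant.

```python
from collections import defaultdict


def find_duplicate_values(constants_cls):
    """Return {value: [attribute names]} for string values used more than once.

    Only attributes defined directly on constants_cls are inspected;
    inherited attributes and names starting with "_" are ignored.
    """
    names_by_value = defaultdict(list)
    for name, value in vars(constants_cls).items():
        if not name.startswith("_") and isinstance(value, str):
            names_by_value[value].append(name)
    return {v: names for v, names in names_by_value.items() if len(names) > 1}


# Stand-in class with a deliberate collision for demonstration.
class DemoKeys:
    text_len = "text_len"
    my_custom_metric = "text_len"  # clashes with text_len
    words_num = "words_num"


duplicates = find_duplicate_values(DemoKeys)
```

An empty result means every attribute maps to a distinct stats key.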

Usage Examples

Defining and Using a Stats Key

from data_juicer.utils.constant import Fields, StatsKeys

# In your custom filter's compute_stats_single method:
def compute_stats_single(self, sample, context=False):
    text = sample[self.text_key]
    # Use the standard key
    sample[Fields.stats][StatsKeys.text_len] = len(text)
    return sample

# For a new custom key, add to StatsKeysConstant:
# class StatsKeysConstant:
#     my_custom_metric = "my_custom_metric"
