Implementation:Datajuicer Data juicer StatsKeysConstant Registration
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Quality |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Pattern documentation for registering quality metric key names in the Data-Juicer StatsKeysConstant registry.
Description
StatsKeysConstant is a class in data_juicer/utils/constant.py that defines string constant attributes for all quality metrics. Operators reference these constants when writing to or reading from the __dj__stats__ dictionary. The Fields class defines the top-level field names (e.g., stats = __dj__stats__, meta = __dj__meta__).
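The relationship between the two classes can be sketched without installing Data-Juicer. The snippet below uses local stand-ins that copy only a few attributes from the interface listed on this page; the `record_text_len` helper and the sample text are illustrative assumptions, not part of the library.

```python
# Local stand-ins mirroring a small part of data_juicer/utils/constant.py,
# so the sketch runs without Data-Juicer installed.
class Fields(object):
    stats = "__dj__stats__"
    meta = "__dj__meta__"


class StatsKeysConstant(object):
    text_len = "text_len"
    words_num = "words_num"


StatsKeys = StatsKeysConstant()


def record_text_len(sample):
    """Hypothetical helper: write one metric under the stats field."""
    stats = sample.setdefault(Fields.stats, {})
    stats[StatsKeys.text_len] = len(sample["text"])
    return sample


sample = record_text_len({"text": "hello world"})
# The constants resolve to plain strings, so the sample is an ordinary nested dict.
print(sample)
```

Going through the constants rather than string literals means a mistyped key name fails with an `AttributeError` instead of silently creating a new entry in the stats dictionary.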
Usage
Add a new attribute to StatsKeysConstant when defining a new quality metric for a custom filter. In operator code, reference it as StatsKeys.your_key_name, where StatsKeys is the module-level instance of StatsKeysConstant.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/utils/constant.py
- Lines: L15-30 (Fields), L238-367 (StatsKeysConstant)
Interface Specification
class Fields(object):
    stats = "__dj__stats__"
    meta = "__dj__meta__"
    batch_meta = "__dj__batch_meta__"
    context = "__dj__context__"
    suffix = "__dj__suffix__"
    source_file = "__dj__source_file__"

class StatsKeysConstant(object):
    # Text statistics
    alpha_token_ratio = "alpha_token_ratio"
    alnum_ratio = "alnum_ratio"
    avg_line_length = "avg_line_length"
    char_rep_ratio = "char_rep_ratio"
    flagged_words_ratio = "flagged_words_ratio"
    lang = "lang"
    lang_score = "lang_score"
    max_line_length = "max_line_length"
    perplexity = "perplexity"
    special_char_ratio = "special_char_ratio"
    stopwords_ratio = "stopwords_ratio"
    text_len = "text_len"
    word_rep_ratio = "word_rep_ratio"
    words_num = "words_num"
    # ... more keys for image, video, audio metrics

# Singleton instance
StatsKeys = StatsKeysConstant()
Import
from data_juicer.utils.constant import Fields, StatsKeys
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| key_name | str | Yes | Attribute name to add to StatsKeysConstant |
| key_value | str | Yes | String value for the stats key (usually same as attribute name) |
Outputs
| Name | Type | Description |
|---|---|---|
| StatsKeys.key_name | str | Accessible constant for use in operator code |
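Since the key value is usually, but not necessarily, the same as the attribute name, a small consistency check can help when registering new keys. The following is a minimal sketch against a stand-in class; `registered_keys` and `check_registry` are hypothetical helpers, not functions provided by Data-Juicer, and `my_custom_metric` is an example key.

```python
class StatsKeysConstant(object):
    # Stand-in with a few keys copied from the interface above.
    alnum_ratio = "alnum_ratio"
    lang = "lang"
    text_len = "text_len"
    # Hypothetical new key being registered.
    my_custom_metric = "my_custom_metric"


def registered_keys(cls):
    """Return {attribute_name: value} for public string attributes of cls."""
    return {
        name: value
        for name, value in vars(cls).items()
        if not name.startswith("_") and isinstance(value, str)
    }


def check_registry(cls):
    """Report values that differ from their attribute name or are shared."""
    keys = registered_keys(cls)
    mismatched = sorted(n for n, v in keys.items() if n != v)
    seen, duplicated = set(), set()
    for value in keys.values():
        if value in seen:
            duplicated.add(value)
        seen.add(value)
    return mismatched, sorted(duplicated)


mismatched, duplicated = check_registry(StatsKeysConstant)
print(mismatched, duplicated)
```

A mismatch is not an error in itself, because the contract only says the value is usually the attribute name. A duplicated value is the more serious case: two metrics would write to the same entry of the stats dictionary.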
Usage Examples
Defining and Using a Stats Key
from data_juicer.utils.constant import Fields, StatsKeys

# In your custom filter's compute_stats (a method of the filter class):
def compute_stats_single(self, sample, context=False):
    text = sample[self.text_key]
    # Use the standard key
    sample[Fields.stats][StatsKeys.text_len] = len(text)
    return sample

# For a new custom key, add to StatsKeysConstant:
# class StatsKeysConstant:
#     my_custom_metric = "my_custom_metric"
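To show the custom key in use from end to end, here is a self-contained sketch with stand-in constants. `MyCustomFilter`, its uppercase-ratio metric, and the `max_ratio` threshold are hypothetical; the class does not subclass Data-Juicer's filter base, and `process_single` is assumed here to be the step that reads the metric back and returns a keep/drop decision.

```python
class Fields(object):
    stats = "__dj__stats__"


class StatsKeysConstant(object):
    text_len = "text_len"
    # Hypothetical new key, registered as described above.
    my_custom_metric = "my_custom_metric"


StatsKeys = StatsKeysConstant()


class MyCustomFilter(object):
    """Hypothetical filter; it does not subclass Data-Juicer's filter base."""

    def __init__(self, text_key="text", max_ratio=0.5):
        self.text_key = text_key
        self.max_ratio = max_ratio

    def compute_stats_single(self, sample, context=False):
        stats = sample.setdefault(Fields.stats, {})
        # Skip recomputation if the metric is already present.
        if StatsKeys.my_custom_metric in stats:
            return sample
        text = sample[self.text_key]
        upper = sum(1 for ch in text if ch.isupper())
        stats[StatsKeys.my_custom_metric] = upper / len(text) if text else 0.0
        return sample

    def process_single(self, sample):
        # Read the metric back through the same constant that wrote it.
        return sample[Fields.stats][StatsKeys.my_custom_metric] <= self.max_ratio


op = MyCustomFilter()
kept = [
    s["text"]
    for s in ({"text": "Hello"}, {"text": "SHOUTING"})
    if op.process_single(op.compute_stats_single(s))
]
print(kept)
```

Both methods go through `StatsKeys.my_custom_metric`, so renaming the underlying string later only requires a change in the registry class.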