Implementation:Datajuicer Data juicer AlphanumericFilter

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Filtering
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering data samples based on their alphanumeric character ratio.

Description

AlphanumericFilter is a filter operator that keeps samples whose alphanumeric ratio falls within a specified range. Depending on the tokenization flag, it computes either alnum_ratio (alphanumeric characters / total characters) or alpha_token_ratio (alphabetic characters / tokenizer tokens); the token-based variant uses a HuggingFace tokenizer (EleutherAI/pythia-6.9b-deduped). It extends the Filter base class and implements the two-phase compute_stats/process pattern.
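As a rough illustration of the character-level case, the sketch below computes an alphanumeric ratio in plain Python and applies a range check. It does not call Data-Juicer. The helper names `alnum_ratio` and `keep_sample` are hypothetical, and both the use of `str.isalnum` as the definition of "alphanumeric" and the inclusive bounds are assumptions that may differ from the operator's actual implementation.

```python
import sys


def alnum_ratio(text: str) -> float:
    """Fraction of characters in `text` that are alphanumeric (hypothetical helper)."""
    if not text:
        return 0.0  # assumption: an empty text gets ratio 0
    alnum_count = sum(1 for ch in text if ch.isalnum())
    return alnum_count / len(text)


def keep_sample(text: str, min_ratio: float = 0.25, max_ratio: float = sys.maxsize) -> bool:
    """Keep the sample only if its ratio lies in [min_ratio, max_ratio] (assumed inclusive)."""
    return min_ratio <= alnum_ratio(text) <= max_ratio


if __name__ == "__main__":
    print(alnum_ratio("ab!!"))   # half of the characters are alphanumeric
    print(keep_sample("ab!!"))   # ratio 0.5 is within the default range
    print(keep_sample("!!!!"))   # ratio 0.0 is below min_ratio
```

With the default max_ratio of sys.maxsize, only the lower bound has any practical effect, since a character ratio cannot exceed 1.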

Usage

Import this operator when you need to filter dataset samples based on the ratio of alphanumeric characters to total text length. Configure it in your Data-Juicer YAML config or instantiate directly.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/filter/alphanumeric_filter.py
Lines: 1-89

Signature

@OPERATORS.register_module("alphanumeric_filter")
class AlphanumericFilter(Filter):
    def __init__(
        self, tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = sys.maxsize, *args, **kwargs
    ):
        ...

Import

from data_juicer.ops.filter.alphanumeric_filter import AlphanumericFilter

I/O Contract

Inputs

Name	Type	Required	Description
tokenization	bool	No	Whether to compute the ratio against the total number of tokens instead of the total number of characters. Default: False
min_ratio	float	No	The minimum filter ratio; samples below this are filtered out. Default: 0.25
max_ratio	float	No	The maximum filter ratio; samples above this are filtered out. Default: sys.maxsize

Outputs

Name	Type	Description
samples	Dict	Filtered samples with stats field updated (alnum_ratio or alpha_token_ratio)
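The contract above can be pictured as two passes over each sample: one that writes the statistic into the sample, and one that reads it back to decide whether to keep the sample. The following is a minimal standalone sketch of that idea, not Data-Juicer's code. The sample layout (a dict with `text` and `stats` keys) and the function bodies are assumptions made for illustration.

```python
import sys


def compute_stats(sample: dict) -> dict:
    """Phase 1: store alnum_ratio under the sample's stats field (assumed key: "stats")."""
    stats = sample.setdefault("stats", {})
    if "alnum_ratio" not in stats:  # skip recomputation if the stat is already present
        text = sample["text"]
        stats["alnum_ratio"] = (
            sum(ch.isalnum() for ch in text) / len(text) if text else 0.0
        )
    return sample


def process(sample: dict, min_ratio: float = 0.25, max_ratio: float = sys.maxsize) -> bool:
    """Phase 2: decide from the stored stat whether the sample is kept."""
    return min_ratio <= sample["stats"]["alnum_ratio"] <= max_ratio


samples = [{"text": "Hello world 2024"}, {"text": "#### ---- ####"}]
kept = [s for s in map(compute_stats, samples) if process(s)]

for s in kept:
    print(s["text"], s["stats"]["alnum_ratio"])
```

Separating the two phases means the statistic stays attached to each kept sample, so it can be inspected or reused with different thresholds without recomputing it.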

Usage Examples

YAML Configuration

process:
  - alphanumeric_filter:
      tokenization: false
      min_ratio: 0.25
      max_ratio: 9223372036854775807

Python API

from data_juicer.ops.filter.alphanumeric_filter import AlphanumericFilter

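# Keep samples whose alphanumeric character ratio is at least 0.25
op = AlphanumericFilter(tokenization=False, min_ratio=0.25)
# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset
result = dataset.process(op)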

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
