Implementation:Datajuicer Data juicer AlphanumericFilter

From Leeroopedia
Knowledge Sources
Domains Data_Quality, Filtering
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering data samples based on their alphanumeric character ratio.

Description

AlphanumericFilter is a filter operator that keeps samples whose alphabetic/numeric ratio falls within a specified range. Depending on the tokenization flag, it computes either alnum_ratio (alphanumeric characters / total characters) or alpha_token_ratio (alphabetic characters / tokenizer tokens); in the latter case it uses a HuggingFace tokenizer (EleutherAI/pythia-6.9b-deduped). It extends the Filter base class and implements the two-phase compute_stats/process pattern: compute_stats records the ratio in the sample's stats field, and process keeps or drops the sample based on that value.
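The two ratios described above can be illustrated with a minimal pure-Python sketch. It is not the operator's code: the function names are hypothetical, whitespace splitting stands in for the HuggingFace tokenizer, and returning 0.0 for empty input is an assumption of this sketch.

```python
def alnum_ratio(text: str) -> float:
    """Alphanumeric characters divided by total characters."""
    if not text:
        return 0.0  # assumption: empty text yields a ratio of 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def alpha_token_ratio(text: str, tokenize=str.split) -> float:
    """Alphabetic characters divided by the number of tokens.

    `tokenize` is a stand-in (whitespace split); the real operator
    is described as using a HuggingFace tokenizer instead.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # assumption: no tokens yields a ratio of 0.0
    return sum(ch.isalpha() for ch in text) / len(tokens)


print(alnum_ratio("abc 123!"))           # 6 alphanumeric characters out of 8
print(alpha_token_ratio("hello world"))  # 10 letters over 2 tokens
```

Note that the character-based ratio is bounded by 1, whereas the token-based ratio is roughly "letters per token" and will usually exceed 1, so the two modes call for different thresholds.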

Usage

Import this operator when you need to filter dataset samples based on the ratio of alphanumeric characters to total text length. Configure it in your Data-Juicer YAML config or instantiate it directly in Python.

Code Reference

Source Location

Signature

@OPERATORS.register_module("alphanumeric_filter")
class AlphanumericFilter(Filter):
    def __init__(
        self, tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = sys.maxsize, *args, **kwargs
    ):
        ...

Import

from data_juicer.ops.filter.alphanumeric_filter import AlphanumericFilter

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| tokenization | bool | No | Whether to compute the ratio against the total number of tokens instead of the total number of characters. Default: False |
| min_ratio | float | No | The minimum ratio; samples with a ratio below this are filtered out. Default: 0.25 |
| max_ratio | float | No | The maximum ratio; samples with a ratio above this are filtered out. Default: sys.maxsize |
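As a rough illustration of how min_ratio and max_ratio act on a computed ratio, the following sketch applies the two bounds to example values. The helper name `keep_sample` and the example ratios are hypothetical, and treating both bounds as inclusive is an assumption of this sketch.

```python
import sys


def keep_sample(ratio: float, min_ratio: float = 0.25,
                max_ratio: float = sys.maxsize) -> bool:
    # Assumption: both bounds are inclusive.
    return min_ratio <= ratio <= max_ratio


# Hypothetical precomputed character-level ratios.
ratios = {"plain prose": 0.82, "mostly symbols": 0.10}
kept = {name: keep_sample(r) for name, r in ratios.items()}
print(kept)
```

Because a character-level ratio cannot exceed 1, the default max_ratio effectively leaves the upper bound open in that mode; an explicit upper bound matters mainly when tokenization is enabled.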

Outputs

| Name | Type | Description |
|------|------|-------------|
| samples | Dict | Filtered samples with stats field updated (alnum_ratio or alpha_token_ratio) |
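The contract above (a statistic written into each sample, then a keep/drop decision) can be sketched on plain dictionaries. This is a simplified stand-in, not Data-Juicer's implementation: the "text" and "stats" keys and the standalone function signatures are assumptions chosen for illustration.

```python
def compute_stats(sample: dict) -> dict:
    """Phase 1: record alnum_ratio in the sample's stats (hypothetical 'stats' key)."""
    stats = sample.setdefault("stats", {})
    if "alnum_ratio" not in stats:  # skip work if the stat is already present
        text = sample["text"]
        stats["alnum_ratio"] = (
            sum(ch.isalnum() for ch in text) / len(text) if text else 0.0
        )
    return sample


def process(sample: dict, min_ratio: float = 0.25,
            max_ratio: float = float("inf")) -> bool:
    """Phase 2: decide from the recorded stat whether the sample is kept."""
    return min_ratio <= sample["stats"]["alnum_ratio"] <= max_ratio


samples = [{"text": "Hello world 42"}, {"text": "#### ---- ####"}]
kept = [s for s in map(compute_stats, samples) if process(s)]
print([s["text"] for s in kept])
```

Separating the two phases means the recorded ratio stays available on every sample, so thresholds can be changed or inspected without recomputing the statistic.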

Usage Examples

YAML Configuration

process:
  - alphanumeric_filter:
      tokenization: false
      min_ratio: 0.25
      max_ratio: 9223372036854775807

Python API

from data_juicer.ops.filter.alphanumeric_filter import AlphanumericFilter

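# Keep samples whose alphanumeric character ratio is at least 0.25
op = AlphanumericFilter(tokenization=False, min_ratio=0.25)
# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset
result = dataset.process(op)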
