Implementation:Datajuicer Data juicer AlphanumericFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete Data-Juicer operator for filtering data samples by their alphanumeric character ratio.
Description
AlphanumericFilter is a filter operator that keeps samples whose alphanumeric ratio falls within a specified range. Depending on the tokenization flag, it computes either alnum_ratio (alphanumeric characters / total characters) or alpha_token_ratio (alphabetic characters / tokenizer tokens); the latter uses a HuggingFace tokenizer (EleutherAI/pythia-6.9b-deduped). It extends the Filter base class and implements the two-phase compute_stats/process pattern.
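The two ratios described above can be sketched in plain Python. This is a minimal illustration, not the operator's source: the function names are hypothetical, returning 0.0 for empty input is an assumption, and `str.split` stands in for the HuggingFace tokenizer so the example runs without downloading a model.

```python
from typing import Callable, List


def alnum_ratio(text: str) -> float:
    """Alphanumeric characters divided by total characters."""
    if not text:
        return 0.0  # assumption: empty text scores 0.0 instead of raising
    return sum(ch.isalnum() for ch in text) / len(text)


def alpha_token_ratio(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Alphabetic characters divided by the number of tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # assumption: no tokens scores 0.0 instead of raising
    return sum(ch.isalpha() for ch in text) / len(tokens)


# Character mode: 6 alphanumeric characters out of 7.
print(alnum_ratio("abc 123"))

# Token mode with a whitespace tokenizer as a stand-in:
# 10 alphabetic characters over 3 tokens.
print(alpha_token_ratio("hello world 42", str.split))
```

Note that the token-based ratio counts characters per token, so it can exceed 1, whereas the character-based ratio always lies between 0 and 1.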
Usage
Import this operator when you need to filter dataset samples based on the ratio of alphanumeric characters to total text length. Configure it in your Data-Juicer YAML config or instantiate directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/alphanumeric_filter.py
- Lines: 1-89
Signature
@OPERATORS.register_module("alphanumeric_filter")
class AlphanumericFilter(Filter):
    def __init__(
        self, tokenization: bool = False, min_ratio: float = 0.25, max_ratio: float = sys.maxsize, *args, **kwargs
    ):
        ...
Import
from data_juicer.ops.filter.alphanumeric_filter import AlphanumericFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenization | bool | No | If True, compute the ratio of alphabetic characters to tokenizer tokens (alpha_token_ratio) instead of alphanumeric characters to total characters (alnum_ratio). Default: False |
| min_ratio | float | No | The minimum filter ratio; samples below this are filtered out. Default: 0.25 |
| max_ratio | float | No | The maximum filter ratio; samples above this are filtered out. Default: sys.maxsize |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (alnum_ratio or alpha_token_ratio) |
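The contract above can be illustrated with a standalone sketch of the two-phase pattern: compute_stats records the ratio on the sample, and process turns it into a keep/drop decision. The class name, the `"text"` key, and the `"stats"` key are hypothetical placeholders chosen for this example, not Data-Juicer's actual field names, and only the character-based mode is shown.

```python
import sys
from typing import Any, Dict

TEXT_KEY = "text"    # hypothetical key holding the sample text
STATS_KEY = "stats"  # hypothetical key holding per-sample statistics


class SketchAlnumFilter:
    """Illustrative stand-in for the compute_stats/process pattern."""

    def __init__(self, min_ratio: float = 0.25, max_ratio: float = sys.maxsize):
        self.min_ratio = min_ratio
        self.max_ratio = max_ratio

    def compute_stats(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        # Phase 1: compute the statistic once and store it on the sample.
        stats = sample.setdefault(STATS_KEY, {})
        if "alnum_ratio" not in stats:
            text = sample[TEXT_KEY]
            stats["alnum_ratio"] = (
                sum(ch.isalnum() for ch in text) / len(text) if text else 0.0
            )
        return sample

    def process(self, sample: Dict[str, Any]) -> bool:
        # Phase 2: keep the sample only if the stored ratio is in range.
        ratio = sample[STATS_KEY]["alnum_ratio"]
        return self.min_ratio <= ratio <= self.max_ratio


op = SketchAlnumFilter()
samples = [{"text": "Hello world 2024"}, {"text": "!!! ??? ... ---"}]
kept = [s for s in map(op.compute_stats, samples) if op.process(s)]
print([s[TEXT_KEY] for s in kept])
```

Separating the two phases means the statistic stays attached to each sample, so the thresholds can be changed and reapplied without recomputing it.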
Usage Examples
YAML Configuration
process:
  - alphanumeric_filter:
      tokenization: false
      min_ratio: 0.25
      max_ratio: 9223372036854775807
Python API
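from data_juicer.ops.filter.alphanumeric_filter import AlphanumericFilter

# Keep samples whose alnum_ratio is at least 0.25 (character-based mode)
op = AlphanumericFilter(tokenization=False, min_ratio=0.25)

# Apply to a dataset; `dataset` is assumed to be a Data-Juicer dataset loaded elsewhere
result = dataset.process(op)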