Implementation:Datajuicer Data juicer RemoveLongWordsMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete Data-Juicer operator for removing words whose length falls outside a specified range.
Description
RemoveLongWordsMapper is a mapper operator that removes words shorter than the specified minimum length or longer than the specified maximum length. Each word is first checked at its original length; a word that exceeds the maximum is stripped of special characters and re-evaluated, so it is dropped only if it is still out of range after stripping. The text is split on whitespace, tab, and newline boundaries, only words within the defined length range are retained, and the text is then reassembled. The operator runs in batched mode.
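The keep-or-drop rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Data-Juicer's implementation: `SPECIAL_CHARS`, `should_keep_word`, and `filter_words` are hypothetical names, the special-character set is a small stand-in for whatever the library uses, stripping is assumed to affect only leading and trailing characters, and splitting each line on single spaces simplifies the whitespace, tab, and newline handling.

```python
import string
import sys

# Assumption: a small stand-in for the library's special-character set.
SPECIAL_CHARS = string.punctuation


def should_keep_word(word: str, min_len: int = 1, max_len: int = sys.maxsize) -> bool:
    """Keep a word if its raw or stripped length lies within [min_len, max_len]."""
    if min_len <= len(word) <= max_len:
        return True
    # Second chance: ignore leading/trailing special characters.
    # Stripping can only shorten a word, so this rescues only over-long words.
    stripped = word.strip(SPECIAL_CHARS)
    return min_len <= len(stripped) <= max_len


def filter_words(text: str, min_len: int = 1, max_len: int = sys.maxsize) -> str:
    """Drop out-of-range words line by line, preserving newlines."""
    kept_lines = []
    for line in text.split("\n"):
        words = [w for w in line.split(" ") if should_keep_word(w, min_len, max_len)]
        kept_lines.append(" ".join(words))
    return "\n".join(kept_lines)


# A 32-character base64 token is dropped when max_len is 10.
print(filter_words("see aGVsbG8gd29ybGQgaGVsbG8gd29ybGQ= now", max_len=10))
# "(parenthetical)" is 15 characters raw but 13 after stripping, so it is kept.
print(filter_words("a (parenthetical) word", max_len=13))
```

Note that in this sketch the stripped form is used only to decide whether to keep the word; a retained word appears in the output unchanged, including its punctuation.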
Usage
Use this operator when cleaning training text to remove garbled tokens, base64 strings, long URLs, and other non-natural-language artifacts.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/remove_long_words_mapper.py
Signature
@OPERATORS.register_module("remove_long_words_mapper")
class RemoveLongWordsMapper(Mapper):
    def __init__(self, min_len: int = 1, max_len: int = sys.maxsize, *args, **kwargs):
Import
from data_juicer.ops.mapper.remove_long_words_mapper import RemoveLongWordsMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| min_len | int | No | Minimum word length to keep (default: 1) |
| max_len | int | No | Maximum word length to keep (default: sys.maxsize) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with out-of-range words removed |
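The batched contract can be pictured as a dict of columns, where each column is a list holding one entry per sample. The sketch below is only an illustration of that shape: `process_batched_sketch` is a hypothetical stand-in, the `text` key is an assumed column name, and the filter checks raw word length only, without the special-character re-check.

```python
import sys
from typing import Dict, List


def process_batched_sketch(
    samples: Dict[str, List[str]],
    min_len: int = 1,
    max_len: int = sys.maxsize,
    text_key: str = "text",  # assumption: name of the text column
) -> Dict[str, List[str]]:
    """Rewrite each text in the batch in place and return the same dict."""
    for idx, text in enumerate(samples[text_key]):
        # Simplification: split on any whitespace and check raw length only.
        kept = [w for w in text.split() if min_len <= len(w) <= max_len]
        samples[text_key][idx] = " ".join(kept)
    return samples


batch = {
    "text": ["short words stay", "supercalifragilisticexpialidocious goes"],
    "id": ["0", "1"],
}
out = process_batched_sketch(batch, max_len=10)
print(out["text"])
```

Only the text column is rewritten; other columns in the batch pass through untouched.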
Usage Examples
process:
  - remove_long_words_mapper:
      min_len: 1
      max_len: 40