Implementation:Datajuicer Data juicer WordRepetitionFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples by their word-level n-gram repetition ratio.
Description
WordRepetitionFilter is a filter operator that keeps samples whose word-level n-gram repetition ratio falls within a specified range. It extends Filter and follows the two-phase compute_stats/process pattern. In compute_stats, it tokenizes the text with either a SentencePiece model or whitespace splitting, extracts word n-grams of configurable length (rep_len, default 10), counts how often each n-gram occurs, and computes word_rep_ratio as the fraction of n-gram occurrences that are repeated. The result is cached with the sample, and process keeps the sample only if the ratio lies within [min_ratio, max_ratio]. The operator supports operator fusion via INTER_WORDS, which lets fused operators reuse the tokenized words. It is adapted from HuggingFace's text data filtering and complements CharacterRepetitionFilter by detecting repetition at the word level.
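The ratio described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: it assumes whitespace tokenization with no word normalization, and the function name `word_rep_ratio` and the choice to return 0.0 for texts shorter than rep_len are assumptions made for this example.

```python
from collections import Counter


def word_rep_ratio(text: str, rep_len: int = 10) -> float:
    """Fraction of word n-gram occurrences whose n-gram appears more than once."""
    words = text.split()  # assumption: whitespace tokenization only
    # Slide a window of rep_len consecutive words over the text.
    ngrams = [" ".join(words[i:i + rep_len]) for i in range(len(words) - rep_len + 1)]
    if not ngrams:
        return 0.0  # assumption: texts shorter than rep_len count as non-repetitive
    counts = Counter(ngrams)
    # Sum the occurrences of every n-gram that shows up at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

For example, with rep_len=2 the text "a b a b c" yields four bigrams, of which "a b" occurs twice, so the ratio is 2/4 = 0.5.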
Usage
Import this operator when samples should be filtered by word repetition ratio. It can be configured in a YAML recipe or instantiated directly in Python.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/word_repetition_filter.py
Signature
@OPERATORS.register_module("word_repetition_filter")
class WordRepetitionFilter(Filter):
    def __init__(self, lang: str = "en", tokenization: bool = False, rep_len: PositiveInt = 10, min_ratio: float = 0.0, max_ratio: float = 0.5, *args, **kwargs):
Import
from data_juicer.ops.filter.word_repetition_filter import WordRepetitionFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lang | str | No | Language of the text (default: "en") |
| tokenization | bool | No | Whether to tokenize with a SentencePiece model instead of whitespace splitting (default: False) |
| rep_len | PositiveInt | No | N-gram length, in words, used for repetition detection (default: 10) |
| min_ratio | float | No | Lower bound on word_rep_ratio for a sample to be kept (default: 0.0) |
| max_ratio | float | No | Upper bound on word_rep_ratio for a sample to be kept (default: 0.5) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with word_rep_ratio stat computed |
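How the cached stat drives the keep/drop decision can be sketched as follows. This is a self-contained illustration of the two-phase pattern, not Data-Juicer's code: the sample layout (a `stats` dict next to `text`), the helper names `compute_stats` and `keep`, and the `ratio_fn` callback are all assumptions made for this example.

```python
from typing import Callable


def compute_stats(sample: dict, ratio_fn: Callable[[str], float]) -> dict:
    """Phase 1: compute word_rep_ratio once and cache it in the sample's stats."""
    stats = sample.setdefault("stats", {})
    if "word_rep_ratio" not in stats:  # reuse a cached value if one is present
        stats["word_rep_ratio"] = ratio_fn(sample["text"])
    return sample


def keep(sample: dict, min_ratio: float = 0.0, max_ratio: float = 0.5) -> bool:
    """Phase 2: keep the sample only if the cached ratio is in [min_ratio, max_ratio]."""
    return min_ratio <= sample["stats"]["word_rep_ratio"] <= max_ratio


# Hypothetical samples whose stats were already computed in phase 1.
samples = [
    {"text": "clean", "stats": {"word_rep_ratio": 0.1}},
    {"text": "spammy", "stats": {"word_rep_ratio": 0.8}},
]
kept = [s for s in samples if keep(s)]
```

Separating the two phases means the ratio is computed once per sample and the thresholds can be checked against the stored value afterwards.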
Usage Examples
YAML Configuration
process:
  - word_repetition_filter:
      lang: en
      rep_len: 10
      max_ratio: 0.5
Python API
from data_juicer.ops.filter.word_repetition_filter import WordRepetitionFilter
op = WordRepetitionFilter(rep_len=10, max_ratio=0.5)