Implementation:Datajuicer Data juicer FlaggedWordFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples based on their flagged-word ratio.
Description
FlaggedWordFilter is a filter operator that keeps samples whose flagged-word ratio falls within a specified range. It relies on a list of flagged words, which can be language-specific or merged from multiple languages (lang="all"). The flagged-word ratio is the number of flagged words in a sample divided by the sample's total number of words. If tokenization is enabled, a HuggingFace sentencepiece tokenizer is used to split the text into words. The operator also supports word augmentation for certain languages (e.g., Chinese and Vietnamese). The key metric, flagged_words_ratio, is cached in the sample's stats field. The class extends the Filter base class and implements the two-phase compute_stats/process pattern: compute_stats calculates and stores the ratio, and process decides from the stored value whether to keep the sample.
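The ratio and the two-phase pattern described above can be sketched in plain Python. This is a minimal illustration, not the Data-Juicer implementation: the helper names, the whitespace split, the "stats" key, the 0.0 ratio for empty text, the inclusive bounds, and the placeholder word list are all assumptions made for the example.

```python
from typing import Dict, List, Set


def flagged_words_ratio(words: List[str], flagged: Set[str]) -> float:
    """Number of flagged words divided by the total number of words."""
    if not words:
        return 0.0  # assumption: empty text counts as ratio 0.0
    hits = sum(1 for word in words if word in flagged)
    return hits / len(words)


def compute_stats(sample: Dict, flagged: Set[str]) -> Dict:
    """Phase 1: compute the metric once and cache it in the stats field."""
    stats = sample.setdefault("stats", {})  # hypothetical key name
    if "flagged_words_ratio" not in stats:
        words = sample["text"].lower().split()  # naive whitespace split
        stats["flagged_words_ratio"] = flagged_words_ratio(words, flagged)
    return sample


def keep(sample: Dict, min_ratio: float = 0.0, max_ratio: float = 0.045) -> bool:
    """Phase 2: decide from the cached metric (bounds assumed inclusive)."""
    ratio = sample["stats"]["flagged_words_ratio"]
    return min_ratio <= ratio <= max_ratio


FLAGGED = {"darn", "heck"}  # placeholder list for illustration only
clean = compute_stats({"text": "a perfectly ordinary sentence"}, FLAGGED)
noisy = compute_stats({"text": "darn heck darn this"}, FLAGGED)
print(keep(clean), keep(noisy))
```

Separating the two phases means the ratio is computed once per sample and can be reused, for example when the thresholds are tuned without re-reading the text.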
Usage
Import this operator when you need to filter dataset samples based on the proportion of flagged or inappropriate words. Configure it in your Data-Juicer YAML config or instantiate it directly in Python.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/flagged_words_filter.py
- Lines: 1-149
Signature
@OPERATORS.register_module("flagged_words_filter")
@INTER_WORDS.register_module("flagged_words_filter")
class FlaggedWordFilter(Filter):
    def __init__(
        self,
        lang: str = "en",
        tokenization: bool = False,
        min_ratio: float = 0.0,
        max_ratio: float = 0.045,
        flagged_words_dir: str = ASSET_DIR,
        use_words_aug: bool = False,
        words_aug_group_sizes: List[PositiveInt] = [2],
        words_aug_join_char: str = "",
        *args,
        **kwargs,
    ):
        ...
Import
from data_juicer.ops.filter.flagged_words_filter import FlaggedWordFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lang | str | No | Language for flagged words list. Use "all" for merged list. Default: "en" |
| tokenization | bool | No | Whether to use a model to tokenize documents. Default: False |
| min_ratio | float | No | The minimum flagged words ratio. Default: 0.0 |
| max_ratio | float | No | The maximum flagged words ratio. Default: 0.045 |
| flagged_words_dir | str | No | Directory storing flagged_words files in JSON format. Default: ASSET_DIR |
| use_words_aug | bool | No | Whether to augment words, especially for Chinese and Vietnamese. Default: False |
| words_aug_group_sizes | List[PositiveInt] | No | The group size of words to augment. Default: [2] |
| words_aug_join_char | str | No | The join character between words to augment. Default: "" |
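To illustrate how words_aug_group_sizes and words_aug_join_char might interact, the sketch below builds groups of consecutive words and joins them with the configured character. The helper augment_words is hypothetical, and both the sliding-window grouping and the way augmented words are appended to the original list are assumptions for illustration rather than the operator's confirmed behavior.

```python
from typing import List


def augment_words(words: List[str], group_size: int, join_char: str = "") -> List[str]:
    """Join every run of `group_size` consecutive words with `join_char`."""
    return [
        join_char.join(words[i:i + group_size])
        for i in range(len(words) - group_size + 1)
    ]


tokens = ["a", "b", "c"]
group_sizes = [2]  # mirrors the default words_aug_group_sizes
join_char = ""     # mirrors the default words_aug_join_char

# Assumption: augmented groups are added alongside the original words.
augmented = list(tokens)
for size in group_sizes:
    augmented += augment_words(tokens, size, join_char)
print(augmented)  # ['a', 'b', 'c', 'ab', 'bc']
```

Grouping of this kind is useful for languages where a flagged term can span several tokens, since the joined form can then match an entry in the flagged-words list.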
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (flagged_words_ratio) |
Usage Examples
YAML Configuration
process:
  - flagged_words_filter:
      lang: "en"
      tokenization: false
      min_ratio: 0.0
      max_ratio: 0.045
Python API
from data_juicer.ops.filter.flagged_words_filter import FlaggedWordFilter

op = FlaggedWordFilter(lang="en", min_ratio=0.0, max_ratio=0.045)

# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset
result = dataset.process(op)