Implementation:Datajuicer Data juicer FlaggedWordFilter

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Filtering
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering data samples by their flagged-word ratio.

Description

FlaggedWordFilter is a filter operator that keeps samples whose flagged-word ratio falls between min_ratio and max_ratio. It relies on a list of flagged words, which can be language-specific or merged from multiple languages (lang="all"). The flagged-word ratio is the number of flagged words in a sample divided by the sample's total number of words. If tokenization is enabled, a Hugging Face sentencepiece tokenizer splits the text into words. For languages where word boundaries are less clear-cut (e.g., Chinese and Vietnamese), the operator can optionally augment the word list by joining adjacent words into groups. The resulting metric, flagged_words_ratio, is cached in the sample's stats field. The operator extends the Filter base class and implements the two-phase compute_stats/process pattern: compute_stats calculates and stores the ratio, and process decides whether to keep the sample.
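The ratio described above can be illustrated with a minimal sketch. This is not the operator's actual code: the function name flagged_words_ratio and the plain whitespace split are assumptions made for illustration, and any text normalization the real operator may apply is omitted.

```python
from typing import Iterable, Set


def flagged_words_ratio(words: Iterable[str], flagged: Set[str]) -> float:
    """Return (number of flagged words) / (total number of words).

    Hypothetical helper for illustration only; an empty word list
    is treated as ratio 0.0 to avoid division by zero.
    """
    words = list(words)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in flagged)
    return hits / len(words)


# Worked example: 7 words, 1 of them flagged, so the ratio is 1/7.
words = "this sample has one badword in it".split()
ratio = flagged_words_ratio(words, {"badword"})

# With the default max_ratio of 0.045, this sample would exceed the limit.
exceeds_default_max = ratio > 0.045
```

Because short samples have few words, a single flagged word can push the ratio well above the default threshold, which is worth keeping in mind when tuning max_ratio.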

Usage

Import this operator when you need to filter dataset samples by the proportion of flagged or inappropriate words they contain. Configure it in your Data-Juicer YAML config or instantiate it directly in Python.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/filter/flagged_words_filter.py
Lines: 1-149

Signature

@OPERATORS.register_module("flagged_words_filter")
@INTER_WORDS.register_module("flagged_words_filter")
class FlaggedWordFilter(Filter):
    def __init__(
        self,
        lang: str = "en",
        tokenization: bool = False,
        min_ratio: float = 0.0,
        max_ratio: float = 0.045,
        flagged_words_dir: str = ASSET_DIR,
        use_words_aug: bool = False,
        words_aug_group_sizes: List[PositiveInt] = [2],
        words_aug_join_char: str = "",
        *args,
        **kwargs,
    ):
        ...

Import

from data_juicer.ops.filter.flagged_words_filter import FlaggedWordFilter

I/O Contract

Inputs

Name	Type	Required	Description
lang	str	No	Language for flagged words list. Use "all" for merged list. Default: "en"
tokenization	bool	No	Whether to use a model to tokenize documents. Default: False
min_ratio	float	No	The minimum flagged words ratio. Default: 0.0
max_ratio	float	No	The maximum flagged words ratio. Default: 0.045
flagged_words_dir	str	No	Directory storing flagged_words files in JSON format. Default: ASSET_DIR
use_words_aug	bool	No	Whether to augment words, especially for Chinese and Vietnamese. Default: False
words_aug_group_sizes	List[PositiveInt]	No	The group size of words to augment. Default: [2]
words_aug_join_char	str	No	The join character between words to augment. Default: ""
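To make the roles of words_aug_group_sizes and words_aug_join_char more concrete, the following sketch shows one plausible reading of word augmentation: adjacent words are joined into groups of the given size, and the joined strings are added to the word list before matching. The helper augment_words and its exact windowing are assumptions for illustration, not the library's implementation.

```python
from typing import List


def augment_words(words: List[str], group_size: int, join_char: str = "") -> List[str]:
    """Join each run of `group_size` adjacent words into one string.

    Hypothetical helper: a sliding window over the word list, so
    ["a", "b", "c"] with group_size=2 yields ["ab", "bc"].
    """
    return [
        join_char.join(words[i:i + group_size])
        for i in range(len(words) - group_size + 1)
    ]


# Segmented Chinese words; a flagged term may span two segments.
words = ["数据", "清洗", "工具"]
augmented = list(words)
for size in [2]:  # mirrors the default words_aug_group_sizes
    augmented += augment_words(words, size, join_char="")
```

In this sketch an empty join character suits languages written without spaces between words, while a space would be the natural choice where words are space-separated.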

Outputs

Name	Type	Description
samples	Dict	Filtered samples with stats field updated (flagged_words_ratio)
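The two-phase pattern behind this output can be sketched with a toy class. Everything here is a simplified stand-in under stated assumptions: ToyFlaggedWordsFilter is hypothetical, the "stats" key is a placeholder for whatever field name the library uses internally, the lowercase whitespace split replaces real tokenization, and the range check is assumed to be inclusive on both ends.

```python
from typing import Any, Dict, Set

STATS_KEY = "stats"  # placeholder key, not necessarily the library's field name
RATIO_KEY = "flagged_words_ratio"


class ToyFlaggedWordsFilter:
    """Toy stand-in showing the compute_stats/process split."""

    def __init__(self, flagged: Set[str], min_ratio: float = 0.0, max_ratio: float = 0.045):
        self.flagged = flagged
        self.min_ratio = min_ratio
        self.max_ratio = max_ratio

    def compute_stats(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        stats = sample.setdefault(STATS_KEY, {})
        if RATIO_KEY in stats:
            return sample  # reuse the cached value instead of recomputing
        words = sample["text"].lower().split()
        hits = sum(w in self.flagged for w in words)
        stats[RATIO_KEY] = hits / len(words) if words else 0.0
        return sample

    def process(self, sample: Dict[str, Any]) -> bool:
        # Keep the sample only if the cached ratio lies in the range.
        ratio = sample[STATS_KEY][RATIO_KEY]
        return self.min_ratio <= ratio <= self.max_ratio


samples = [{"text": "a clean sentence"}, {"text": "badword badword here"}]
op = ToyFlaggedWordsFilter({"badword"})
kept = [s for s in map(op.compute_stats, samples) if op.process(s)]
```

Separating the two phases means the ratio stays attached to each sample, so it can be inspected or reused with different thresholds without recomputing it.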

Usage Examples

YAML Configuration

process:
  - flagged_words_filter:
      lang: "en"
      tokenization: false
      min_ratio: 0.0
      max_ratio: 0.045

Python API

from data_juicer.ops.filter.flagged_words_filter import FlaggedWordFilter

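# Keep samples whose flagged-word ratio is between 0.0 and 0.045
op = FlaggedWordFilter(lang="en", min_ratio=0.0, max_ratio=0.045)
# Apply to a dataset (`dataset` is assumed to be a Data-Juicer dataset loaded elsewhere)
result = dataset.process(op)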

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
