Implementation:Datajuicer Data juicer FlaggedWordFilter

From Leeroopedia
Knowledge Sources
Domains Data_Quality, Filtering
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool provided by Data-Juicer for filtering data samples based on their flagged-word ratio.

Description

FlaggedWordFilter is a filter operator that keeps samples whose flagged-word ratio falls within a specified range (min_ratio to max_ratio). The ratio is the number of flagged words divided by the total number of words in the sample. Flagged words come from a list that is either language-specific or, with lang set to "all", merged across languages. If tokenization is enabled, a HuggingFace sentencepiece tokenizer is used. Word augmentation is supported for certain languages (e.g., Chinese and Vietnamese). The operator extends the Filter base class and follows the two-phase compute_stats/process pattern, with the key metric, flagged_words_ratio, cached in the stats field.
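
The ratio and range check described above can be sketched in plain Python. This is a minimal illustration, not Data-Juicer's implementation: the helper names `flagged_words_ratio` and `keep_sample`, the whitespace tokenization, the lowercasing, the placeholder word set, and the inclusive bounds are all assumptions made for the example.

```python
from typing import Set


def flagged_words_ratio(text: str, flagged_words: Set[str]) -> float:
    """Return (# flagged words) / (# words) for one text (hypothetical helper)."""
    # Assumption: split on whitespace and lowercase; the real operator may
    # normalize words differently or use a sentencepiece tokenizer.
    words = [w.lower() for w in text.split()]
    if not words:
        return 0.0  # avoid division by zero on empty text
    flagged = sum(1 for w in words if w in flagged_words)
    return flagged / len(words)


def keep_sample(ratio: float, min_ratio: float = 0.0, max_ratio: float = 0.045) -> bool:
    """Keep a sample when its ratio lies in the range (bounds assumed inclusive)."""
    return min_ratio <= ratio <= max_ratio


FLAGGED = {"darn", "heck"}  # placeholder list, not a real flagged-words file

clean_ratio = flagged_words_ratio("the quick brown fox", FLAGGED)
noisy_ratio = flagged_words_ratio("darn that heck of a fox", FLAGGED)
```

Here the second text has two flagged words out of six, so its ratio is far above the default max_ratio of 0.045 and the sample would be dropped, while the first text has a ratio of zero and would be kept.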

Usage

Import this operator when you need to filter dataset samples based on the proportion of flagged or inappropriate words. Configure it in your Data-Juicer YAML config or instantiate directly.

Code Reference

Source Location

Signature

@OPERATORS.register_module("flagged_words_filter")
@INTER_WORDS.register_module("flagged_words_filter")
class FlaggedWordFilter(Filter):
    def __init__(
        self,
        lang: str = "en",
        tokenization: bool = False,
        min_ratio: float = 0.0,
        max_ratio: float = 0.045,
        flagged_words_dir: str = ASSET_DIR,
        use_words_aug: bool = False,
        words_aug_group_sizes: List[PositiveInt] = [2],
        words_aug_join_char: str = "",
        *args,
        **kwargs,
    ):
        ...

Import

from data_juicer.ops.filter.flagged_words_filter import FlaggedWordFilter

I/O Contract

Inputs

Name | Type | Required | Description
lang | str | No | Language for flagged words list. Use "all" for merged list. Default: "en"
tokenization | bool | No | Whether to use a model to tokenize documents. Default: False
min_ratio | float | No | The minimum flagged words ratio. Default: 0.0
max_ratio | float | No | The maximum flagged words ratio. Default: 0.045
flagged_words_dir | str | No | Directory storing flagged_words files in JSON format. Default: ASSET_DIR
use_words_aug | bool | No | Whether to augment words, especially for Chinese and Vietnamese. Default: False
words_aug_group_sizes | List[PositiveInt] | No | The group size of words to augment. Default: [2]
words_aug_join_char | str | No | The join character between words to augment. Default: ""
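
To illustrate how words_aug_group_sizes and words_aug_join_char might interact, the sketch below joins each run of adjacent words into a candidate compound, so that a flagged term split across several tokens can still match. The function name `augment_words` is hypothetical, and the exact augmentation performed by Data-Juicer may differ.

```python
from typing import List


def augment_words(words: List[str], group_sizes: List[int], join_char: str = "") -> List[str]:
    """Return the original words plus joined n-grams for each group size."""
    augmented = list(words)
    for size in group_sizes:
        # Slide a window of `size` adjacent words and join them.
        for start in range(len(words) - size + 1):
            augmented.append(join_char.join(words[start:start + size]))
    return augmented


tokens = ["a", "b", "c"]
joined = augment_words(tokens, group_sizes=[2], join_char="")
spaced = augment_words(tokens, group_sizes=[2, 3], join_char=" ")
```

With the default group size of 2 and an empty join character, every pair of neighbouring tokens is concatenated and added to the word list before flagged words are counted.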

Outputs

Name | Type | Description
samples | Dict | Filtered samples with stats field updated (flagged_words_ratio)
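
The two-phase pattern behind this contract can be sketched as follows: a first pass computes the statistic and stores it in the sample's stats field, and a second pass reads it back to decide whether to keep the sample. The sample layout (a "text" key and a "stats" dict), the class name `TinyFlaggedWordFilter`, and the method signatures are simplifying assumptions; Data-Juicer's actual field keys and signatures are not reproduced here.

```python
from typing import Any, Dict, Set


class TinyFlaggedWordFilter:
    """Toy stand-in for a compute_stats/process filter (not the Data-Juicer class)."""

    def __init__(self, flagged_words: Set[str], min_ratio: float = 0.0, max_ratio: float = 0.045):
        self.flagged_words = flagged_words
        self.min_ratio = min_ratio
        self.max_ratio = max_ratio

    def compute_stats(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        stats = sample.setdefault("stats", {})
        # Reuse the cached value if an earlier pass already computed it.
        if "flagged_words_ratio" in stats:
            return sample
        words = sample["text"].lower().split()  # assumed whitespace tokenization
        hits = sum(1 for w in words if w in self.flagged_words)
        stats["flagged_words_ratio"] = hits / len(words) if words else 0.0
        return sample

    def process(self, sample: Dict[str, Any]) -> bool:
        # Phase two only reads the cached statistic.
        ratio = sample["stats"]["flagged_words_ratio"]
        return self.min_ratio <= ratio <= self.max_ratio


op = TinyFlaggedWordFilter(flagged_words={"darn"})
samples = [{"text": "a perfectly fine sentence"}, {"text": "darn it"}]
kept = [s for s in map(op.compute_stats, samples) if op.process(s)]
```

Separating the two phases means the statistic can be inspected or reused after filtering, since every sample carries its flagged_words_ratio whether or not it was kept.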

Usage Examples

YAML Configuration

process:
  - flagged_words_filter:
      lang: "en"
      tokenization: false
      min_ratio: 0.0
      max_ratio: 0.045

Python API

from data_juicer.ops.filter.flagged_words_filter import FlaggedWordFilter

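# Keep samples whose flagged-word ratio is between 0.0 and 0.045
op = FlaggedWordFilter(lang="en", min_ratio=0.0, max_ratio=0.045)
# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset
result = dataset.process(op)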

Related Pages
