Implementation:Datajuicer Data juicer StopWordsFilter

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Filtering
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering data samples by their stopword ratio.

Description

StopWordsFilter is a filter operator that keeps samples whose ratio of stopwords to total words falls within a specified range. It extends Filter and follows the two-phase compute_stats/process pattern. It loads language-specific stopword lists from asset files, tokenizes the text (optionally using a HuggingFace tokenizer), computes stopwords_ratio as the number of stopwords divided by the total number of words, and caches the result; the sample is then kept only if that ratio lies between min_ratio and max_ratio. The operator supports word augmentation for languages such as Chinese and Vietnamese, and operator fusion via INTER_WORDS so that already-tokenized words can be reused. The rationale is that text with too few stopwords is likely to be a keyword list, a table, or code rather than natural language.
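The two-phase flow described above can be illustrated with a minimal, standalone sketch. It does not use Data-Juicer itself: the tiny STOPWORDS set, the whitespace tokenization, the function names, and the 0.0 ratio for empty text are all assumptions made for illustration, not the operator's actual code.

```python
from typing import Dict, Set

# Hypothetical miniature stopword set; the real operator loads
# language-specific lists from asset files.
STOPWORDS: Set[str] = {"the", "a", "is", "of", "and", "to", "in"}


def compute_stats(sample: Dict, stopwords: Set[str] = STOPWORDS) -> Dict:
    """Phase 1: compute stopwords_ratio and cache it under sample['stats']."""
    stats = sample.setdefault("stats", {})
    if "stopwords_ratio" in stats:
        return sample  # ratio already cached, nothing to recompute
    # Assumption: plain lowercase whitespace splitting stands in for tokenization.
    words = sample["text"].lower().split()
    n_stop = sum(1 for w in words if w in stopwords)
    # Assumption: empty text gets a ratio of 0.0 to avoid division by zero.
    stats["stopwords_ratio"] = n_stop / len(words) if words else 0.0
    return sample


def process(sample: Dict, min_ratio: float = 0.3, max_ratio: float = 1.0) -> bool:
    """Phase 2: keep the sample only if the cached ratio is within range."""
    ratio = sample["stats"]["stopwords_ratio"]
    return min_ratio <= ratio <= max_ratio


prose = compute_stats({"text": "The cat is in the garden"})
keywords = compute_stats({"text": "python pandas numpy scipy"})
```

With the default thresholds, the prose sample (4 stopwords out of 6 words) would be kept, while the keyword list (no stopwords) would be dropped, which matches the intuition that keyword lists are not natural language.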

Usage

Import this operator when a pipeline needs to filter samples by stopword ratio. It can be configured in YAML or constructed directly in Python.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/filter/stopwords_filter.py

Signature

@OPERATORS.register_module("stopwords_filter")
class StopWordsFilter(Filter):
    def __init__(
        self,
        lang: str = "en",
        tokenization: bool = False,
        min_ratio: float = 0.3,
        max_ratio: float = 1.0,
        stopwords_dir: str = ASSET_DIR,
        use_words_aug: bool = False,
        words_aug_group_sizes: List[PositiveInt] = [2],
        words_aug_join_char: str = "",
        *args,
        **kwargs,
    ):

Import

from data_juicer.ops.filter.stopwords_filter import StopWordsFilter

I/O Contract

Inputs

Name	Type	Required	Description
lang	str	No	Language of the stopword list (default: "en"; use "all" for the merged list)
tokenization	bool	No	Whether to use model tokenizer (default: False)
min_ratio	float	No	Minimum stopword ratio (default: 0.3)
max_ratio	float	No	Maximum stopword ratio (default: 1.0)
stopwords_dir	str	No	Directory storing stopwords files (default: ASSET_DIR)
use_words_aug	bool	No	Whether to augment words for Chinese/Vietnamese (default: False)
words_aug_group_sizes	List[PositiveInt]	No	Group size of words to augment (default: [2])
words_aug_join_char	str	No	Join char between augmented words (default: "")
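To make the use_words_aug, words_aug_group_sizes, and words_aug_join_char parameters more concrete, here is a hedged sketch of one plausible reading of word augmentation: adjacent words are grouped in sliding windows of each configured size and joined with the join character, so that multi-token stopwords can still be matched. The function augment_words is hypothetical and is not taken from the Data-Juicer source.

```python
from typing import List


def augment_words(words: List[str], group_sizes: List[int], join_char: str = "") -> List[str]:
    """Return the original words plus joined sliding windows for each group size.

    Hypothetical helper: an assumed interpretation of word augmentation,
    not Data-Juicer's actual implementation.
    """
    augmented = list(words)
    for size in group_sizes:
        # Slide a window of `size` adjacent words across the sequence.
        for i in range(len(words) - size + 1):
            augmented.append(join_char.join(words[i:i + size]))
    return augmented


# Character-level tokens, as might occur for Chinese text.
tokens = ["我", "们", "的"]
result = augment_words(tokens, group_sizes=[2], join_char="")
```

Under this reading, a two-character stopword such as "我们" that was split into single characters would reappear in the augmented list and could then be counted as a stopword.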

Outputs

Name	Type	Description
samples	Dict	Samples that pass the filter, each with the stopwords_ratio stat computed

Usage Examples

YAML Configuration

process:
  - stopwords_filter:
      lang: en
      min_ratio: 0.3
      max_ratio: 1.0

Python API

from data_juicer.ops.filter.stopwords_filter import StopWordsFilter

# Keep samples whose English stopword ratio is at least 0.3 (max_ratio defaults to 1.0)
op = StopWordsFilter(lang="en", min_ratio=0.3)

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
