Implementation:Datajuicer Data juicer StopWordsFilter

From Leeroopedia
Knowledge Sources
Domains: Data_Quality, Filtering
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering data samples based on their stopword ratio.

Description

StopWordsFilter is a filter operator that keeps samples whose ratio of stopwords to total words falls within a specified range. It extends Filter and follows the two-phase compute_stats/process pattern. The operator loads language-specific stopword lists from asset files, tokenizes the text (optionally with a HuggingFace tokenizer), computes stopwords_ratio as the number of stopwords divided by the total number of words, and caches the result. It supports word augmentation for languages such as Chinese and Vietnamese, and operator fusion via INTER_WORDS so that already tokenized words can be reused. The rationale for the lower bound is that text with too few stopwords may be a keyword list, a table, or code rather than natural language.
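The ratio and the range check described above can be illustrated with a minimal sketch. It is not the Data-Juicer implementation: the toy stopword set, the whitespace tokenization, the helper names `stopwords_ratio` and `keep_sample`, and the choice to return 0.0 for empty text are all assumptions made for illustration.

```python
from typing import Iterable, Set

# Hypothetical miniature stopword list. The real operator loads
# language-specific lists from asset files.
TOY_STOPWORDS: Set[str] = {"the", "is", "a", "of", "and", "to", "in", "on"}


def stopwords_ratio(words: Iterable[str], stopwords: Set[str]) -> float:
    """Return (number of stopwords) / (total number of words)."""
    words = list(words)
    if not words:
        # Assumption: treat empty text as having a ratio of 0.0.
        return 0.0
    hits = sum(1 for w in words if w in stopwords)
    return hits / len(words)


def keep_sample(text: str, stopwords: Set[str],
                min_ratio: float = 0.3, max_ratio: float = 1.0) -> bool:
    """Keep the sample only if its ratio lies in [min_ratio, max_ratio]."""
    # Assumption: lowercase and split on whitespace instead of using
    # a language-aware or model-based tokenizer.
    words = text.lower().split()
    ratio = stopwords_ratio(words, stopwords)
    return min_ratio <= ratio <= max_ratio


prose = "the cat is on the mat in a corner of the room"
keywords = "python pandas numpy scipy sklearn"

print(keep_sample(prose, TOY_STOPWORDS))     # natural-language sentence
print(keep_sample(keywords, TOY_STOPWORDS))  # bare keyword list
```

With the default range, the prose sentence (8 stopwords out of 12 words) passes, while the keyword list (no stopwords) is rejected.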

Usage

Import this operator when samples should be filtered by their stopword ratio. It can be configured in a YAML recipe or instantiated directly in Python.

Code Reference

Source Location

Signature

@OPERATORS.register_module("stopwords_filter")
class StopWordsFilter(Filter):
    def __init__(self, lang: str = "en", tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = ASSET_DIR, use_words_aug: bool = False, words_aug_group_sizes: List[PositiveInt] = [2], words_aug_join_char: str = "", *args, **kwargs):

Import

from data_juicer.ops.filter.stopwords_filter import StopWordsFilter

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| lang | str | No | Language for stopwords (default: "en"; use "all" for merged) |
| tokenization | bool | No | Whether to use model tokenizer (default: False) |
| min_ratio | float | No | Minimum stopword ratio (default: 0.3) |
| max_ratio | float | No | Maximum stopword ratio (default: 1.0) |
| stopwords_dir | str | No | Directory storing stopwords files (default: ASSET_DIR) |
| use_words_aug | bool | No | Whether to augment words for Chinese/Vietnamese (default: False) |
| words_aug_group_sizes | List[PositiveInt] | No | Group sizes of words to augment (default: [2]) |
| words_aug_join_char | str | No | Join char between augmented words (default: "") |
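One plausible reading of the word augmentation parameters is that adjacent words are joined into groups of the given sizes, so that a stopword split across several tokens can still match the list. The sketch below illustrates that reading only. The helper names `augment_words` and `augment_all` are hypothetical, and the exact grouping used by Data-Juicer may differ.

```python
from typing import List


def augment_words(words: List[str], group_size: int,
                  join_char: str = "") -> List[str]:
    """Join every run of `group_size` adjacent words with `join_char`."""
    return [
        join_char.join(words[i:i + group_size])
        for i in range(len(words) - group_size + 1)
    ]


def augment_all(words: List[str], group_sizes: List[int],
                join_char: str = "") -> List[str]:
    """Original words followed by the groups for each requested size."""
    augmented = list(words)
    for size in group_sizes:
        augmented.extend(augment_words(words, size, join_char))
    return augmented


# Character-level tokens joined with "" (the documented default join char).
chars = ["我", "们", "的", "书"]
print(augment_words(chars, 2))   # pairs of adjacent tokens
print(augment_all(chars, [2]))   # originals plus pairs

# Space-separated syllables joined with " " (an assumed setting).
syllables = ["xin", "chao", "ban"]
print(augment_words(syllables, 2, " "))
```

In this sketch, a sequence shorter than the group size simply yields no groups.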

Outputs

| Name | Type | Description |
|------|------|-------------|
| samples | Dict | Filtered samples with stopwords_ratio stat computed |
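The two-phase pattern and the cached stopwords_ratio stat can be pictured with a toy stand-in class. This is a sketch under stated assumptions, not the operator's code: the class `ToyStopWordsFilter`, the `"text"` and `"stats"` field names, and the whitespace tokenization are hypothetical. Only the stat name stopwords_ratio and the compute_stats/process split come from the description above.

```python
from typing import Any, Dict, Set

STATS_KEY = "stats"            # hypothetical field holding cached stats
RATIO_KEY = "stopwords_ratio"  # stat name taken from the page


class ToyStopWordsFilter:
    """Toy stand-in showing the compute_stats / process split."""

    def __init__(self, stopwords: Set[str],
                 min_ratio: float = 0.3, max_ratio: float = 1.0):
        self.stopwords = stopwords
        self.min_ratio = min_ratio
        self.max_ratio = max_ratio

    def compute_stats(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        """Phase 1: compute the stat once and store it on the sample."""
        stats = sample.setdefault(STATS_KEY, {})
        if RATIO_KEY in stats:
            return sample  # already cached, skip recomputation
        words = sample["text"].lower().split()
        hits = sum(1 for w in words if w in self.stopwords)
        stats[RATIO_KEY] = hits / len(words) if words else 0.0
        return sample

    def process(self, sample: Dict[str, Any]) -> bool:
        """Phase 2: decide from the cached stat whether to keep the sample."""
        ratio = sample[STATS_KEY][RATIO_KEY]
        return self.min_ratio <= ratio <= self.max_ratio


samples = [
    {"text": "the cat is on the mat"},
    {"text": "gpu cuda kernel tensor"},
]
op = ToyStopWordsFilter({"the", "is", "on", "a"})
kept = [s for s in map(op.compute_stats, samples) if op.process(s)]
print(kept)
```

Separating the two phases means the decision step only reads a stored number, so a stat computed earlier does not need to be recomputed when the thresholds change.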

Usage Examples

YAML Configuration

process:
  - stopwords_filter:
      lang: en
      min_ratio: 0.3
      max_ratio: 1.0

Python API

from data_juicer.ops.filter.stopwords_filter import StopWordsFilter
op = StopWordsFilter(lang="en", min_ratio=0.3)
