Implementation:Datajuicer Data juicer StopWordsFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples based on their stopword ratio.
Description
StopWordsFilter is a filter operator that keeps samples whose ratio of stopwords to total words falls within a specified range. It extends Filter and follows the two-phase compute_stats/process pattern. It loads language-specific stopword lists from asset files, tokenizes the text (optionally with a HuggingFace tokenizer), computes stopwords_ratio as the number of stopwords divided by the total number of words, and caches the result so it is not recomputed. The operator also supports word augmentation for languages such as Chinese and Vietnamese, and operator fusion via INTER_WORDS so that already tokenized words can be reused. The rationale for the lower bound is that text with too few stopwords is often a keyword list, a table, or code rather than natural language.
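The two-phase flow described above can be sketched in plain Python. This is a minimal illustration and not the Data-Juicer implementation: the tiny `STOPWORDS` set, the regex tokenizer, the `"stats"` key, and the free functions `compute_stats` and `process` are all assumptions made for the example.

```python
import re

# Tiny illustrative stopword set (assumption); the real operator loads
# language-specific lists from asset files.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on", "it"}


def compute_stats(sample, stopwords=STOPWORDS):
    """Phase 1: compute stopwords_ratio and cache it on the sample."""
    stats = sample.setdefault("stats", {})  # hypothetical stats key
    if "stopwords_ratio" in stats:
        return sample  # already cached, nothing to recompute
    # Simple regex tokenization stands in for the operator's tokenizer.
    words = re.findall(r"\w+", sample["text"].lower())
    hits = sum(1 for w in words if w in stopwords)
    stats["stopwords_ratio"] = hits / len(words) if words else 0.0
    return sample


def process(sample, min_ratio=0.3, max_ratio=1.0):
    """Phase 2: keep the sample only if the cached ratio is in range."""
    return min_ratio <= sample["stats"]["stopwords_ratio"] <= max_ratio


prose = compute_stats({"text": "The cat is on the mat and it is happy"})
keywords = compute_stats({"text": "python pandas numpy regression clustering"})

print(prose["stats"]["stopwords_ratio"], process(prose))
print(keywords["stats"]["stopwords_ratio"], process(keywords))
```

With the default range, the natural-language sentence is kept and the keyword list is dropped, which matches the intuition that keyword lists contain few stopwords.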
Usage
Import this operator when you need to filter samples by stopword ratio. It can be configured in a YAML recipe or instantiated directly in Python.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/stopwords_filter.py
Signature
@OPERATORS.register_module("stopwords_filter")
class StopWordsFilter(Filter):
    def __init__(self, lang: str = "en", tokenization: bool = False, min_ratio: float = 0.3, max_ratio: float = 1.0, stopwords_dir: str = ASSET_DIR, use_words_aug: bool = False, words_aug_group_sizes: List[PositiveInt] = [2], words_aug_join_char: str = "", *args, **kwargs):
Import
from data_juicer.ops.filter.stopwords_filter import StopWordsFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lang | str | No | Language for stopwords (default: "en"; use "all" for merged) |
| tokenization | bool | No | Whether to use model tokenizer (default: False) |
| min_ratio | float | No | Minimum stopword ratio (default: 0.3) |
| max_ratio | float | No | Maximum stopword ratio (default: 1.0) |
| stopwords_dir | str | No | Directory storing stopwords files (default: ASSET_DIR) |
| use_words_aug | bool | No | Whether to augment words for Chinese/Vietnamese (default: False) |
| words_aug_group_sizes | List[PositiveInt] | No | Group size of words to augment (default: [2]) |
| words_aug_join_char | str | No | Join char between augmented words (default: "") |
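To make the three word-augmentation parameters more concrete, here is a rough sketch of one plausible reading: consecutive words are joined into groups of the configured size so that multi-token stopwords can still be matched. The helper name `augment_words` is hypothetical, and the library's exact behavior (for example at the end of the word list) may differ.

```python
from typing import List


def augment_words(words: List[str], group_size: int, join_char: str = "") -> List[str]:
    """Join each run of `group_size` consecutive words into one candidate token.

    Hypothetical helper for illustration; not the Data-Juicer function.
    """
    return [
        join_char.join(words[i:i + group_size])
        for i in range(len(words) - group_size + 1)
    ]


# Character-level tokens, as might come from a Chinese tokenizer.
words = ["我", "们", "的", "书"]

# With group size 2 and an empty join char, adjacent tokens are concatenated
# and appended to the original word list before stopword matching.
augmented = words + augment_words(words, group_size=2, join_char="")
print(augmented)
```

Under this reading, a stopword such as "我们" that was split into two tokens becomes matchable again once the augmented groups are included.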
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stopwords_ratio stat computed |
Usage Examples
YAML Configuration
process:
  - stopwords_filter:
      lang: en
      min_ratio: 0.3
      max_ratio: 1.0
Python API
from data_juicer.ops.filter.stopwords_filter import StopWordsFilter
op = StopWordsFilter(lang="en", min_ratio=0.3)