Implementation:NVIDIA NeMo Curator Chinese Stopwords

Knowledge Sources	NVIDIA NeMo Curator
Domains	NLP, Text Processing, Chinese Language, Data Curation
Last Updated	2026-02-14 00:00 GMT

Overview

Defines a frozen set of approximately 795 Chinese stopwords used for stopword density filtering during HTML text extraction and quality assessment.

Description

The zh_stopwords module provides a single immutable data structure, a Python frozenset, containing common Chinese function words, particles, conjunctions, pronouns, punctuation marks, and full-width characters. The stopword list is sourced from the stopwords-iso/stopwords-zh project and is designed to support Chinese language text quality filtering in the NeMo Curator HTML extraction pipeline.

The frozenset includes several categories of tokens:

Function words and particles: Common grammatical words such as "的", "了", "在", "是", "和", "但", "或" and many others
Pronouns: Personal pronouns like "我", "你", "他", "她", "它" and their plural forms ("我们", "你们", etc.)
Conjunctions and connectives: Words like "因为", "所以", "但是", "然而", "如果", "虽然"
Adverbs and prepositions: Words like "很", "就", "都", "从", "对", "向"
Chinese punctuation: Characters such as "、", "。", "《", "》"
Full-width characters: Full-width equivalents of ASCII characters including "！", "＃", "＄", "％", "（", "）", "＊", "＋", and full-width digits "０" through "９"
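The full-width entries in the last category follow a regular Unicode pattern: each one sits at a fixed offset of 0xFEE0 from its ASCII counterpart. The sketch below illustrates that relationship. The helper `to_fullwidth` is a hypothetical name introduced here, not part of NeMo Curator. It also shows that NFKC normalization folds full-width forms back to ASCII, so text normalized that way before lookup would presumably no longer match these entries.

```python
import unicodedata

# Offset between ASCII (U+0021..U+007E) and the Unicode full-width forms
# (U+FF01..U+FF5E).
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(text: str) -> str:
    """Map printable ASCII characters to their full-width equivalents.

    Hypothetical helper for illustration; characters outside the
    U+0021..U+007E range are returned unchanged.
    """
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )


# Full-width digits, built from the ASCII digits.
fullwidth_digits = to_fullwidth("0123456789")

# Full-width punctuation of the kind listed above.
fullwidth_punct = to_fullwidth("!#$%()*+")

# NFKC normalization maps full-width forms back to plain ASCII.
normalized_digits = unicodedata.normalize("NFKC", fullwidth_digits)
```

If a pipeline applies NFKC before the stopword lookup, it may be worth checking whether the full-width entries still contribute to the density count.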

Because the data structure is a frozenset, membership lookups take O(1) time on average. The object is also immutable and hashable, so it can be shared safely across multiple workers without locking.
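These frozenset properties can be checked directly. The following is a minimal sketch that uses a small hand-picked stand-in set (`sample_stopwords` is hypothetical, not the real module constant) so that it runs without NeMo Curator installed.

```python
# Hypothetical stand-in for zh_stopwords; the real constant has roughly 795 entries.
sample_stopwords = frozenset(["的", "了", "在", "是", "和", "我们", "。"])

# Membership tests are hash lookups (average-case O(1)).
is_stop = "的" in sample_stopwords
is_not_stop = "方案" in sample_stopwords

# A frozenset has no mutating methods, so accidental modification fails.
try:
    sample_stopwords.add("新")
    mutated = True
except AttributeError:
    mutated = False

# Because it is hashable, a frozenset can be used as a dict key.
registry = {sample_stopwords: "zh"}

# Set operations return a new frozenset and leave the original untouched.
extended = sample_stopwords | {"新"}
```

Extending the list for a custom pipeline therefore means building a new set, as in the last line, rather than modifying the shared constant.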

Usage

This module is used by the HTML text extraction pipeline to compute stopword density metrics for Chinese-language web content. By measuring the proportion of tokens in a document that appear in this stopword list, the pipeline can assess whether extracted text contains natural Chinese language content or is primarily boilerplate, navigation elements, or non-textual content. Higher stopword density generally indicates more natural prose, while very low density suggests that the extracted content is not useful text.
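The contrast between prose and boilerplate can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: `sample_stopwords` is a small hypothetical stand-in for the real set, the token lists are pre-segmented by hand, and `MIN_DENSITY` is an arbitrary cutoff chosen for the example rather than a NeMo Curator default.

```python
# Hypothetical stand-in for zh_stopwords; the real constant has roughly 795 entries.
sample_stopwords = frozenset(["的", "了", "在", "是", "和", "我们", "这个", "。"])


def stopword_density(tokens: list[str], stopwords: frozenset[str]) -> float:
    """Return the fraction of tokens found in the stopword set (0.0 if empty)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in stopwords) / len(tokens)


# Pre-segmented tokens; the segmentation step is outside the scope of this sketch.
prose_tokens = ["我们", "认为", "这个", "方案", "是", "可行", "的", "。"]
nav_tokens = ["首页", "新闻", "视频", "图片", "登录", "注册"]

prose_density = stopword_density(prose_tokens, sample_stopwords)
nav_density = stopword_density(nav_tokens, sample_stopwords)

# Illustrative cutoff only; not a value taken from NeMo Curator.
MIN_DENSITY = 0.3
keep_prose = prose_density >= MIN_DENSITY
keep_nav = nav_density >= MIN_DENSITY
```

In this toy input the sentence contains several function words while the navigation labels contain none, so only the sentence passes the cutoff. A real threshold would need to be tuned on representative data.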

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/html_extractors/utils/zh_stopwords.py
Lines: 1-798

Signature

zh_stopwords = frozenset(
    [
        "、",
        "。",
        "〈",
        "〉",
        "《",
        "》",
        "一",
        "一个",
        "一些",
        # ... approximately 795 entries total
        "￥",
    ]
)

Import

from nemo_curator.stages.text.download.html_extractors.utils.zh_stopwords import zh_stopwords

I/O Contract

Inputs

Name	Type	Required	Description
(none)	N/A	N/A	This module is a static data resource with no input parameters. It exports a constant.

Outputs

Name	Type	Description
zh_stopwords	`frozenset[str]`	An immutable set of approximately 795 Chinese stopword strings, including function words, particles, conjunctions, pronouns, punctuation, and full-width characters.

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.html_extractors.utils.zh_stopwords import zh_stopwords

# Check if a token is a Chinese stopword
token = "的"
if token in zh_stopwords:
    print(f"'{token}' is a Chinese stopword")

# Calculate stopword density for a tokenized text
tokens = ["这", "是", "一个", "非常", "好", "的", "例子"]
stopword_count = sum(1 for t in tokens if t in zh_stopwords)
density = stopword_count / len(tokens) if tokens else 0.0
print(f"Stopword density: {density:.2f}")

Filtering Application

from nemo_curator.stages.text.download.html_extractors.utils.zh_stopwords import zh_stopwords

def compute_chinese_stopword_density(text_tokens: list[str]) -> float:
    """Compute the fraction of tokens that are Chinese stopwords."""
    if not text_tokens:
        return 0.0
    count = sum(1 for token in text_tokens if token in zh_stopwords)
    return count / len(text_tokens)

# Use density as a quality signal for web-extracted Chinese text
tokens = ["我们", "认为", "这个", "方案", "是", "可行", "的"]
density = compute_chinese_stopword_density(tokens)
# A noticeable share of stopwords suggests natural Chinese prose;
# a value near 0.0 suggests boilerplate or non-textual content

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
