Implementation:NVIDIA NeMo Curator Chinese Stopwords

From Leeroopedia
Knowledge Sources
Domains NLP, Text Processing, Chinese Language, Data Curation
Last Updated 2026-02-14 00:00 GMT

Overview

Defines a frozen set of approximately 795 Chinese stopwords used for stopword density filtering during HTML text extraction and quality assessment.

Description

The zh_stopwords module provides a single immutable data structure, a Python frozenset, containing common Chinese function words, particles, conjunctions, pronouns, punctuation marks, and full-width characters. The stopword list is sourced from the stopwords-iso/stopwords-zh project and is designed to support Chinese language text quality filtering in the NeMo Curator HTML extraction pipeline.

The frozenset includes several categories of tokens:

  • Function words and particles: Common grammatical words such as "的", "了", "在", "是", "和", "但", "或" and many others
  • Pronouns: Personal pronouns like "我", "你", "他", "她", "它" and their plural forms ("我们", "你们", etc.)
  • Conjunctions and connectives: Words like "因为", "所以", "但是", "然而", "如果", "虽然"
  • Adverbs and prepositions: Words like "很", "就", "都", "从", "对", "向"
  • Chinese punctuation: Characters such as "、", "。", "《", "》"
  • Full-width characters: Full-width equivalents of ASCII characters including "!", "#", "$", "%", "(", ")", "*", "+", and full-width digits "0" through "9"
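
The categories above can be told apart with Python's standard unicodedata module. The following is a minimal sketch, not part of NeMo Curator: the classify helper and the sample list are hypothetical and only illustrate how punctuation and full-width forms differ from ordinary words at the code-point level.

```python
import unicodedata

def classify(token: str) -> str:
    """Roughly bucket a token (hypothetical helper, for illustration only)."""
    # East Asian Width "F" marks full-width compatibility forms of ASCII.
    if all(unicodedata.east_asian_width(ch) == "F" for ch in token):
        return "full-width form"
    # General categories starting with "P" are punctuation (Po, Ps, Pe, ...).
    if all(unicodedata.category(ch).startswith("P") for ch in token):
        return "punctuation"
    return "word"

# Hypothetical sample drawn from the categories listed above.
sample = [
    "的",       # function word
    "我们",     # pronoun
    "、",       # Chinese punctuation (U+3001)
    "《",       # Chinese punctuation (U+300A)
    "\uff01",   # full-width exclamation mark
    "\uff10",   # full-width digit zero
]

for token in sample:
    print(repr(token), classify(token))

# NFKC normalization folds full-width forms back to ASCII, so text normalized
# this way would no longer match the full-width entries in a stopword set.
folded = unicodedata.normalize("NFKC", "\uff01\uff10")
print(repr(folded))
```

One practical consequence: if an upstream step applies NFKC normalization before tokenization, the full-width entries will not match, while the Chinese punctuation entries such as "、" are unaffected.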

Because the data structure is a frozenset, membership tests take O(1) time on average, and the object is immutable and hashable. It cannot be modified after import, so it is safe to share read-only across multiple workers.
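
These properties can be checked on a small stand-in. The sketch below does not import the real module; the stopwords name holds a hypothetical four-word subset of entries named on this page.

```python
# Stand-in for zh_stopwords (assumption: a tiny subset, not the real list).
# The duplicate "的" collapses, as it would in any set.
stopwords = frozenset(["的", "了", "在", "是", "的"])
print(len(stopwords))

# Membership test: a hash lookup, independent of set size on average.
print("的" in stopwords)

# Immutable: frozenset has no mutating methods such as add().
try:
    stopwords.add("吗")
except AttributeError as exc:
    print("cannot modify:", exc)

# Hashable: a frozenset can be used as a dictionary key.
cache = {stopwords: "zh"}

# Set algebra returns a new frozenset and leaves the original untouched.
extended = stopwords | {"吗"}
print(type(extended).__name__, "吗" in stopwords)
```

Code that needs a customized list can therefore derive a new set with the union operator instead of mutating the shared constant.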

Usage

This module is used by the HTML text extraction pipeline to compute stopword density metrics for Chinese-language web content. By counting the proportion of tokens in a document that appear in this stopword list, the pipeline can assess whether extracted text contains natural Chinese language content or is primarily boilerplate, navigation elements, or non-textual content. Higher stopword density generally indicates more natural prose, while very low density may indicate extracted content that is not useful text.
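
As a rough illustration of this signal, the sketch below compares a prose-like token list with a navigation-like one. Everything in it is an assumption made for the example: STOPWORDS is a small stand-in subset rather than the real list, the token lists are hand-segmented, and MIN_DENSITY is a hypothetical threshold, not a value taken from NeMo Curator.

```python
# Stand-in subset of entries named on this page (not the real list).
STOPWORDS = frozenset(
    ["的", "了", "在", "是", "和", "我们", "因为", "所以", "很", "都"]
)

# Hypothetical cut-off; a real pipeline would tune its own value.
MIN_DENSITY = 0.3

def stopword_density(tokens: list[str]) -> float:
    """Fraction of tokens that appear in the stand-in stopword set."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in STOPWORDS) / len(tokens)

def looks_like_prose(tokens: list[str]) -> bool:
    """Keep text whose stopword density reaches the hypothetical threshold."""
    return stopword_density(tokens) >= MIN_DENSITY

# Hand-segmented examples: a sentence versus a row of menu labels.
prose = ["因为", "天气", "很", "好", "所以", "我们", "都", "在", "公园"]
navigation = ["首页", "登录", "注册", "联系", "帮助"]

for name, tokens in [("prose", prose), ("navigation", navigation)]:
    print(name, round(stopword_density(tokens), 2), looks_like_prose(tokens))
```

The sentence is dominated by connectives, adverbs, and pronouns, while the menu labels contain none, which is the contrast the density metric relies on.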

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/html_extractors/utils/zh_stopwords.py
  • Lines: 1-798

Signature

zh_stopwords = frozenset(
    [
        "、",
        "。",
        "〈",
        "〉",
        "《",
        "》",
        "一",
        "一个",
        "一些",
        # ... approximately 795 entries total
        "¥",
    ]
)

Import

from nemo_curator.stages.text.download.html_extractors.utils.zh_stopwords import zh_stopwords

I/O Contract

Inputs

Name Type Required Description
(none) N/A N/A This module is a static data resource with no input parameters. It exports a constant.

Outputs

Name Type Description
zh_stopwords frozenset[str] An immutable set of approximately 795 Chinese stopword strings, including function words, particles, conjunctions, pronouns, punctuation, and full-width characters.

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.html_extractors.utils.zh_stopwords import zh_stopwords

# Check if a token is a Chinese stopword
token = "的"
if token in zh_stopwords:
    print(f"'{token}' is a Chinese stopword")

# Calculate stopword density for a tokenized text
tokens = ["这", "是", "一个", "非常", "好", "的", "例子"]
stopword_count = sum(1 for t in tokens if t in zh_stopwords)
density = stopword_count / len(tokens) if tokens else 0.0
print(f"Stopword density: {density:.2f}")

Filtering Application

from nemo_curator.stages.text.download.html_extractors.utils.zh_stopwords import zh_stopwords

def compute_chinese_stopword_density(text_tokens: list[str]) -> float:
    """Compute the fraction of tokens that are Chinese stopwords."""
    if not text_tokens:
        return 0.0
    count = sum(1 for token in text_tokens if token in zh_stopwords)
    return count / len(text_tokens)

# Use density as a quality signal for web-extracted Chinese text
tokens = ["我们", "认为", "这个", "方案", "是", "可行", "的"]
density = compute_chinese_stopword_density(tokens)
# Very low density suggests boilerplate or non-text content;
# higher density suggests natural Chinese prose
print(f"Stopword density: {density:.2f}")
