Implementation:NVIDIA NeMo Curator Chinese Stopwords
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text Processing, Chinese Language, Data Curation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Defines a frozen set of approximately 795 Chinese stopwords used for stopword density filtering during HTML text extraction and quality assessment.
Description
The zh_stopwords module provides a single immutable data structure, a Python frozenset, containing common Chinese function words, particles, conjunctions, pronouns, punctuation marks, and full-width characters. The stopword list is sourced from the stopwords-iso/stopwords-zh project and is designed to support Chinese language text quality filtering in the NeMo Curator HTML extraction pipeline.
The frozenset includes several categories of tokens:
- Function words and particles: Common grammatical words such as "的", "了", "在", "是", "和", "但", "或" and many others
- Pronouns: Personal pronouns like "我", "你", "他", "她", "它" and their plural forms ("我们", "你们", etc.)
- Conjunctions and connectives: Words like "因为", "所以", "但是", "然而", "如果", "虽然"
- Adverbs and prepositions: Words like "很", "就", "都", "从", "对", "向"
- Chinese punctuation: Characters such as "、", "。", "《", "》"
- Full-width characters: Full-width equivalents of ASCII characters including "!", "#", "$", "%", "(", ")", "*", "+", and full-width digits "0" through "9"
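Full-width forms are separate Unicode code points from their ASCII counterparts, so a membership test matches only the exact form stored in the set. The sketch below illustrates this with the standard library alone; `fullwidth_sample` is a hypothetical stand-in for a few entries, not the real list.

```python
import unicodedata

# Hypothetical stand-in for a few full-width entries; not the real list.
fullwidth_sample = frozenset(["\uff01", "\uff03", "\uff10", "\uff19"])  # ! # 0 9

ascii_bang = "!"           # U+0021
fullwidth_bang = "\uff01"  # U+FF01, full-width exclamation mark

# The two forms are different strings, so only the full-width one matches.
ascii_matches = ascii_bang in fullwidth_sample
fullwidth_matches = fullwidth_bang in fullwidth_sample

# NFKC normalization folds full-width forms to their ASCII equivalents.
folded = unicodedata.normalize("NFKC", "\uff01\uff03\uff10\uff19")

print(ascii_matches, fullwidth_matches, folded)
```

If text is NFKC-normalized before the lookup, full-width entries would no longer match. Whether the pipeline applies such normalization is not stated here.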
Because the data structure is a frozenset, membership lookups take O(1) time on average, and the object is hashable and immutable, so it can be shared read-only across multiple workers without risk of modification.
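These properties can be checked with a small experiment. The sketch uses `sample_stopwords`, a hypothetical five-entry stand-in for the real set, so that it runs without NeMo Curator installed.

```python
# Hypothetical stand-in subset; the real module exports roughly 795 entries.
sample_stopwords = frozenset(["的", "了", "在", "是", "和"])

# Membership is a hash lookup, independent of set size on average.
is_stop = "的" in sample_stopwords

# frozenset has no mutating methods, so an attempt to add an entry fails.
try:
    sample_stopwords.add("但")
    mutated = True
except AttributeError:
    mutated = False

# Set algebra returns a new frozenset and leaves the original untouched.
extended = sample_stopwords | {"但"}

# Because it is hashable, a frozenset can be used as a dictionary key.
registry = {sample_stopwords: "zh"}

print(is_stop, mutated, len(sample_stopwords), len(extended))
```

To work with a customized list, derive a new set with `|` or `-` as above rather than trying to modify the exported constant.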
Usage
This module is used by the HTML text extraction pipeline to compute stopword density metrics for Chinese-language web content. By counting the proportion of tokens in a document that appear in this stopword list, the pipeline can assess whether extracted text contains natural Chinese language content or is primarily boilerplate, navigation elements, or non-textual content. Higher stopword density generally indicates more natural prose, while very low density may indicate extracted content that is not useful text.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/html_extractors/utils/zh_stopwords.py
- Lines: 1-798
Signature
zh_stopwords = frozenset(
    [
        "、",
        "。",
        "〈",
        "〉",
        "《",
        "》",
        "一",
        "一个",
        "一些",
        # ... approximately 795 entries total
        "¥",
    ]
)
Import
from nemo_curator.stages.text.download.html_extractors.utils.zh_stopwords import zh_stopwords
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | N/A | N/A | This module is a static data resource with no input parameters. It exports a constant. |
Outputs
| Name | Type | Description |
|---|---|---|
| zh_stopwords | frozenset[str] | An immutable set of approximately 795 Chinese stopword strings, including function words, particles, conjunctions, pronouns, punctuation, and full-width characters. |
Usage Examples
Basic Usage
from nemo_curator.stages.text.download.html_extractors.utils.zh_stopwords import zh_stopwords
# Check if a token is a Chinese stopword
token = "的"
if token in zh_stopwords:
    print(f"'{token}' is a Chinese stopword")
# Calculate stopword density for a tokenized text
tokens = ["这", "是", "一个", "非常", "好", "的", "例子"]
stopword_count = sum(1 for t in tokens if t in zh_stopwords)
density = stopword_count / len(tokens) if tokens else 0.0
print(f"Stopword density: {density:.2f}")
Filtering Application
from nemo_curator.stages.text.download.html_extractors.utils.zh_stopwords import zh_stopwords
def compute_chinese_stopword_density(text_tokens: list[str]) -> float:
    """Compute the fraction of tokens that are Chinese stopwords."""
    if not text_tokens:
        return 0.0
    count = sum(1 for token in text_tokens if token in zh_stopwords)
    return count / len(text_tokens)
# Use density as a quality signal for web-extracted Chinese text
tokens = ["我们", "认为", "这个", "方案", "是", "可行", "的"]
density = compute_chinese_stopword_density(tokens)
# A density in a reasonable range suggests natural Chinese prose
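One way to turn the density into a keep-or-drop decision is to compare it against bounds. The following is a minimal sketch, not the pipeline's actual logic: `MIN_DENSITY`, `MAX_DENSITY`, `looks_like_prose`, and the stand-in set `sample_stopwords` are all hypothetical names and values chosen for illustration, and real thresholds would need tuning on real data.

```python
# Hypothetical stand-in subset of the real stopword list.
sample_stopwords = frozenset(["我们", "这个", "是", "的"])

# Hypothetical bounds for illustration only; not taken from NeMo Curator.
MIN_DENSITY = 0.2
MAX_DENSITY = 0.8


def stopword_density(tokens: list[str]) -> float:
    """Return the fraction of tokens found in the stand-in stopword set."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in sample_stopwords) / len(tokens)


def looks_like_prose(tokens: list[str]) -> bool:
    """Accept token lists whose density falls inside the assumed bounds."""
    return MIN_DENSITY <= stopword_density(tokens) <= MAX_DENSITY


prose = ["我们", "认为", "这个", "方案", "是", "可行", "的"]
menu = ["首页", "新闻", "体育", "财经", "登录"]  # navigation-style labels

print(looks_like_prose(prose), looks_like_prose(menu))
```

The navigation-style list contains no stopwords, so it falls below the assumed minimum, while the sentence lands inside the range. The upper bound is an added assumption meant to catch text made up almost entirely of stopwords or punctuation.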