Implementation:LLMBook zh LLMBook zh github io CleanerSubstitutePassageIDCard Clean Single Text
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Preprocessing, Privacy |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool, provided by the LLMBook repository, for masking Chinese ID card numbers in text passages.
Description
The CleanerSubstitutePassageIDCard class extends CleanerBase to perform regex-based substitution of Chinese national ID card numbers. It uses a predefined regex pattern (REGEX_IDCARD) to find 18-digit ID card numbers and replaces each match with a configurable mask token (the repl_text argument, default "**MASKED**IDCARD**").
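The substitution described above can be sketched in a self-contained form. This is a minimal illustration, not the repository's code: ID_CARD_PATTERN is a hypothetical stand-in for REGEX_IDCARD (whose actual definition is not shown here), and CleanerBaseStub and IDCardMaskerSketch are placeholder names. The pattern assumes the common 18-character layout of a region code, a birth date, a sequence code, and a check character that may be X.

```python
import re

# Hypothetical stand-in for REGEX_IDCARD; the repository's actual pattern may differ.
ID_CARD_PATTERN = re.compile(
    r"(?<!\d)"                   # do not start in the middle of a longer digit run
    r"[1-9]\d{5}"                # 6-digit region code
    r"(?:18|19|20)\d{2}"         # 4-digit birth year
    r"(?:0[1-9]|1[0-2])"         # birth month
    r"(?:0[1-9]|[12]\d|3[01])"   # birth day
    r"\d{3}"                     # 3-digit sequence code
    r"[\dXx]"                    # check character (digit or X)
    r"(?![\dXx])"                # do not stop in the middle of a longer run
)


class CleanerBaseStub:
    """Placeholder for CleanerBase so the sketch runs on its own."""

    def __init__(self):
        pass


class IDCardMaskerSketch(CleanerBaseStub):
    """Illustrative re-creation of the masking behaviour described above."""

    def __init__(self):
        super().__init__()

    def clean_single_text(self, text: str, repl_text: str = "**MASKED**IDCARD**") -> str:
        # A callable replacement keeps repl_text literal, so backslashes in a
        # custom mask token are not interpreted as regex escapes.
        return ID_CARD_PATTERN.sub(lambda _match: repl_text, text)


masker = IDCardMaskerSketch()
print(masker.clean_single_text("用户张三的身份证号是110101199001011234,请核实。"))
```

The lookaround assertions at both ends are a design choice of this sketch: they keep the pattern from masking an 18-character slice of a longer number, such as an order ID.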
Usage
Import this class when you need to mask Chinese ID card numbers in text during the privacy filtering stage of data preprocessing, after deduplication and before tokenization.
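To make the stage ordering concrete, the following sketch places masking between deduplication and tokenization. Every name in it (mask_ids, deduplicate, preprocess) is hypothetical, the simplified regex is an assumption rather than the repository's REGEX_IDCARD, and whitespace splitting stands in for a real tokenizer.

```python
import re
from typing import Callable, Iterable, List

# Simplified, assumed pattern: 17 digits plus a check character (digit or X).
_ID_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?![\dXx])")


def mask_ids(text: str, repl_text: str = "**MASKED**IDCARD**") -> str:
    """Stand-in for CleanerSubstitutePassageIDCard.clean_single_text."""
    return _ID_RE.sub(lambda _match: repl_text, text)


def deduplicate(passages: Iterable[str]) -> List[str]:
    """Exact-match deduplication that preserves first-seen order."""
    seen = set()
    unique = []
    for passage in passages:
        if passage not in seen:
            seen.add(passage)
            unique.append(passage)
    return unique


def preprocess(
    passages: Iterable[str],
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[List[str]]:
    """Deduplicate, then mask ID numbers, then tokenize."""
    unique = deduplicate(passages)
    masked = [mask_ids(p) for p in unique]
    return [tokenize(p) for p in masked]


docs = [
    "id 110101199001011234 ok",
    "id 110101199001011234 ok",
    "no pii",
]
print(preprocess(docs))
```

Masking before tokenization means the tokenizer only ever sees the mask token, never the raw ID number.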
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/4.3 隐私过滤.py
- Lines: 4-9
Signature
class CleanerSubstitutePassageIDCard(CleanerBase):
    def __init__(self):
        """Initializes the cleaner (calls super().__init__())."""

    def clean_single_text(self, text: str, repl_text: str = "**MASKED**IDCARD**") -> str:
        """
        Replaces Chinese ID card numbers in text with a mask token.

        Args:
            text: Input text containing potential PII.
            repl_text: Replacement string for masked IDs (default "**MASKED**IDCARD**").

        Returns:
            Text with Chinese ID card numbers replaced by the mask token.
        """
Import
from utils.cleaner.cleaner_base import CleanerBase
from privacy_filter import CleanerSubstitutePassageIDCard
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Text passage containing potential PII |
| repl_text | str | No | Replacement mask token (default "**MASKED**IDCARD**") |
Outputs
| Name | Type | Description |
|---|---|---|
| return | str | Text with Chinese ID card numbers replaced by mask token |
Usage Examples
from privacy_filter import CleanerSubstitutePassageIDCard
masker = CleanerSubstitutePassageIDCard()
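# Sample input, in English: "The ID card number of user Zhang San is 110101199001011234, please verify."
text = "用户张三的身份证号是110101199001011234,请核实。"
cleaned = masker.clean_single_text(text)
print(cleaned)
# Expected output: 用户张三的身份证号是**MASKED**IDCARD**,请核实。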