Implementation:Datajuicer Data juicer RemoveNonChineseCharacterlMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for removing non-Chinese characters from text samples.
Description
RemoveNonChineseCharacterlMapper is a mapper operator that removes from text samples every character outside the range U+4E00-U+9FA5, which lies within the CJK Unified Ideographs block. Three options preserve alphabetic letters, numbers, and/or punctuation alongside Chinese characters; the operator builds its regex character class dynamically from whichever options are enabled. As a batched operator, it applies that regex with re.sub to each text in a batch, deleting every character that falls outside the allowed set.
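The dynamic character-class construction described above can be sketched roughly as follows. This is an illustration, not the operator's actual source: `build_pattern` and `strip_non_chinese` are hypothetical helper names, and the punctuation set (period, comma, space) is an assumption made for the example.

```python
import re

# Range named in the description: U+4E00 through U+9FA5.
CHINESE_RANGE = "\u4e00-\u9fa5"


def build_pattern(keep_alphabet=True, keep_number=True, keep_punc=True):
    """Build a negated character class that matches every character to remove."""
    allowed = CHINESE_RANGE
    if keep_alphabet:
        allowed += "A-Za-z"
    if keep_number:
        allowed += "0-9"
    if keep_punc:
        # Assumption: "punctuation" here means period, comma, and space only.
        allowed += "., "
    # "[^...]" matches anything NOT in the allowed set.
    return re.compile("[^" + allowed + "]")


def strip_non_chinese(text, **flags):
    """Delete every character that the negated class matches."""
    return build_pattern(**flags).sub("", text)


sample = "你好, world 123!"
print(strip_non_chinese(sample))
print(strip_non_chinese(sample, keep_alphabet=False, keep_number=False, keep_punc=False))
```

With all three flags enabled, only the trailing "!" is dropped from the sample; with all three disabled, only the two Chinese characters remain.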
Usage
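Use when processing Chinese-language datasets where non-Chinese characters (stray Latin text, symbols, etc.) need to be stripped to produce pure Chinese-language training data. Because all three keep_* options default to True, letters, digits, and punctuation survive unless the corresponding option is set to false.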
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/remove_non_chinese_character_mapper.py
Signature
@OPERATORS.register_module("remove_non_chinese_character_mapper")
class RemoveNonChineseCharacterlMapper(Mapper):
    def __init__(self, keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs):
Import
from data_juicer.ops.mapper.remove_non_chinese_character_mapper import RemoveNonChineseCharacterlMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| keep_alphabet | bool | No | Whether to keep alphabetic characters (default: True) |
| keep_number | bool | No | Whether to keep numeric characters (default: True) |
| keep_punc | bool | No | Whether to keep punctuation characters (default: True) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with non-Chinese characters removed |
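To make the I/O contract concrete, here is a minimal sketch of a batched transform over a column-oriented samples dict. It assumes the text lives under a `"text"` key and uses a fixed, hypothetical pattern that keeps Chinese characters, letters, and digits; the standalone `process_batched` function below only mimics the shape of the contract and is not the operator's own method.

```python
import re

# Hypothetical fixed pattern: keep Chinese characters, letters, and digits.
PATTERN = re.compile("[^\u4e00-\u9fa5A-Za-z0-9]")


def process_batched(samples, text_key="text"):
    """Clean every text in the batch, update the dict, and return it."""
    samples[text_key] = [PATTERN.sub("", text) for text in samples[text_key]]
    return samples


# A batch is a dict of columns; each column is a list with one entry per sample.
batch = {"text": ["数据 cleaning!", "纯中文"], "id": [1, 2]}
result = process_batched(batch)
print(result["text"])
```

Only the text column changes; other columns such as the illustrative `id` pass through untouched, and the output keeps the same dict-of-lists shape as the input.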
Usage Examples
process:
  - remove_non_chinese_character_mapper:
      keep_alphabet: true
      keep_number: true
      keep_punc: false