Implementation:Datajuicer Data juicer RemoveNonChineseCharacterlMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for removing non-Chinese characters from text samples.

Description

RemoveNonChineseCharacterlMapper is a mapper operator that removes every character outside the range U+4E00-U+9FA5 (part of the CJK Unified Ideographs block) from text samples. Three flags, keep_alphabet, keep_number, and keep_punc, optionally preserve alphabetic letters, digits, and punctuation alongside Chinese characters; the operator dynamically builds a regex character class of allowed characters from these flags. The regex is applied via re.sub, which removes every character outside the allowed set, in a batched operation.

Usage

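Use this operator when processing Chinese-language datasets from which non-Chinese characters (stray Latin text, symbols, etc.) must be stripped. With the default settings, letters, digits, and punctuation are kept; set all three flags to false to produce pure Chinese-language training data.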

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/remove_non_chinese_character_mapper.py

Signature

@OPERATORS.register_module("remove_non_chinese_character_mapper")
class RemoveNonChineseCharacterlMapper(Mapper):
    def __init__(self, keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs):

Import

from data_juicer.ops.mapper.remove_non_chinese_character_mapper import RemoveNonChineseCharacterlMapper

I/O Contract

Inputs

Name	Type	Required	Description
keep_alphabet	bool	No	Whether to keep alphabetic characters (default: True)
keep_number	bool	No	Whether to keep numeric characters (default: True)
keep_punc	bool	No	Whether to keep punctuation characters (default: True)

Outputs

Name	Type	Description
samples	Dict	Transformed samples with non-Chinese characters removed

Usage Examples

process:
  - remove_non_chinese_character_mapper:
      keep_alphabet: true
      keep_number: true
      keep_punc: false

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
