Implementation:Datajuicer Data juicer RemoveNonChineseCharacterlMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for removing non-Chinese characters from text samples.

Description

RemoveNonChineseCharacterlMapper is a mapper operator that removes every character outside the range U+4E00-U+9FA5 from text samples. This range covers most of the CJK Unified Ideographs block, which extends to U+9FFF. Three options control whether alphabetic letters, numbers, and/or punctuation are preserved alongside Chinese characters; the operator builds a regex character class from these options and applies it via re.sub, removing all non-matching characters in a batched operation.
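The dynamic construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the operator's actual source: the helper names `build_pattern` and `strip_batch` are hypothetical, and `ILLUSTRATIVE_PUNCTUATION` is a small stand-in set, since the page does not list which punctuation marks the operator keeps.

```python
import re

# Assumption: a small illustrative punctuation set. The punctuation marks
# actually kept by the operator are not listed on this page.
ILLUSTRATIVE_PUNCTUATION = ".,!?。，！？、"


def build_pattern(keep_alphabet=True, keep_number=True, keep_punc=True):
    """Build a negated character class matching everything to be removed."""
    allowed = "\u4e00-\u9fa5"  # the Chinese character range named above
    if keep_alphabet:
        allowed += "A-Za-z"
    if keep_number:
        allowed += "0-9"
    if keep_punc:
        # Escape so characters such as '.' and '?' are taken literally.
        allowed += re.escape(ILLUSTRATIVE_PUNCTUATION)
    return re.compile(f"[^{allowed}]+")


def strip_batch(texts, pattern):
    """Remove every run of disallowed characters from each text."""
    return [pattern.sub("", text) for text in texts]


pattern = build_pattern(keep_alphabet=True, keep_number=True, keep_punc=False)
cleaned = strip_batch(["你好, world 123!"], pattern)
```

Note that whitespace is not in any of the kept sets in this sketch, so spaces are removed along with other symbols.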

Usage

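Use when processing Chinese-language datasets where non-Chinese characters (stray Latin text, symbols, etc.) need to be stripped to produce pure Chinese-language training data. Because keep_alphabet, keep_number, and keep_punc all default to True, set them to False to strip letters, digits, and punctuation as well.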

Code Reference

Source Location

Signature

@OPERATORS.register_module("remove_non_chinese_character_mapper")
class RemoveNonChineseCharacterlMapper(Mapper):
    def __init__(self, keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs):

Import

from data_juicer.ops.mapper.remove_non_chinese_character_mapper import RemoveNonChineseCharacterlMapper

I/O Contract

Inputs

Name | Type | Required | Description
keep_alphabet | bool | No | Whether to keep alphabetic characters (default: True)
keep_number | bool | No | Whether to keep numeric characters (default: True)
keep_punc | bool | No | Whether to keep punctuation characters (default: True)

Outputs

Name | Type | Description
samples | Dict | Transformed samples with non-Chinese characters removed
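As a rough illustration of the batched contract, the sketch below takes a samples dict whose text field holds a list of strings and returns the same dict with that field cleaned. The function name `process_batch` and the `"text"` key are assumptions made for this example; the page does not state which key the operator reads.

```python
import re
from typing import Dict, List

TEXT_KEY = "text"  # assumption: name of the field holding the text


def process_batch(samples: Dict[str, List[str]],
                  pattern: "re.Pattern[str]") -> Dict[str, List[str]]:
    """Clean every text in the batch; other fields are left untouched."""
    samples[TEXT_KEY] = [pattern.sub("", text) for text in samples[TEXT_KEY]]
    return samples


# Pattern for the strictest setting: keep only U+4E00-U+9FA5.
chinese_only = re.compile("[^\u4e00-\u9fa5]+")

batch = {"text": ["你好 world", "abc"], "source": ["a.txt", "b.txt"]}
result = process_batch(batch, chinese_only)
```

A sample that contains no Chinese characters becomes an empty string rather than being dropped, so the batch keeps its length.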

Usage Examples

process:
  - remove_non_chinese_character_mapper:
      keep_alphabet: true
      keep_number: true
      keep_punc: false
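To make the effect of this configuration concrete, the sketch below applies an equivalent pattern directly with re.sub. It assumes that keeping letters and digits corresponds to adding A-Za-z and 0-9 to the character class, which is an inference from the description rather than a copy of the operator's code.

```python
import re

# Assumed equivalent of keep_alphabet: true, keep_number: true, keep_punc: false
pattern = re.compile("[^\u4e00-\u9fa5A-Za-z0-9]+")

sample = "数据-Juicer 2.0：你好！"
cleaned = pattern.sub("", sample)
```

Here the hyphen, the space, the period, and the full-width punctuation are removed, while the Chinese characters, Latin letters, and digits remain.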
