Implementation:Datajuicer Data juicer NlpcdaZhMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for augmenting Chinese text with the nlpcda library.

Description

NlpcdaZhMapper is a mapper operator that augments Chinese text samples with methods from the nlpcda library. It supports five methods: replacing words with similar words, replacing characters with homophones, deleting random characters, swapping random characters, and replacing numbers with equivalent representations (e.g., Arabic to Chinese numerals). The enabled methods can be combined in a sequence or applied independently. aug_num sets the number of augmented samples to generate, and keep_original_sample controls whether the original sample is retained. The operator uses HiddenPrints to suppress verbose library output and operates in batched mode. Enabling 1-3 methods at a time is recommended to preserve semantics.
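The difference between sequential and independent application can be illustrated with a minimal sketch. It does not call nlpcda or Data-Juicer: swap_first_two, digits_to_chinese, and augment are hypothetical, deterministic stand-ins, and each toy method returns a single variant, whereas the real methods are randomized and produce up to aug_num variants.

```python
from typing import Callable, List

# Hypothetical type alias: an augmenter maps one text to one variant.
Augmenter = Callable[[str], str]


def swap_first_two(text: str) -> str:
    """Toy stand-in for a character swap: exchange the first two characters."""
    if len(text) < 2:
        return text
    return text[1] + text[0] + text[2:]


def digits_to_chinese(text: str) -> str:
    """Toy stand-in for number replacement: map Arabic digits to Chinese numerals."""
    return text.translate(str.maketrans("0123456789", "零一二三四五六七八九"))


def augment(text: str, methods: List[Augmenter], sequential: bool) -> List[str]:
    """Apply methods as a chain (sequential) or one by one (independent)."""
    if sequential:
        # Each method receives the output of the previous one.
        result = text
        for method in methods:
            result = method(result)
        return [result] if result != text else []
    # Each method sees the original text and contributes its own variant.
    variants = []
    for method in methods:
        variant = method(text)
        if variant != text:
            variants.append(variant)
    return variants


methods = [swap_first_two, digits_to_chinese]
chained = augment("共有5种方法", methods, sequential=True)
separate = augment("共有5种方法", methods, sequential=False)
```

In this sketch the sequential run yields one text carrying both perturbations, while the independent run yields one text per method. Chaining perturbations compounds their effect on the text, which is one way to read the recommendation to enable only a few methods at a time.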

Usage

Use when you need Chinese-specific text augmentation through linguistically appropriate perturbations such as homophones and similar words. It complements the English-focused NlpaugEnMapper when augmenting multilingual datasets.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/nlpcda_zh_mapper.py

Signature

@OPERATORS.register_module("nlpcda_zh_mapper")
class NlpcdaZhMapper(Mapper):
    def __init__(self,
                 sequential: bool = False,
                 aug_num: PositiveInt = 1,
                 keep_original_sample: bool = True,
                 replace_similar_word: bool = False,
                 replace_homophone_char: bool = False,
                 delete_random_char: bool = False,
                 swap_random_char: bool = False,
                 replace_equivalent_num: bool = False,
                 *args, **kwargs):

Import

from data_juicer.ops.mapper.nlpcda_zh_mapper import NlpcdaZhMapper

I/O Contract

Inputs

Name	Type	Required	Description
sequential	bool	No	Whether to combine all methods in a sequence; defaults to False
aug_num	PositiveInt	No	Number of augmented samples to generate, defaults to 1
keep_original_sample	bool	No	Whether to keep the original sample, defaults to True
replace_similar_word	bool	No	Replace random words with similar words, defaults to False
replace_homophone_char	bool	No	Replace random characters with homophones, defaults to False
delete_random_char	bool	No	Delete random characters from text, defaults to False
swap_random_char	bool	No	Swap random contiguous characters, defaults to False
replace_equivalent_num	bool	No	Replace numbers with equivalent representations (only numbers), defaults to False

Outputs

Name	Type	Description
samples	Dict	Transformed samples with augmented Chinese text entries added
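The output contract can be pictured with a small sketch of a batched sample, a dict mapping field names to lists. The helper expand_batch is hypothetical and not part of Data-Juicer; it assumes a batch holding one original sample, a text field named "text", and that non-text fields are replicated so every list stays the same length.

```python
from copy import deepcopy
from typing import Any, Dict, List


def expand_batch(
    samples: Dict[str, List[Any]],
    aug_texts: List[str],
    keep_original_sample: bool = True,
    text_key: str = "text",  # assumed field name for the text column
) -> Dict[str, List[Any]]:
    """Add augmented texts to a one-sample batch and align the other fields."""
    out = deepcopy(samples)
    if keep_original_sample:
        out[text_key] = out[text_key] + list(aug_texts)
    else:
        out[text_key] = list(aug_texts)
    # Replicate every non-text field so all columns have equal length.
    n = len(out[text_key])
    for key in out:
        if key != text_key:
            out[key] = out[key][:1] * n
    return out


batch = {"text": ["共有5种方法"], "meta": ["doc-1"]}
kept = expand_batch(batch, ["共有五种方法"], keep_original_sample=True)
dropped = expand_batch(batch, ["共有五种方法"], keep_original_sample=False)
```

With keep_original_sample set to true the original text stays first and the augmented entries follow it; with it set to false only the augmented entries remain.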

Usage Examples

process:
  - nlpcda_zh_mapper:
      sequential: false
      aug_num: 1
      keep_original_sample: true
      replace_similar_word: true
      replace_homophone_char: true
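
The recipe entry above enables two of the five methods, which falls inside the recommended range of 1-3. A pre-flight check along these lines could flag recipes that enable too many; this is only a sketch, and enabled_methods and within_recommended_range are hypothetical helpers, not part of Data-Juicer. The flag names are taken from the signature above.

```python
from typing import Any, Dict, List

# The five method switches listed in the operator signature.
METHOD_FLAGS = (
    "replace_similar_word",
    "replace_homophone_char",
    "delete_random_char",
    "swap_random_char",
    "replace_equivalent_num",
)


def enabled_methods(op_config: Dict[str, Any]) -> List[str]:
    """Return the method flags set to a truthy value; missing flags count as off."""
    return [flag for flag in METHOD_FLAGS if op_config.get(flag, False)]


def within_recommended_range(op_config: Dict[str, Any], low: int = 1, high: int = 3) -> bool:
    """Check that the number of enabled methods lies in the recommended range."""
    return low <= len(enabled_methods(op_config)) <= high


# Same settings as the YAML entry, written as a Python dict.
config = {
    "sequential": False,
    "aug_num": 1,
    "keep_original_sample": True,
    "replace_similar_word": True,
    "replace_homophone_char": True,
}
active = enabled_methods(config)
ok = within_recommended_range(config)
```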

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
