
Implementation:Datajuicer Data juicer NlpcdaZhMapper

From Leeroopedia
Knowledge Sources
Domains: Data_Processing, Mapping
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for augmenting Chinese text using the nlpcda library.

Description

NlpcdaZhMapper is a mapper operator that augments Chinese text samples using augmentation methods from the nlpcda library. It supports five methods: replacing similar words, replacing characters with homophones, deleting random characters, swapping random characters, and replacing numbers with equivalent representations (e.g., Arabic to Chinese numerals). The enabled methods can be applied sequentially, chained into one pipeline, or independently of each other. The aug_num parameter sets the number of augmented samples to generate, and keep_original_sample controls whether the original sample is retained in the output. The operator uses HiddenPrints to suppress verbose library output and operates in batched mode. Enabling 1-3 methods at a time is recommended to preserve the semantics of the original text.
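
The difference between sequential and independent application can be sketched in plain Python. The two toy augmenters (`delete_first_char`, `swap_first_two_chars`) and the helper `augment` are hypothetical stand-ins written for illustration; they are not nlpcda calls and not Data-Juicer's actual implementation. They are deterministic so the result is easy to follow.

```python
from typing import Callable, List

Augmenter = Callable[[str], str]


def delete_first_char(text: str) -> str:
    """Toy stand-in for a 'delete random characters' method."""
    return text[1:]


def swap_first_two_chars(text: str) -> str:
    """Toy stand-in for a 'swap random characters' method."""
    if len(text) < 2:
        return text
    return text[1] + text[0] + text[2:]


def augment(text: str, methods: List[Augmenter], sequential: bool) -> List[str]:
    """Return augmented variants of `text` (illustrative helper).

    sequential=True chains all methods and yields one combined variant.
    sequential=False applies each method to the original text on its own
    and yields one variant per method.
    """
    if not methods:
        return []
    if sequential:
        result = text
        for method in methods:
            result = method(result)
        return [result]
    return [method(text) for method in methods]


methods = [delete_first_char, swap_first_two_chars]
text = "数据增强方法"

chained = augment(text, methods, sequential=True)
independent = augment(text, methods, sequential=False)

print(chained)      # one variant carrying both edits
print(independent)  # one variant per method
```

Chaining more methods compounds the changes in a single variant, which is one way to see why a small number of methods is the safer choice for preserving meaning.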

Usage

Use this operator when you need Chinese-specific text augmentation through linguistically appropriate perturbations such as homophone and similar-word replacement. It complements the English-focused NlpaugEnMapper when augmenting multilingual datasets.

Code Reference

Source Location

Signature

@OPERATORS.register_module("nlpcda_zh_mapper")
class NlpcdaZhMapper(Mapper):
    def __init__(self,
                 sequential: bool = False,
                 aug_num: PositiveInt = 1,
                 keep_original_sample: bool = True,
                 replace_similar_word: bool = False,
                 replace_homophone_char: bool = False,
                 delete_random_char: bool = False,
                 swap_random_char: bool = False,
                 replace_equivalent_num: bool = False,
                 *args, **kwargs):

Import

from data_juicer.ops.mapper.nlpcda_zh_mapper import NlpcdaZhMapper

I/O Contract

Inputs

Name | Type | Required | Description
sequential | bool | No | Whether to chain all enabled methods into a single sequence, defaults to False
aug_num | PositiveInt | No | Number of augmented samples to generate, defaults to 1
keep_original_sample | bool | No | Whether to keep the original sample, defaults to True
replace_similar_word | bool | No | Replace random words with similar words, defaults to False
replace_homophone_char | bool | No | Replace random characters with homophones, defaults to False
delete_random_char | bool | No | Delete random characters from the text, defaults to False
swap_random_char | bool | No | Swap random contiguous characters, defaults to False
replace_equivalent_num | bool | No | Replace numbers with equivalent representations (applies to numbers only), defaults to False

Outputs

Name | Type | Description
samples | Dict | Transformed samples with augmented Chinese text entries added
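
The output shape can be pictured with a small sketch. It assumes a column-oriented batch (a dict mapping field names to lists) holding a single sample, a `text` field, and a hypothetical `meta` field, and it assumes non-text fields are replicated so every output row stays complete. The helper `add_augmented` and the hand-written augmented string are illustrative only, not the operator's actual code or real nlpcda output.

```python
from copy import deepcopy
from typing import Any, Dict, List


def add_augmented(
    samples: Dict[str, List[Any]],
    aug_texts: List[str],
    keep_original_sample: bool = True,
    text_key: str = "text",
) -> Dict[str, List[Any]]:
    """Merge augmented texts into a single-sample, column-oriented batch."""
    if len(samples[text_key]) != 1:
        raise ValueError("this sketch expects a batch with exactly one sample")

    result = deepcopy(samples)
    if keep_original_sample:
        # Original text first, augmented texts appended after it.
        result[text_key] = result[text_key] + list(aug_texts)
    else:
        # Only the augmented texts survive.
        result[text_key] = list(aug_texts)

    # Replicate every other field so all columns have the same length.
    n_rows = len(result[text_key])
    for key in result:
        if key != text_key:
            result[key] = [deepcopy(result[key][0]) for _ in range(n_rows)]
    return result


batch = {
    "text": ["这里一共有5种不同的数据增强方法"],
    "meta": [{"source": "demo"}],  # hypothetical extra field
}
# Hand-written stand-in for an "equivalent number" augmentation result.
aug_texts = ["这里一共有五种不同的数据增强方法"]

kept = add_augmented(batch, aug_texts, keep_original_sample=True)
replaced = add_augmented(batch, aug_texts, keep_original_sample=False)

print(len(kept["text"]), len(kept["meta"]))
print(len(replaced["text"]), len(replaced["meta"]))
```

Under these assumptions, keeping the original yields the original row plus one row per augmented text, while dropping it yields only the augmented rows.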

Usage Examples

process:
  - nlpcda_zh_mapper:
      sequential: false
      aug_num: 1
      keep_original_sample: true
      replace_similar_word: true
      replace_homophone_char: true
