Implementation:Datajuicer Data juicer NlpcdaZhMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for augmenting Chinese text using the nlpcda library.
Description
NlpcdaZhMapper is a mapper operator that augments Chinese text samples using augmentation methods from the nlpcda library. It supports five methods: replacing words with similar words, replacing characters with homophones, deleting random characters, swapping random characters, and replacing numbers with equivalent representations (e.g., Arabic to Chinese numerals). The enabled methods can be chained sequentially or applied independently; aug_num controls how many augmented samples are generated, and keep_original_sample controls whether the original sample is retained. The operator uses HiddenPrints to suppress verbose library output and operates in batched mode. Enabling one to three methods at a time is recommended to preserve semantics.
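The interaction between sequential, aug_num, and keep_original_sample can be illustrated with a toy sketch. It does not call nlpcda or Data-Juicer: the two perturbation functions and the `augment` helper are hypothetical stand-ins, and the output counts (aug_num results in sequential mode, aug_num per method otherwise) are an assumption about the operator's behavior that should be checked against the source file.

```python
import random

# Hypothetical stand-ins for two nlpcda-style perturbations (not the real library calls).
def delete_random_char(text: str, rng: random.Random) -> str:
    """Drop one randomly chosen character."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def swap_random_char(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def augment(text, methods, sequential=False, aug_num=1,
            keep_original_sample=True, seed=0):
    """Toy model of the operator's options (assumed semantics, see lead-in)."""
    rng = random.Random(seed)
    results = [text] if keep_original_sample else []
    if sequential:
        # Chain every enabled method, producing aug_num outputs in total.
        for _ in range(aug_num):
            out = text
            for method in methods:
                out = method(out, rng)
            results.append(out)
    else:
        # Apply each method on its own, producing aug_num outputs per method.
        for method in methods:
            for _ in range(aug_num):
                results.append(method(text, rng))
    return results

source_text = "这里一共有5种不同的数据增强方法"
methods = [delete_random_char, swap_random_char]
independent = augment(source_text, methods, sequential=False, aug_num=2)
chained = augment(source_text, methods, sequential=True, aug_num=2)
print(len(independent), len(chained))  # prints: 5 3
```

With two methods and aug_num set to 2, the independent mode in this sketch yields four augmented strings plus the original, while the chained mode yields two plus the original. This is one reason the number of enabled methods affects output volume as well as semantic drift.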
Usage
Use this operator when you need Chinese-specific text augmentation through linguistically appropriate perturbations such as homophones and similar words. It complements the English-focused NlpaugEnMapper when augmenting multilingual datasets.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/nlpcda_zh_mapper.py
Signature
@OPERATORS.register_module("nlpcda_zh_mapper")
class NlpcdaZhMapper(Mapper):
    def __init__(self,
                 sequential: bool = False,
                 aug_num: PositiveInt = 1,
                 keep_original_sample: bool = True,
                 replace_similar_word: bool = False,
                 replace_homophone_char: bool = False,
                 delete_random_char: bool = False,
                 swap_random_char: bool = False,
                 replace_equivalent_num: bool = False,
                 *args, **kwargs):
Import
from data_juicer.ops.mapper.nlpcda_zh_mapper import NlpcdaZhMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| sequential | bool | No | Whether to chain all enabled methods in sequence, defaults to False |
| aug_num | PositiveInt | No | Number of augmented samples to generate, defaults to 1 |
| keep_original_sample | bool | No | Whether to keep the original sample, defaults to True |
| replace_similar_word | bool | No | Replace random words with similar words, defaults to False |
| replace_homophone_char | bool | No | Replace random characters with homophones, defaults to False |
| delete_random_char | bool | No | Delete random characters from text, defaults to False |
| swap_random_char | bool | No | Swap random contiguous characters, defaults to False |
| replace_equivalent_num | bool | No | Replace numbers with equivalent representations (applies to numbers only), defaults to False |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with augmented Chinese text entries added |
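Because the operator runs in batched mode, it can help to picture samples as a dict of equal-length lists. The sketch below is a hypothetical illustration of that shape only: `expand_batch` and `fake_augment` are invented names, the `text` key and the copying of other fields alongside each new text are assumptions, and no Data-Juicer code is called.

```python
from typing import Callable, Dict, List

def expand_batch(samples: Dict[str, List],
                 augment_fn: Callable[[str], List[str]],
                 text_key: str = "text",
                 keep_original_sample: bool = True) -> Dict[str, List]:
    """Return a new columnar batch with augmented texts added as extra rows."""
    out = {key: [] for key in samples}
    for row in range(len(samples[text_key])):
        original = samples[text_key][row]
        texts = ([original] if keep_original_sample else []) + augment_fn(original)
        for text in texts:
            for key in samples:
                # Non-text fields are copied so every column keeps the same length.
                out[key].append(text if key == text_key else samples[key][row])
    return out

def fake_augment(text: str) -> List[str]:
    # Placeholder augmentation: reverses the string instead of calling nlpcda.
    return [text[::-1]]

batch = {"text": ["今天天气很好", "数据增强"], "meta": ["a", "b"]}
result = expand_batch(batch, fake_augment)
print(result["text"])
print(result["meta"])
```

Under these assumptions each input row expands into one row per retained or generated text, so downstream operators still see aligned columns.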
Usage Examples
process:
  - nlpcda_zh_mapper:
      sequential: false
      aug_num: 1
      keep_original_sample: true
      replace_similar_word: true
      replace_homophone_char: true
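As a quick sanity check on a config like the one above, the enabled methods can be counted against the recommendation of one to three methods from the description. This is a hypothetical helper, not part of Data-Juicer; the dict mirrors the YAML by hand rather than being loaded from a file.

```python
# Flag names taken from the operator signature.
METHOD_FLAGS = (
    "replace_similar_word",
    "replace_homophone_char",
    "delete_random_char",
    "swap_random_char",
    "replace_equivalent_num",
)

def enabled_methods(op_config: dict) -> list:
    """List the augmentation flags set to a true value, in signature order."""
    return [flag for flag in METHOD_FLAGS if op_config.get(flag, False)]

def within_recommendation(op_config: dict) -> bool:
    """True when between one and three methods are enabled."""
    return 1 <= len(enabled_methods(op_config)) <= 3

# Hand-written mirror of the YAML entry for nlpcda_zh_mapper.
op_config = {
    "sequential": False,
    "aug_num": 1,
    "keep_original_sample": True,
    "replace_similar_word": True,
    "replace_homophone_char": True,
}
print(enabled_methods(op_config))
print(within_recommendation(op_config))
```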