Implementation:Datajuicer Data juicer NlpaugEnMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool provided by Data-Juicer for augmenting English text with the nlpaug library.
Description
NlpaugEnMapper is a mapper operator that augments English text samples with word-level and character-level methods from the nlpaug library. It supports nine methods: four word-level (delete random word, swap random word, spelling error, split random word) and five character-level (keyboard error, OCR error, delete random char, swap random char, insert random char). With sequential set to True, the enabled methods are combined in a sequence; otherwise each enabled method is applied independently. The aug_num parameter sets the number of augmented samples to generate, and keep_original_sample determines whether the original sample is retained in the output. The operator runs in batched mode. Enabling only 1-3 methods at a time is recommended so that the augmented text preserves the semantics of the original.
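The difference between sequential and independent application can be sketched in plain Python. The two perturbation functions below are toy stand-ins written for this illustration, not nlpaug's augmenters, and the `augment` helper is hypothetical. The sample counts it produces (aug_num variants in sequential mode, aug_num per enabled method otherwise) are an assumption about how the operator behaves, not something this page states.

```python
import random

def swap_random_char(text, rng):
    """Toy stand-in: swap one random pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def delete_random_word(text, rng):
    """Toy stand-in: delete one random whitespace-separated word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)

def augment(text, methods, sequential=False, aug_num=1, seed=0):
    """Hypothetical helper mirroring the sequential / independent modes."""
    rng = random.Random(seed)
    results = []
    if sequential:
        # Chain every enabled method over the same text, aug_num times.
        for _ in range(aug_num):
            out = text
            for method in methods:
                out = method(out, rng)
            results.append(out)
    else:
        # Each enabled method produces its own aug_num variants (assumed).
        for method in methods:
            for _ in range(aug_num):
                results.append(method(text, rng))
    return results

methods = [delete_random_word, swap_random_char]
independent = augment("the quick brown fox", methods, sequential=False, aug_num=2)
chained = augment("the quick brown fox", methods, sequential=True, aug_num=2)
print(len(independent), len(chained))  # 4 2
```

In the chained case every output has lost a word and had two characters swapped, which shows why stacking many methods can drift away from the original meaning.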
Usage
Use this operator to create variations of English training text through controlled perturbations, improving model robustness to typos, OCR errors, and natural language variation.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/nlpaug_en_mapper.py
Signature
@OPERATORS.register_module("nlpaug_en_mapper")
class NlpaugEnMapper(Mapper):
    def __init__(self,
                 sequential: bool = False,
                 aug_num: PositiveInt = 1,
                 keep_original_sample: bool = True,
                 delete_random_word: bool = False,
                 swap_random_word: bool = False,
                 spelling_error_word: bool = False,
                 split_random_word: bool = False,
                 keyboard_error_char: bool = False,
                 ocr_error_char: bool = False,
                 delete_random_char: bool = False,
                 swap_random_char: bool = False,
                 insert_random_char: bool = False,
                 *args, **kwargs):
Import
from data_juicer.ops.mapper.nlpaug_en_mapper import NlpaugEnMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| sequential | bool | No | Whether to combine all methods in a sequence; defaults to False |
| aug_num | PositiveInt | No | Number of augmented samples to generate, defaults to 1 |
| keep_original_sample | bool | No | Whether to keep the original sample, defaults to True |
| delete_random_word | bool | No | Delete random words from text, defaults to False |
| swap_random_word | bool | No | Swap random contiguous words, defaults to False |
| spelling_error_word | bool | No | Simulate spelling errors for words, defaults to False |
| split_random_word | bool | No | Split words randomly with whitespace, defaults to False |
| keyboard_error_char | bool | No | Simulate keyboard errors for characters, defaults to False |
| ocr_error_char | bool | No | Simulate OCR errors for characters, defaults to False |
| delete_random_char | bool | No | Delete random characters from text, defaults to False |
| swap_random_char | bool | No | Swap random contiguous characters, defaults to False |
| insert_random_char | bool | No | Insert random characters into text, defaults to False |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with augmented text entries added |
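As a rough illustration of the batched contract, the sketch below treats a batch as a dict of equal-length lists and appends augmented text entries while copying the other fields alongside them. The function name `process_batched`, the `text_key` argument, the `meta` field, and the replication of non-text fields are assumptions made for this example. The augmenter is a placeholder lambda, not an nlpaug call.

```python
def process_batched(samples, augment_fn, keep_original_sample=True, text_key="text"):
    """Toy batched mapper: `samples` maps field names to equal-length lists."""
    out = {key: [] for key in samples}
    for i, original in enumerate(samples[text_key]):
        texts = augment_fn(original)
        if keep_original_sample:
            # Keep the original row ahead of its augmented variants.
            texts = [original] + texts
        for text in texts:
            for key in samples:
                # Non-text fields are copied for every emitted row (assumed).
                out[key].append(text if key == text_key else samples[key][i])
    return out

batch = {"text": ["hello world"], "meta": [{"src": "demo"}]}
result = process_batched(batch, lambda t: [t.upper(), t[::-1]])
print(result["text"])  # ['hello world', 'HELLO WORLD', 'dlrow olleh']
```

With keep_original_sample set to False, the same call would return only the two augmented rows.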
Usage Examples
process:
  - nlpaug_en_mapper:
      sequential: false
      aug_num: 1
      keep_original_sample: true
      spelling_error_word: true
      keyboard_error_char: true
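Before running a recipe, it can help to check how many augmentation methods an operator config enables, given the recommendation to use 1-3 at a time. The helper below is a hypothetical pre-flight check written for this page and is not part of Data-Juicer. It works on the operator's parameters as a plain dict, with flag names taken from the signature above.

```python
METHOD_FLAGS = (
    "delete_random_word", "swap_random_word", "spelling_error_word",
    "split_random_word", "keyboard_error_char", "ocr_error_char",
    "delete_random_char", "swap_random_char", "insert_random_char",
)
OTHER_PARAMS = ("sequential", "aug_num", "keep_original_sample")

def check_config(op_config, max_methods=3):
    """Return a list of warnings for an nlpaug_en_mapper parameter dict."""
    warnings = []
    unknown = sorted(set(op_config) - set(METHOD_FLAGS) - set(OTHER_PARAMS))
    if unknown:
        warnings.append("unknown parameters: " + ", ".join(unknown))
    enabled = [name for name in METHOD_FLAGS if op_config.get(name, False)]
    if not enabled:
        warnings.append("no augmentation method enabled")
    elif len(enabled) > max_methods:
        warnings.append(f"{len(enabled)} methods enabled; 1-{max_methods} recommended")
    return warnings

# Same parameters as the YAML recipe above.
config = {
    "sequential": False,
    "aug_num": 1,
    "keep_original_sample": True,
    "spelling_error_word": True,
    "keyboard_error_char": True,
}
print(check_config(config))  # []
```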