Implementation:Datajuicer Data juicer NlpaugEnMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for augmenting English text with the nlpaug library.

Description

NlpaugEnMapper is a mapper operator that augments English text samples using word-level and character-level augmentation methods from the nlpaug library. It supports nine methods: four word-level (delete random word, swap random word, spelling error, split random word) and five character-level (keyboard error, OCR error, delete random char, swap random char, insert random char). Each method is enabled by its own boolean flag. When sequential is True, the enabled methods are combined into a single sequence; when it is False, each enabled method is applied independently. The aug_num parameter controls the number of augmented samples generated, and keep_original_sample determines whether the original samples are retained in the output. The operator runs in batched mode. Enabling only 1-3 methods at a time is recommended to preserve the semantics of the text.
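
The difference between sequential and independent application can be illustrated without nlpaug. The sketch below is a toy model, not the operator's implementation: drop_first_word, swap_first_two_chars, and augment are hypothetical stand-ins, and the output counts (aug_num results in sequential mode, aug_num results per enabled method otherwise) are an assumption about how the two modes differ. Real nlpaug augmenters are randomized, whereas these stand-ins are deterministic so the result is easy to follow.

```python
from typing import Callable, List

Augmenter = Callable[[str], str]


def drop_first_word(text: str) -> str:
    """Toy word-level perturbation (hypothetical): remove the first word."""
    words = text.split()
    return " ".join(words[1:]) if len(words) > 1 else text


def swap_first_two_chars(text: str) -> str:
    """Toy character-level perturbation (hypothetical): swap the first two characters."""
    return text[1] + text[0] + text[2:] if len(text) > 1 else text


def augment(text: str, methods: List[Augmenter],
            sequential: bool, aug_num: int = 1) -> List[str]:
    """Assumed semantics: chain all methods, or run each one on its own."""
    if not methods:
        return []
    if sequential:
        outputs = []
        for _ in range(aug_num):
            out = text
            for method in methods:  # each method sees the previous method's output
                out = method(out)
            outputs.append(out)
        return outputs
    # Independent mode: every method perturbs the original text separately.
    return [method(text) for method in methods for _ in range(aug_num)]


text = "the quick brown fox"
methods = [drop_first_word, swap_first_two_chars]

chained = augment(text, methods, sequential=True)
separate = augment(text, methods, sequential=False)
print(chained)   # one text carrying both perturbations
print(separate)  # one text per method, each with a single perturbation
```

Under these assumptions, sequential mode compounds perturbations in a single output, which is one reason to keep the number of enabled methods small.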

Usage

Use this operator to generate variations of English training text through controlled perturbations, improving model robustness to typos, OCR errors, and natural language variations.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/nlpaug_en_mapper.py

Signature

@OPERATORS.register_module("nlpaug_en_mapper")
class NlpaugEnMapper(Mapper):
    def __init__(self,
                 sequential: bool = False,
                 aug_num: PositiveInt = 1,
                 keep_original_sample: bool = True,
                 delete_random_word: bool = False,
                 swap_random_word: bool = False,
                 spelling_error_word: bool = False,
                 split_random_word: bool = False,
                 keyboard_error_char: bool = False,
                 ocr_error_char: bool = False,
                 delete_random_char: bool = False,
                 swap_random_char: bool = False,
                 insert_random_char: bool = False,
                 *args, **kwargs):

Import

from data_juicer.ops.mapper.nlpaug_en_mapper import NlpaugEnMapper

I/O Contract

Inputs

Name	Type	Required	Description
sequential	bool	No	Whether to combine all enabled methods into a single sequence, defaults to False
aug_num	PositiveInt	No	Number of augmented samples to generate, defaults to 1
keep_original_sample	bool	No	Whether to keep the original sample, defaults to True
delete_random_word	bool	No	Delete random words from text, defaults to False
swap_random_word	bool	No	Swap random contiguous words, defaults to False
spelling_error_word	bool	No	Simulate spelling errors for words, defaults to False
split_random_word	bool	No	Split words randomly with whitespace, defaults to False
keyboard_error_char	bool	No	Simulate keyboard errors for characters, defaults to False
ocr_error_char	bool	No	Simulate OCR errors for characters, defaults to False
delete_random_char	bool	No	Delete random characters from text, defaults to False
swap_random_char	bool	No	Swap random contiguous characters, defaults to False
insert_random_char	bool	No	Insert random characters into text, defaults to False

Outputs

Name	Type	Description
samples	Dict	Transformed samples with augmented text entries added
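
Because the operator runs in batched mode, samples can be pictured as a dict of lists. The following is a minimal sketch of how augmented texts might be merged into such a batch. It assumes a single-sample input batch, a text field named "text", and that non-text fields are repeated once per output text; merge_augmented is a hypothetical helper and none of this is confirmed against the operator's source.

```python
from copy import deepcopy
from typing import Any, Dict, List


def merge_augmented(samples: Dict[str, List[Any]],
                    aug_texts: List[str],
                    text_key: str = "text",
                    keep_original_sample: bool = True) -> Dict[str, List[Any]]:
    """Hypothetical helper: add augmented texts to a one-sample batch."""
    result = deepcopy(samples)  # leave the input batch untouched
    if keep_original_sample:
        result[text_key] = result[text_key] + list(aug_texts)
    else:
        result[text_key] = list(aug_texts)
    # Assumption: every other field is repeated so all columns stay aligned.
    n_texts = len(result[text_key])
    for key in result:
        if key != text_key:
            result[key] = result[key] * n_texts
    return result


batch = {"text": ["hello world"], "meta": [{"source": "demo"}]}
kept = merge_augmented(batch, ["helo world"])
dropped = merge_augmented(batch, ["helo world"], keep_original_sample=False)
print(kept)
print(dropped)
```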

Usage Examples

process:
  - nlpaug_en_mapper:
      sequential: false
      aug_num: 1
      keep_original_sample: true
      spelling_error_word: true
      keyboard_error_char: true
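
The same configuration can be expressed in Python. The sketch below mirrors the YAML above as keyword arguments and checks it against the 1-3 method recommendation from the Description. METHOD_FLAGS, enabled_methods, within_recommended_range, and build_mapper are hypothetical names introduced for illustration; only the import path and the constructor parameters come from this page. The import is deferred so the helper functions run even where Data-Juicer is not installed.

```python
from typing import Any, Dict, List

# The nine method flags from the constructor signature.
METHOD_FLAGS = (
    "delete_random_word", "swap_random_word", "spelling_error_word",
    "split_random_word", "keyboard_error_char", "ocr_error_char",
    "delete_random_char", "swap_random_char", "insert_random_char",
)

# Mirrors the YAML example above.
config: Dict[str, Any] = {
    "sequential": False,
    "aug_num": 1,
    "keep_original_sample": True,
    "spelling_error_word": True,
    "keyboard_error_char": True,
}


def enabled_methods(cfg: Dict[str, Any]) -> List[str]:
    """Return the augmentation flags that are switched on in cfg."""
    return [name for name in METHOD_FLAGS if cfg.get(name, False)]


def within_recommended_range(cfg: Dict[str, Any]) -> bool:
    """True when 1-3 methods are enabled, as the Description recommends."""
    return 1 <= len(enabled_methods(cfg)) <= 3


def build_mapper(cfg: Dict[str, Any]):
    """Instantiate the operator; requires Data-Juicer and nlpaug to be installed."""
    from data_juicer.ops.mapper.nlpaug_en_mapper import NlpaugEnMapper
    return NlpaugEnMapper(**cfg)


print(enabled_methods(config))
print(within_recommended_range(config))
```

Calling build_mapper(config) would then construct the operator in an environment where Data-Juicer is available.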

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
