Implementation:Datajuicer Data juicer NlpaugEnMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for augmenting English text using the nlpaug library.

Description

NlpaugEnMapper is a mapper operator that augments English text samples using word-level and character-level augmentation methods from the nlpaug library. It supports nine methods: four word-level (delete random word, swap random word, spelling error, split random word) and five character-level (keyboard error, OCR error, delete random char, swap random char, insert random char). The sequential parameter selects whether the enabled methods are chained in sequence or applied independently. The aug_num parameter controls how many augmented samples are generated, and keep_original_sample determines whether the original samples are retained in the output. The operator runs in batched mode. Enabling only 1-3 methods at a time is recommended so that the augmented text preserves the semantics of the original.
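The difference between sequential and independent application can be illustrated with a minimal pure-Python sketch. The two perturbation functions and the `augment` helper below are hypothetical stand-ins written for this illustration; they are not the nlpaug augmenters or the Data-Juicer implementation. The sample counts (aug_num outputs when chained, aug_num outputs per method otherwise) are an assumption of the sketch.

```python
import random


def delete_random_word(text, rng):
    """Toy stand-in for a word-level method: drop one randomly chosen word."""
    words = text.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)


def swap_random_char(text, rng):
    """Toy stand-in for a character-level method: swap two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(text, methods, sequential, aug_num=1, seed=0):
    """Hypothetical helper: apply the enabled methods chained or independently."""
    rng = random.Random(seed)
    results = []
    if sequential:
        # Chain every enabled method: the output of one feeds the next.
        for _ in range(aug_num):
            out = text
            for method in methods:
                out = method(out, rng)
            results.append(out)
    else:
        # Each enabled method produces its own variants from the original text.
        for method in methods:
            for _ in range(aug_num):
                results.append(method(text, rng))
    return results


text = "the quick brown fox jumps"
methods = [delete_random_word, swap_random_char]
chained = augment(text, methods, sequential=True)    # one text, both perturbations
separate = augment(text, methods, sequential=False)  # one text per method
```

Chaining compounds the perturbations in each output, which is one reason to keep the number of enabled methods small when semantics must be preserved.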

Usage

Use this operator when you need variations of English training text, created through controlled perturbations, to improve model robustness to typos, OCR errors, and natural language variation.

Code Reference

Source Location

Signature

@OPERATORS.register_module("nlpaug_en_mapper")
class NlpaugEnMapper(Mapper):
    def __init__(self,
                 sequential: bool = False,
                 aug_num: PositiveInt = 1,
                 keep_original_sample: bool = True,
                 delete_random_word: bool = False,
                 swap_random_word: bool = False,
                 spelling_error_word: bool = False,
                 split_random_word: bool = False,
                 keyboard_error_char: bool = False,
                 ocr_error_char: bool = False,
                 delete_random_char: bool = False,
                 swap_random_char: bool = False,
                 insert_random_char: bool = False,
                 *args, **kwargs):

Import

from data_juicer.ops.mapper.nlpaug_en_mapper import NlpaugEnMapper

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| sequential | bool | No | Whether to combine all methods in a sequence, defaults to False |
| aug_num | PositiveInt | No | Number of augmented samples to generate, defaults to 1 |
| keep_original_sample | bool | No | Whether to keep the original sample, defaults to True |
| delete_random_word | bool | No | Delete random words from text, defaults to False |
| swap_random_word | bool | No | Swap random contiguous words, defaults to False |
| spelling_error_word | bool | No | Simulate spelling errors for words, defaults to False |
| split_random_word | bool | No | Split words randomly with whitespace, defaults to False |
| keyboard_error_char | bool | No | Simulate keyboard errors for characters, defaults to False |
| ocr_error_char | bool | No | Simulate OCR errors for characters, defaults to False |
| delete_random_char | bool | No | Delete random characters from text, defaults to False |
| swap_random_char | bool | No | Swap random contiguous characters, defaults to False |
| insert_random_char | bool | No | Insert random characters into text, defaults to False |
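As a rough illustration of the input contract, the sketch below merges user overrides into the documented defaults and lists which augmentation methods end up enabled. The parameter names and defaults are copied from the signature on this page; `resolve_config` is a hypothetical helper written for this example and is not part of Data-Juicer.

```python
# Parameter names and defaults as listed in the signature on this page.
DEFAULTS = {
    "sequential": False,
    "aug_num": 1,
    "keep_original_sample": True,
    "delete_random_word": False,
    "swap_random_word": False,
    "spelling_error_word": False,
    "split_random_word": False,
    "keyboard_error_char": False,
    "ocr_error_char": False,
    "delete_random_char": False,
    "swap_random_char": False,
    "insert_random_char": False,
}
# Keys that control behavior rather than switch on an augmentation method.
NON_METHOD_KEYS = {"sequential", "aug_num", "keep_original_sample"}


def resolve_config(overrides):
    """Hypothetical helper: merge overrides into defaults and sanity-check them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    aug_num = config["aug_num"]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(aug_num, bool) or not isinstance(aug_num, int) or aug_num < 1:
        raise ValueError("aug_num must be a positive integer")
    enabled = [k for k, v in config.items() if k not in NON_METHOD_KEYS and v]
    return config, enabled


config, enabled = resolve_config(
    {"spelling_error_word": True, "keyboard_error_char": True}
)
# The page recommends enabling only 1-3 methods at a time.
within_recommendation = 1 <= len(enabled) <= 3
```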

Outputs

| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with augmented text entries added |
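The shape of the output can be pictured with a small sketch. It assumes a batched sample is a dict of equal-length lists, that the text field is named `text`, and that the batch holds a single sample; all three are assumptions made for illustration, and `add_augmented` is a hypothetical helper rather than the operator's actual code.

```python
def add_augmented(batch, augmented_texts, keep_original_sample=True, text_key="text"):
    """Hypothetical helper: append augmented texts to a one-sample batch."""
    originals = list(batch[text_key]) if keep_original_sample else []
    texts = originals + list(augmented_texts)
    out = {text_key: texts}
    # Replicate the other fields so every column keeps the same length
    # (assumes the incoming batch contains exactly one sample).
    for key, values in batch.items():
        if key != text_key:
            out[key] = [values[0]] * len(texts)
    return out


batch = {"text": ["hello world"], "meta": [{"src": "demo"}]}
augmented = ["helo world", "hello wrold"]  # made-up augmentation results

with_original = add_augmented(batch, augmented, keep_original_sample=True)
without_original = add_augmented(batch, augmented, keep_original_sample=False)
```

With keep_original_sample enabled, the original text stays first in the list and the augmented entries follow it; with it disabled, only the augmented entries remain.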

Usage Examples

process:
  - nlpaug_en_mapper:
      sequential: false
      aug_num: 1
      keep_original_sample: true
      spelling_error_word: true
      keyboard_error_char: true
