Implementation:Datajuicer Data juicer WhitespaceNormalizationMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for normalizing whitespace characters in text samples.

Description

WhitespaceNormalizationMapper replaces whitespace variants (tabs, newlines, non-breaking spaces, and other Unicode whitespace) with the standard ASCII space (0x20) and trims leading and trailing whitespace from each text sample. It iterates over every character of the text and replaces any character found in the VARIOUS_WHITESPACES constant, a comprehensive list of Unicode whitespace characters, with a standard space. Samples are processed in batched mode for efficiency. The operator is based on the whitespace character list from the bigscience data-preparation project.
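The character-by-character replacement described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the operator's actual source: WHITESPACE_SUBSET is a hypothetical, much shorter stand-in for the real VARIOUS_WHITESPACES constant, normalize_whitespace is an illustrative helper name, and trimming before replacing is an assumption about ordering.

```python
# Hypothetical subset for illustration; the real VARIOUS_WHITESPACES
# constant in Data-Juicer contains more characters than listed here.
WHITESPACE_SUBSET = {" ", "\t", "\n", "\r", "\u00a0", "\u2003", "\u2009", "\u3000"}


def normalize_whitespace(text: str) -> str:
    """Trim the text, then map each listed whitespace character to ' '."""
    # Assumption: leading/trailing whitespace is removed first.
    text = text.strip()
    # One-for-one replacement: every listed character becomes a single space.
    return "".join(" " if ch in WHITESPACE_SUBSET else ch for ch in text)


raw = "\t hello\u00a0world\nfoo \u3000"
print(repr(normalize_whitespace(raw)))  # 'hello world foo'
```

Because the replacement in this sketch is one-for-one, runs of whitespace are not collapsed: a double tab becomes two spaces rather than one.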

Usage

Use when you need to ensure consistent whitespace encoding across training data, preventing tokenization issues caused by non-standard whitespace characters such as tabs, newlines, and Unicode whitespace variants.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/whitespace_normalization_mapper.py

Signature

@OPERATORS.register_module("whitespace_normalization_mapper")
class WhitespaceNormalizationMapper(Mapper):
    def __init__(self, *args, **kwargs):

Import

from data_juicer.ops.mapper.whitespace_normalization_mapper import WhitespaceNormalizationMapper

I/O Contract

Inputs

Name	Type	Required	Description
sample[text_key]	str	Yes	The text content to normalize whitespace in

Outputs

Name	Type	Description
sample[text_key]	str	Text with all whitespace characters normalized to standard spaces and leading/trailing whitespace trimmed
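To make the I/O contract concrete, the sketch below applies the same transformation to a batch of samples. It assumes a batch is a dictionary mapping column names to lists of values and that the text column is named "text"; normalize_batch and SPACE_LIKE are hypothetical names used only for illustration and are not part of the Data-Juicer API.

```python
from typing import Dict, List

# Hypothetical subset standing in for the real whitespace list.
SPACE_LIKE = {"\t", "\n", "\r", "\u00a0", "\u3000"}


def normalize_batch(samples: Dict[str, List[str]], text_key: str = "text") -> Dict[str, List[str]]:
    """Rewrite samples[text_key] in place; other columns are left untouched."""
    for idx, text in enumerate(samples[text_key]):
        text = text.strip()  # trim leading/trailing whitespace
        samples[text_key][idx] = "".join(
            " " if ch in SPACE_LIKE else ch for ch in text
        )
    return samples


batch = {"text": ["  a\tb  ", "c\u00a0d\n"], "meta": ["x", "y"]}
out = normalize_batch(batch)
print(out["text"])  # ['a b', 'c d']
```

Only the column selected by text_key changes; each output string has the same type as its input, matching the Inputs and Outputs tables above.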

Usage Examples

process:
  - whitespace_normalization_mapper:

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment