Implementation:Datajuicer Data juicer WhitespaceNormalizationMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for normalizing whitespace characters in text samples.

Description

WhitespaceNormalizationMapper normalizes all types of whitespace characters (tabs, newlines, non-breaking spaces, and other Unicode whitespace) to the standard ASCII space (0x20) and trims leading and trailing whitespace from each text sample. It iterates over the characters of the text and replaces every character found in the VARIOUS_WHITESPACES constant, a comprehensive list of Unicode whitespace characters, with a standard space. Samples are processed in batched mode for efficiency. The operator builds on the whitespace character list from the bigscience data-preparation project.
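The per-character replacement described above can be sketched in a few lines of standalone Python. This is a minimal illustration, not the operator's source: SKETCH_WHITESPACES is a hypothetical, deliberately small stand-in for VARIOUS_WHITESPACES, normalize_whitespace is a made-up helper name, and trimming before replacing is an assumption about ordering.

```python
# Hypothetical subset of whitespace characters, for illustration only.
# The real operator reads its list from Data-Juicer's VARIOUS_WHITESPACES.
SKETCH_WHITESPACES = {
    "\t",      # character tabulation
    "\n",      # line feed
    "\u00a0",  # no-break space
    "\u2002",  # en space
    "\u2003",  # em space
    "\u3000",  # ideographic space
}


def normalize_whitespace(text: str) -> str:
    """Trim the text, then map each listed whitespace character to ' '."""
    text = text.strip()  # assumed order: trim first, then replace
    return "".join(" " if ch in SKETCH_WHITESPACES else ch for ch in text)


print(repr(normalize_whitespace("\u3000Hello\tworld\u00a0again\n")))
```

Because the replacement works one character at a time, a run of whitespace characters in this sketch becomes a run of spaces of the same length rather than collapsing into a single space.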

Usage

Use when you need to ensure consistent whitespace encoding across training data, preventing tokenization issues caused by non-standard whitespace characters such as tabs, newlines, and Unicode whitespace variants.
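To make the tokenization concern concrete, the snippet below uses a plain split on the ASCII space as a stand-in for a tokenizer. Real tokenizers behave differently, so treat this only as a hedged illustration of how a no-break space or a tab can hide word boundaries until it is mapped to 0x20.

```python
text = "deep\u00a0learning\tmodels"

# Splitting on the ASCII space alone misses the no-break space and the tab,
# so the whole string comes back as a single piece.
naive_tokens = text.split(" ")

# Mapping both characters to U+0020 first exposes the word boundaries.
normalized = text.replace("\u00a0", " ").replace("\t", " ")
normalized_tokens = normalized.split(" ")

print(len(naive_tokens), len(normalized_tokens))
```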

Code Reference

Source Location

Signature

@OPERATORS.register_module("whitespace_normalization_mapper")
class WhitespaceNormalizationMapper(Mapper):
    def __init__(self, *args, **kwargs):

Import

from data_juicer.ops.mapper.whitespace_normalization_mapper import WhitespaceNormalizationMapper

I/O Contract

Inputs

Name | Type | Required | Description
sample[text_key] | str | Yes | The text content to normalize whitespace in

Outputs

Name | Type | Description
sample[text_key] | str | Text with all whitespace characters normalized to standard spaces and leading/trailing whitespace trimmed
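The contract above can be exercised on a whole batch at once. The sketch below assumes a batch is a dictionary mapping the text key to a list of strings and that the text key is "text"; both are assumptions for illustration, as are the names TEXT_KEY, SKETCH_WHITESPACES, and process_batch, none of which come from Data-Juicer.

```python
from typing import Dict, List

TEXT_KEY = "text"  # assumed text key, hypothetical for this sketch
# Hypothetical small stand-in for the real whitespace list.
SKETCH_WHITESPACES = {"\t", "\n", "\u00a0", "\u2003"}


def process_batch(samples: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Normalize every text in the batch and write it back under the same key."""
    samples[TEXT_KEY] = [
        "".join(" " if ch in SKETCH_WHITESPACES else ch for ch in text.strip())
        for text in samples[TEXT_KEY]
    ]
    return samples


batch = {"text": ["  first\tsample ", "second\u00a0sample\n"]}
result = process_batch(batch)
print(result["text"])
```

The input and output share the same key and the same type, so the step can be dropped into a pipeline without changing the shape of the samples.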

Usage Examples

process:
  - whitespace_normalization_mapper:  # the signature lists no operator-specific parameters
